Summary: This post discusses the annoying “Synthetic Sensor Input” (SSI) problem, i.e. the fact that it is very hard to synthesize realistic, synchronized streams of sensor inputs (e.g. Video+LiDAR+Radar). It explains why the SSI problem is a pain for Autonomous Vehicles verification (and for other things), and talks about the (imperfect) solutions.
Let me start by saying that I am not an expert on this problem. In fact, I wish it would go away, so I could get back to full-system, scenario-based, coverage-driven AV verification (as described elsewhere in my blog). But it refuses to go away, so I had to research it and think about it. Here is what I came up with – comments are very welcome.
The problem can be described as follows: Even if you have a good random-scenario-creation system which can produce scenarios like “at time X there is a car going in direction Y at velocity Z, and a person crossing, and so on”, it is very hard to transform that into realistic, synthetic, synchronized input streams (e.g. Video+LiDAR+Radar). And most AVs nowadays use these sensors (and sometimes Sonar). LiDAR, as you probably know, is the Laser Radar which creates a three-dimensional picture of the world – it is that turning thing on the top of most AVs.
This may sound like a low-level, unimportant technical problem, but unfortunately it has fairly large implications. To understand why, consider figure 1 below:
Part (a) shows how CDV-based AV verification should work: The Verification Environment (VE) executes scenarios, feeds inputs into the DUT and observes its reactions, producing coverage information and (potentially) error messages. This whole thing can execute in one of several execution platforms (SW in the loop, HW in the loop, test track etc. – more on this below). The only problem is the small green module, as we’ll see when we zoom in:
Part (b) shows one possible detailed view of the VE+DUT (ignore the pink “Shortcut” blocks for now). As we can see, the scenario execution module produces objects (other cars, people, cats etc.) with positions, velocities and so on. The green “objects to sensor inputs” module translates those into sensor inputs (say Video, LiDAR and Radar streams), which get sent to the DUT. In the DUT, a combination of actual sensors and a sensor fusion module (often ML-based) produce a bunch of estimated objects (i.e. the AV’s view of the actual objects), and the rest of the AV logic takes it from there.
A neat picture, except that nobody knows how to do a good “objects to sensor inputs” module, which will make the sensors “see” the objects exactly as they would be seen in real life. So this pesky problem (which, for lack of an official name, I call Synthetic Sensor Inputs or SSI) is a major stumbling block for good AV verification.
While this post is about AV verification, SSI is a general problem for verifying Intelligent Autonomous Systems: Autonomous robots, drones etc. all take similar sensor inputs and thus have similar issues. And (as I’ll discuss in the last chapter) this problem goes even beyond IAS verification.
So we clearly need solutions:
So what do people do? I mean, they still need to verify their AVs somehow (and re-verify them for every SW release). There are several solutions (which tells you right away that there is no single good solution).
The main options for handling the SSI problem (details below) are:
- Use an object shortcut
- Use synthetic inputs nevertheless
- Use recorded data
- Use a 3D model
- Use ML techniques
- Use actual objects on test tracks or city streets
Obviously, not all options apply to all execution platforms (e.g. the last option applies to test tracks and street driving, but not to simulation). I’ll discuss execution platforms in detail in the next chapter, but I’ll mention one point right now: If the sensors themselves are also simulated (e.g. in the SW-in-the-Loop platform), then there is the added difficulty of simulating them correctly, and this interacts with the SSI problem.
Here are the options in full, gory detail:
Use an object shortcut: This option is indicated by the pink “object shortcut” block in figure 1. The idea is to skip the sensor-inputs, sensors and sensor-fusion modules, and simply set the estimated-objects to be the original objects, slightly transformed by some “noise” function.
The problem with this method is that it ignores the sensors and sensor fusion modules – often the modules you worry about the most. Nevertheless, you can still do a lot of verification this way, testing the planner and the rest of the car. Also, remember that the scenario execution module produces other inputs (in addition to the objects sent to the sensors): Human-grabbing-control events, component-failure events, GPS-signal-lost events, car-to-X communications events and so on.
You can also simulate various worst-case scenarios by playing with the noise function: Increase the (normally Gaussian) noise, consider other kinds of noise (e.g. what happens if mud sticks to the sensor), etc.
“Object shortcut” is simply the classical trick of “stubbing out” some modules in full-system tests. This is considered a good start (assuming you also test the modules separately), but not enough.
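To make the object shortcut concrete, here is a minimal sketch of the idea. The class name, fields and noise parameters are all my own invention (real systems track many more attributes); the point is just the structure: ground-truth objects in, noisy “estimated objects” out, with the noise function as the knob for worst-case experiments like the mud-on-the-sensor case above.

```python
import random
from dataclasses import dataclass

@dataclass
class TrackedObject:
    kind: str      # "car", "person", "cat", ...
    x: float       # position (meters)
    y: float
    vx: float      # velocity (m/s)
    vy: float

def object_shortcut(ground_truth, pos_sigma=0.3, drop_prob=0.0, rng=None):
    """Bypass sensors + sensor fusion: perturb ground-truth objects into
    'estimated objects'. pos_sigma models Gaussian position noise;
    drop_prob crudely models a degraded sensor (e.g. mud) that makes
    the fusion stack miss objects entirely."""
    rng = rng or random.Random(0)
    estimated = []
    for obj in ground_truth:
        if rng.random() < drop_prob:
            continue  # object not "seen" at all
        estimated.append(TrackedObject(
            kind=obj.kind,
            x=obj.x + rng.gauss(0, pos_sigma),
            y=obj.y + rng.gauss(0, pos_sigma),
            vx=obj.vx, vy=obj.vy))
    return estimated

scene = [TrackedObject("car", 20.0, 0.0, -5.0, 0.0),
         TrackedObject("person", 8.0, 2.5, 0.0, -1.0)]
normal = object_shortcut(scene)                               # mild Gaussian noise
worst_case = object_shortcut(scene, pos_sigma=2.0, drop_prob=0.5)
```

The rest of the AV stack then consumes `normal` (or `worst_case`) exactly as it would consume the output of the real sensor fusion module.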
Use synthetic inputs: Modern video games often produce completely synthetic (invented) graphics which look pretty good, so you would think it should be easy to script a video game engine (e.g. Unity) to display the AV scenarios you want.
This works, but has several problems: Game engines don’t have facilities to create LiDAR and Radar streams. Also, “realism” means something different for AV verification: Game graphics only need to look right to a human eye, while sensor inputs need to match what the physical sensors would actually measure.
Note, however, that the huge sums currently invested in VR / AR (by Unity and everybody else) mean that game engines will constantly improve at combining recorded data and synthetic stuff – see the next two options.
Use recorded data: Rather than trying to synthesize, you can simply record the actual input streams from an actual car, and replay them against a simulation. I talked about using recorded data in previous posts.
By definition, recorded data is pretty realistic. On the other hand, it has lots of limitations:
- You can only replay what you recorded (with some minor modifications – see below). If you never recorded a hilly drive, there will be no hilly drive.
- If the recording car turned left but the DUT decided to turn right (remember – AVs behave probabilistically), you are stuck
- Recorded data will probably not have too many dangerous corner cases (because it was recorded during a live drive)
- Data may need to be re-recorded for different sensor configurations: Ford now has two LiDARs just above the side-view mirrors, and the new Waymo (ex-Google) car now has three different kinds of LiDAR
- You need to label recorded data (“This is a left turn”, “This is a junction”) before you can use it in scenarios
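The last limitation above (labeling) implies some kind of searchable index over the recordings, so a scenario can ask for “a left turn at a junction”. Here is a toy sketch of what such a lookup might look like; the file names, label vocabulary and query API are all hypothetical.

```python
# Hypothetical label index over recorded segments. In practice labeling
# is a major (often manual or ML-assisted) effort; this sketch only
# shows the lookup a scenario engine would do once labels exist.
recordings = [
    {"file": "drive_001.bag", "start_s": 0,  "end_s": 42, "labels": {"left_turn", "junction"}},
    {"file": "drive_001.bag", "start_s": 42, "end_s": 90, "labels": {"straight", "highway"}},
    {"file": "drive_007.bag", "start_s": 10, "end_s": 55, "labels": {"right_turn", "junction", "rain"}},
]

def find_segments(required, forbidden=frozenset()):
    """Return recorded segments whose labels cover 'required'
    and avoid 'forbidden'."""
    return [seg for seg in recordings
            if required <= seg["labels"] and not (forbidden & seg["labels"])]

hits = find_segments({"junction"}, forbidden={"rain"})
```

Note how the limitations listed above show up directly: if no recorded segment carries the requested labels, `find_segments` simply comes back empty, and there is nothing to replay.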
What kind of minor modifications are possible? It seems you can insert e.g. 3D people into a recording of a street. Here is a description of how it is done for Video, taken from this paper:
The 6-DoF motion of the camera and the surrounding 3D scene is reconstructed from an image sequence of real recorded images. Virtual pedestrians are then animated in the reconstructed scene. Augmented images, combining the real image background with the added virtual agents, are generated using a photo-realistic rendering engine, including light simulations.
With significant effort, you can also do other modifications: This research paper talks about “shifting” the sensor streams so as to make the DUT “closer” to other vehicles. I don’t know how far you can push those modifications (e.g. whether you can stitch together recorded sequences), but I guess this gets progressively harder. Which brings us to:
Use a 3D model: This option is a middle ground between recorded and synthetic data (in fact, some people might call it “synthetic done right”).
The idea is as follows: As I said above, if you want to modify recordings, you need to reconstruct the 3D scene from them. Taken to its logical conclusion, why not create full, 3D models of entire areas / cities (based on recordings)? Combine that with 3D models of cars and people, and you can create arbitrary 3D scenarios (restricted by the set of areas you have recorded and 3D modeled).
You can probably replay these 3D scenarios in real time, while creating “reasonable quality” synchronized input streams for the various sensors. Those streams are not perfect (e.g. Radar reflections are somewhat “flat”, and there are other issues), so they are not good enough for true testing of corner cases of the sensors, but they are hopefully good enough for testing corner cases of the whole system.
BTW, here is one unexpected (for me) issue 3D-model-based-SSI needs to solve: To make it all realistic, 3D models add random “patterns” to walls and streets. However, some navigation techniques (e.g. SLAM) depend on the imperfections in the street remaining similar on subsequent visits to the same street, so now the random pattern generator has to ensure that.
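One simple way to meet that requirement is to derive the “random” pattern deterministically from the identity of the surface, rather than from a global random stream. The sketch below (names and structure are mine) seeds a per-tile generator from a hash of the tile’s identity, so the same wall shows the same imperfections on every visit:

```python
import hashlib
import random

def wall_pattern(street_id, tile_index, n_points=5):
    """Pseudo-random surface 'imperfections' for one wall tile.
    Seeding from the tile's identity (instead of a shared global RNG)
    makes the pattern identical on every visit, so SLAM-style
    localization can re-recognize it."""
    seed = int.from_bytes(
        hashlib.sha256(f"{street_id}:{tile_index}".encode()).digest()[:8],
        "big")
    rng = random.Random(seed)
    return [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(n_points)]

first_visit = wall_pattern("elm_street", 17)
second_visit = wall_pattern("elm_street", 17)   # identical to first_visit
other_tile = wall_pattern("elm_street", 18)     # different pattern
```

This is the same trick used for reproducible procedural content in games: “random” but repeatable, keyed by location.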
For both the true-recorded and 3D-model techniques, you may wonder how the input streams are actually fed into the DUT. There seem to be multiple ways of doing that: Simulation platforms are more flexible, but actual HW sensors present problems. For instance, in a HW-in-the-loop platform which includes the actual Video camera, you may be able to surround the DUT with a big screen on which you display the input. This is much harder for an actual LiDAR: You’ll need to construct a set of LEDs for sending the recorded return beam with exactly the right timing – a pretty tough job.
So quite often, when people talk about playing those input streams, they actually mean something like the “data shortcut” block in figure 1: For each sensor, skip the “front end”, and go directly to “pixel data” (with some noise added). Note that I use the term “pixel data” loosely: e.g. for LiDAR, “pixel data” is really depth information per scanned point. Note also that, unlike the “object shortcut”, this “data shortcut” keeps some of the sensor logic, and all of the sensor fusion logic.
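To illustrate what LiDAR “pixel data” means, here is a toy 2D version: one noisy range reading per scanned beam, computed against circular obstacles. This is a deliberately crude stand-in (real LiDAR models involve beam divergence, reflectivity, multiple returns etc.); the obstacle representation and parameters are mine.

```python
import math
import random

def lidar_scan(obstacles, n_beams=360, max_range=100.0, range_sigma=0.02,
               rng=None):
    """Toy 2D LiDAR 'pixel data': one range per beam angle, from a
    sensor at the origin, against circular obstacles (cx, cy, r).
    Gaussian range noise stands in for the data-shortcut noise."""
    rng = rng or random.Random(0)
    ranges = []
    for i in range(n_beams):
        theta = 2 * math.pi * i / n_beams
        dx, dy = math.cos(theta), math.sin(theta)
        best = max_range
        for (cx, cy, r) in obstacles:
            # Ray-circle intersection: solve |t*d - c|^2 = r^2 for t
            b = dx * cx + dy * cy                      # center projected on ray
            disc = b * b - (cx * cx + cy * cy - r * r)
            if disc >= 0:
                t = b - math.sqrt(disc)                # nearer intersection
                if 0 < t < best:
                    best = t
        ranges.append(best + rng.gauss(0, range_sigma))
    return ranges

# One obstacle of radius 1m, 10m straight ahead: beam 0 should read ~9m,
# and beams pointing away should read ~max_range.
scan = lidar_scan([(10.0, 0.0, 1.0)])
```

The data shortcut would inject something like `scan` (per sensor, per frame) directly behind the sensor front end, leaving the rest of the sensor logic and all of the fusion logic exercised.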
There are a bunch of companies creating very accurate 3D models from LiDAR/Video data (GeoSim and rFpro are two examples – I think rFpro will also sell you “plain” recorded streams). Some people are hoping to create (lower-quality) 3D models of “any place on earth” using e.g. Google Street View.
Use ML techniques: Machine Learning is everywhere nowadays. For instance, sensor fusion modules are often ML-based, so the people who create them are (unsurprisingly) tempted to hit the SSI nail with the hammer they know best. But how?
I talked about this in the post Using ML to verify ML. Here is what I said about GANs:
One idea for making scenario generation easier is to use Generative Adversarial Networks (GANs). A GAN works as follows: Suppose you already have an ML classifier C (say a classifier from an image to the label “cat”). You now create an ML generator G, whose job is to produce images which can fool C into classifying them as a cat. Over time, C improves (i.e. gets better at distinguishing cats) and G improves (i.e. gets better at faking cats).
People have done amazing things with GANs. For instance, this paper combined a GAN with a Recurrent Neural Network (RNN) to create a system which generates images (e.g. of flowers) based on a text description (e.g. “a flower with long pink petals and raised orange stamen”).
GANs can be controlled by a text description (as above), by attribute-value pairs (this is the original, “Conditional GAN”) or even by a person sketching the overall picture, describing its parts via attributes, and letting a special kind of GAN (an “AL-CGAN”) create the detailed, composite image, as described in this paper.
These techniques still have a long way to go: They mainly create single frames (not streams, and certainly not multiple, synchronized streams). But that’s not going to deter the ML aficionados.
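For readers who want to see the adversarial loop itself, here is a deliberately tiny 1-D sketch, nothing like the deep image GANs above: the “real data” is a Gaussian, the generator just shifts noise by a learned offset, and the discriminator is a single logistic unit, with gradients derived by hand. All names and hyper-parameters are mine.

```python
import math
import random

# Toy GAN: real data ~ N(4, 1), noise z ~ N(0, 1),
# generator G(z) = z + theta, discriminator D(x) = sigmoid(w*x + b).
# A real GAN would use deep networks and autograd; the alternating
# ascent below is the same idea in miniature.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

rng = random.Random(0)
theta, w, b = 0.0, 0.0, 0.0      # generator / discriminator parameters
lr, batch = 0.05, 64

for _ in range(2000):
    real = [rng.gauss(4, 1) for _ in range(batch)]
    fake = [rng.gauss(0, 1) + theta for _ in range(batch)]

    # Discriminator step: ascend log D(real) + log(1 - D(fake))
    gw = gb = 0.0
    for x in real:
        d = sigmoid(w * x + b)
        gw += (1 - d) * x
        gb += (1 - d)
    for g in fake:
        d = sigmoid(w * g + b)
        gw -= d * g
        gb -= d
    w += lr * gw / batch
    b += lr * gb / batch

    # Generator step: ascend log D(fake) (the "non-saturating" loss);
    # d/dtheta log D(z + theta) = (1 - D) * w
    gt = sum((1 - sigmoid(w * g + b)) * w for g in fake)
    theta += lr * gt / batch

# After training, theta should have drifted from 0 toward the
# real-data mean, i.e. the generator learned to imitate the data.
```

Even this toy version shows the characteristic GAN dynamic: the discriminator’s gradient is exactly what pushes the generator toward the real distribution, and as the two converge the gradient signal fades.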
Use actual objects: In some execution platforms (e.g. automated test tracks) the SSI problem is essentially solved: There are actual, physical cars there (and actual, physical human puppets dragged by wires etc.).
Automated test tracks can be controlled by a scenario execution engine, though the control is somewhat imprecise, and the available objects are limited. Also, human / animal dolls are not completely realistic (though this is constantly improving).
Driving in real city streets involves completely-real objects. It (mostly) cannot be scripted, but you can record it and post-process the result to extract scenario coverage.
The various execution platforms
People use several different execution platforms for AV verification. Here is a typical list (glossing over some details):
1. Model in the loop: Uses a high-level model of the VE+DUT; can find conceptual bugs before SW is written
2. SW in the loop (SIL): Uses the actual AV SW in a simulated framework; most bugs are found here, but sensor modeling is an added problem
3. HW in the loop (HIL): Uses some of the actual HW boxes in a simulated framework; can use the actual sensors if needed
4. Stationary vehicle: Uses a real vehicle in a setup where it “sees” projected inputs and the wheels spin in place
5. Automated test track: With other cars and human puppets moving on command; can be controlled by a scenario, but with limitations
6. Street driving: The real thing, driving in actual city streets; cannot control scenarios, but can collect scenario coverage
In general, higher-numbered execution platforms are more accurate. On the other hand, they are also less controllable, harder to debug, more expensive (and thus there are fewer copies of each), and tend to appear later in the development cycle. Platforms 3 through 6 need real-time inputs.
SSI-wise, platforms 5 and 6 use actual objects. All others can use a combination of object shortcut, synthetic inputs, recorded data, 3D models and ML techniques.
Note that most bugs are found in the SIL “virtual” simulations. This was confirmed by a recent interview with Waymo’s CEO, in which he said that they learn more from the roughly 1B miles they drive virtually every year (with an emphasis on corner cases) than from the actual street driving they do.
Because each platform has pros and cons, people tend to use most of them for a balanced verification project. However, the total number of “configurations” (execution platforms * applicable SSI options) is quite large. One of the goals of Foretellix’ System Verification Framework (SVF) is to simplify and unify this task.
Specifically, for each scenario S and configuration C, SVF should:
- Say if S can run in C, cannot run in C, or can run with a subset of the parameter values
- When S can run, adapt it to C’s restrictions
- After the runs, project the combined coverage results on a verification plan
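A toy sketch of the first check in the list above might look as follows. To be clear, this is my own guess at the shape of the problem, not SVF’s actual design; the capability sets and names are hypothetical.

```python
# Hypothetical scenario-vs-configuration check: which SSI options does
# each execution platform support, and can a given scenario run there?
PLATFORM_CAPS = {
    "SIL":        {"object_shortcut", "synthetic_inputs", "recorded_data", "3d_model"},
    "HIL":        {"recorded_data", "data_shortcut", "3d_model"},
    "test_track": {"actual_objects"},
}

def check_scenario(needed_ssi_options, config):
    """Classify a scenario against a configuration: fully runnable,
    runnable for only a subset of its parameterizations, or not
    runnable at all."""
    caps = PLATFORM_CAPS[config]
    usable = needed_ssi_options & caps
    if usable == needed_ssi_options:
        return ("runs", usable)
    if usable:
        return ("runs_with_subset", usable)
    return ("cannot_run", set())

status, opts = check_scenario({"recorded_data", "3d_model"}, "SIL")
```

The second and third items in the list (adapting S to C’s restrictions, and projecting combined coverage onto a verification plan) would build on top of this kind of classification.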
SVF is a general scenario-driven, coverage-based verification framework, and thus will do much more than this (e.g. see the last chapter of this post). But the simplify-working-with-multiple-configurations part seems increasingly important.
Beyond IAS verification
The SSI problem is annoying not just in the context of AV (and general IAS) verification. It is also an issue when we try to verify any system which takes detailed inputs from the real world.
Say your DUT is a system for diagnosing various heart conditions based on MRI scans. How do you verify that it works well? You probably have a limited number of scans of individuals with and without the conditions. It would be really nice if you could also verify your DUT against many synthetic scans corresponding to various borderline situations, but it is hard to create realistic, synthetic scans.
Note that in a sense we have a lot of experience with the SSI problem, because it also applies to simulators used for training humans. As the old saying goes, pilots are the original intelligent autonomous systems (yes, I just invented that old saying). But the problem is much bigger in verification, because serious verification needs a huge number of examples to do good coverage of all the interesting corner cases.
BTW, if you are coming from chip verification, you may be familiar with a smaller variant of SSI: When verifying graphics chips, we normally use existing graphic input files (and use constraints to just select among them, based on user-specified per-file attribute values).
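That graphics-chip trick (constrained-random selection among pre-existing files, rather than synthesizing inputs) can be sketched as below. The file attributes are made up for illustration:

```python
import random

# Hypothetical per-file attributes; the idea is that constraints only
# *select* among existing input files, instead of generating new ones.
input_files = [
    {"name": "frame_a.bin", "resolution": "4k", "alpha": True,  "triangles": 2_000_000},
    {"name": "frame_b.bin", "resolution": "hd", "alpha": False, "triangles": 50_000},
    {"name": "frame_c.bin", "resolution": "4k", "alpha": False, "triangles": 900_000},
]

def pick_input(constraint, rng=None):
    """Constrained-random pick: filter the file list by a predicate
    over its attributes, then choose uniformly among the survivors."""
    rng = rng or random.Random(0)
    candidates = [f for f in input_files if constraint(f)]
    return rng.choice(candidates) if candidates else None

chosen = pick_input(lambda f: f["resolution"] == "4k" and f["triangles"] > 1_000_000)
```

This sidesteps SSI entirely, at the cost of only ever exercising inputs someone already has on disk, which is exactly the limitation recorded AV data shares.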
Finally, SSI may be an even bigger issue when using a verification environment to train ML-based systems. I talked about this before (e.g. in this post) and said:
Another problem, unique for training-via-synthetic-inputs, is that this may cause overfitting to some artifacts of the scenario generation algorithm or the display engine. For instance, suppose we use this train-via-VE technique just to train the system on extreme and dangerous cases. If the display engine makes the sky too uniformly blue, the system could learn to be extra-careful only when the sky looks like that.
Training ML-based systems may thus be a really tough case. On the one hand, you need really-realistic (i.e. perhaps recorded) inputs for training, to avoid the kind of issues mentioned above. On the other hand, you really need lots of corner cases (including dangerous ones) to make sure the system always behaves safely, so recorded inputs will probably not be enough. Another issue is that too many “danger cases” may skew the ML’s statistics.
Because of these problems, I see a tendency to move some of the safety handling to outside the ML-system proper (e.g. see the chapter “Shield synthesis and ML safety” in my HVC report). Of course, the combined system (ML + shield, or whatever) still needs to be verified, so the SSI problem for verification is not going away.
To summarize:
- The SSI problem is a real pain for verification
- There is no single, good solution
- Creating a common, uniform framework encompassing all solutions would lower the pain
A good introduction to AV / ADAS verification is Computational Verification Methods for Automotive Safety Systems by Jonas Nilsson of Volvo (long, best viewed in Acrobat Reader).
I’d like to thank Gil Amid, Benny Maytal, Sandy Hefftz and Thomas (Blake) French for providing feedback on earlier drafts of this post.