Summary: This is part one of my report about what I saw at the Stuttgart 2017 Autonomous Vehicle Test & Development Symposium last week.
This yearly symposium seems to be a pretty good place to get a feel for what's going on in AV verification (at least in Europe): There are several AV-related conferences, but most devote little time to verification.
So here is part one of my report (the rest will follow soon). See also my 2015 and 2016 Stuttgart reports. As always, much of the interesting stuff was in the Q&A part or in corridor discussions. Also, for better or worse, I am still an outsider to AVs (I come from the verification side of the family). Finally, this is my own, subjective report.
Here are the main changes I noticed (to be covered in detail in part two of the report). None should come as a shock – these are just signs that AV verification is maturing:
- Simulation is now the accepted mode for finding most bugs, by just about everyone
- Everybody talks about scenarios, scenario libraries, and running lots of random combinations
- Very initial “frameworks” for handling simulations, scenarios and execution platforms are starting to appear
- Sensor simulators and sensor modeling are improving and getting a lot of attention
- There is more work on (semi-) automated analysis / labeling of recorded traffic, for both ML training and interesting-scenario extraction
This post (part one of the report) will cover the following topics:
- There is a big (but strangely un-discussed) difference between people who look for “expected bugs” and people who look for “unexpected bugs”
- Constrained-random Coverage Driven Verification (CDV) is still the exception, not the rule
Expected vs. unexpected bugs
Hardi Hungar of the German Aerospace Center (DLR) gave a comprehensive presentation titled “Test specifications for highly automated driving function: highway pilot”. He introduced parametrized scenarios and how they should be checked – more on this in part two of this report.
His example was the “cut-in” scenario, where a vehicle cuts into our AV’s lane just ahead of it. He explained how it should be simulated many times, with random values for the various parameters (like relative distances and speeds). He suggested using the expected distributions for these parameters, so as to compute the total risk profile (see e.g. the image below):
I liked Hardi's presentation, but I was left with one nagging question: Why did he suggest using the expected distributions for the various scenario parameters? Surely for bug-finding you need bug-finding distributions (ones which emphasize corner cases and so on).
So I asked him (and we had some further discussions). His answer makes sense, but also opens up new questions. Here is what he said (paraphrased): His presentation was about the needs of regulators, and they are interested in exploring “expected bugs” (so they can e.g. plug their severity and probability into ISO 26262 risk formulas as in the picture above). Regulators assume that AV manufacturers will deal with the “unexpected bugs”.
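To make the "regulator-style" flow concrete, here is a minimal sketch of sampling a parametrized cut-in scenario from its expected distributions and accumulating a risk estimate. Everything here is my own invention for illustration – the parameter names, the distributions, and especially the toy failure model stand in for a real simulator:

```python
import random

def simulate_cut_in(gap_m, rel_speed_mps):
    """Toy stand-in for a full simulation run: returns True if the AV
    fails to handle the cut-in for these parameter values. (Invented
    rule: the AV needs ~0.5 s of gap per m/s of closing speed.)"""
    return gap_m < 0.5 * rel_speed_mps

def estimate_risk(runs=100_000, seed=0):
    rng = random.Random(seed)
    failures = 0
    for _ in range(runs):
        # Expected (real-world) distributions -- illustrative numbers only.
        gap_m = max(0.0, rng.gauss(15.0, 6.0))           # cut-in gap, meters
        rel_speed_mps = max(0.0, rng.gauss(3.0, 2.0))    # closing speed, m/s
        if simulate_cut_in(gap_m, rel_speed_mps):
            failures += 1
    return failures / runs

print(f"estimated failure probability: {estimate_risk():.4f}")
```

Because the parameters follow their expected distributions, the resulting failure frequency can be plugged directly into a risk formula – which is exactly why this style of sampling suits regulators, and why it under-samples the corner cases a bug-hunter would aim for.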
Let me first clarify what I mean by these terms via examples:
Expected bugs: Suppose your AV did not stop fast enough for a cyclist, because of some combination of: vision system too weak, sensor fusion too slow, braking not efficient enough and so on. This is an expected bug: You knew all of these could happen in principle – the bug was that under some circumstances the combined performance was worse than what you signed up for. And to fix it you will have to equip your AV with a better camera, or something.
Note that “expected bugs” is a slight misnomer: The fact that the bug happened in a specific design under some specific conditions was not itself expected – there was just an ongoing suspicion that it might happen, because of the underlying, known issues (like “sensor fusion too slow”). Perhaps “bugs resulting from combinations of catalogued, suspected issues” would have been a better name, but I took pity on you, gentle reader.
Unexpected bugs: Suppose your AV did not stop fast enough for a cyclist, and after some debugging it turned out that this was caused by a SW bug which only happens if the cyclist appears within the first few seconds after switching from manual driving to AV driving. Or if the cyclist appears while receiving an urgent vehicle-to-vehicle message. Or your sensor fusion ML system was never trained on cases where there is a cyclist approaching while there is traffic below (say under the bridge you are currently travelling on). This is an unexpected bug (until it gets discovered, usually by testing edge cases). Spec bugs (e.g. you just never thought of a Tsunami approaching the AV – see this post) are also unexpected bugs.
This difference between expected and unexpected bugs seems fairly fundamental, and yet I don’t see much discussion of it. Here are some further observations about expected bugs:
- Over time new categories of expected bugs become catalogued, and regulators need to create scenarios for catching them. For instance, consider this presentation (pdf) about the catalog of around 1500 possible vision-system issues and how to test for them: Ideally, the catalog should be codified (into scenarios, coverage points and checks), so regulators could add it to their “official” list of expected bugs.
- For any such category of expected bugs, there should be an efficient process of scanning the relevant space looking for those bugs. Hardi clarified that this may imply using a denser search grid, and various other tricks.
- Expected bugs often involve continuous (though not necessarily monotonic) functions
- There are some gray areas between expected and unexpected bugs. And unexpected bugs, too, sometimes cluster together in a small area of the design space. Also, tests meant for finding expected bugs may also find some unexpected bugs (though they are not efficient at that).
- There is also significant overlap between the efficient techniques for catching expected and unexpected bugs (SVF, the System Verification Framework which Foretellix is building, should be good for both).
I blogged about this issue before. For instance, I mentioned here Zhao's PhD thesis, which describes various ways (Monte-Carlo simulation and Importance Sampling) of finding expected bugs. Zhao freely admits his techniques are not good for finding e.g. SW bugs – here is his list of what his techniques are good for (all classical expected bugs, in my terminology):
i) Challenge in sensing/detection (e.g., fog, snow, low light)
ii) Challenge in perception (e.g., hand gesture, eye contact, blinking lights)
iii) Aggression of surrounding vehicles/pedestrians/pedal-cyclists (e.g., running red light, cut-in, jaywalk)
iv) Challenge in making decisions (e.g., low confidence, multiple threats)
v) Challenge due to lower (than normal) control authorities (e.g., slippery roads, heavy vehicle load)
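As a toy illustration of the Monte-Carlo-plus-importance-sampling idea (the scenario, the distributions and the failure threshold below are invented for this sketch, not taken from Zhao's models): instead of sampling from the expected distribution and waiting for rare failures, you sample from a proposal distribution shifted toward the dangerous region, and reweight each failing sample by the likelihood ratio so the estimate still reflects real-world frequencies:

```python
import math
import random

def is_failure(gap_m):
    # Toy failure condition: a cut-in gap below 2 m counts as a crash.
    return gap_m < 2.0

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def naive_mc(runs, rng):
    # Plain Monte-Carlo: sample the gap from its (invented) expected
    # real-world distribution, N(15, 5) meters, and count failures.
    hits = sum(is_failure(rng.gauss(15.0, 5.0)) for _ in range(runs))
    return hits / runs

def importance_sampling(runs, rng):
    # Sample from a proposal q = N(0, 5) shifted toward the dangerous
    # region, then reweight by p(x)/q(x) so the estimator stays
    # unbiased for the expected distribution p = N(15, 5).
    total = 0.0
    for _ in range(runs):
        x = rng.gauss(0.0, 5.0)
        if is_failure(x):
            total += normal_pdf(x, 15.0, 5.0) / normal_pdf(x, 0.0, 5.0)
    return total / runs

rng = random.Random(1)
print("naive Monte-Carlo: ", naive_mc(100_000, rng))
print("importance sampled:", importance_sampling(100_000, rng))
```

With the shifted proposal, most samples land in the dangerous region, so the importance-sampled estimate converges with far fewer runs than the naive version – but note that the technique still only scans for issues you already know to parametrize.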
Fault Tree Analysis (FTA) is an even more extreme example of handling only expected bugs. I covered it here, and said:
Simplifying a bit, FTA originated during simpler times, when mechanical / chemical engineers could think of all possible issues, and just needed a tool to compute the resulting failure frequency.
To people (like yours truly) who come from CDV (or fuzzing) and who have spent a lifetime working on automated techniques for finding bugs nobody thought of, the idea that you can just “whiteboard your way to a bug-free world” obviously sounds like a cruel joke.
To summarize (again simplifying):
- FTA assumes you know all the issues and the leaf-node probabilities, and just computes the overall failure probability
- Monte-Carlo simulation (+ importance sampling and other grid-refinement techniques) assumes you know all the issues, but don’t know where they occur and at which frequency – so you need an efficient way to scan the space and compute probabilities
- CDV assumes you don’t know all the issues – it helps you find them. Once found, you either fix them, or (if left in) use some other techniques to estimate their probability
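A bare-bones sketch of the CDV idea in the list above – constrained-random generation plus functional-coverage tracking – might look like this. All the parameter names, ranges and bucketing choices are invented for illustration; a real framework adds checkers, seed management, regression infrastructure and much more:

```python
import random

# Constraints: legal ranges for each (invented) scenario parameter.
CONSTRAINTS = {
    "cut_in_gap_m":        (0.5, 50.0),
    "rel_speed_mps":       (0.0, 15.0),
    "time_since_engage_s": (0.0, 60.0),
}

def bucket(name, value):
    # Functional coverage: split each parameter's range into three
    # buckets we want to see hit at least once.
    lo, hi = CONSTRAINTS[name]
    third = (hi - lo) / 3.0
    if value < lo + third:
        return f"{name}:low"
    if value < lo + 2 * third:
        return f"{name}:mid"
    return f"{name}:high"

def run_cdv(runs=1000, seed=0):
    rng = random.Random(seed)
    coverage = set()
    for _ in range(runs):
        # Constrained-random generation: uniform within the legal range
        # (a real generator also supports user-directed biasing).
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in CONSTRAINTS.items()}
        for k, v in params.items():
            coverage.add(bucket(k, v))
        # ... here a real flow would launch a simulation and run checkers ...
    return coverage

cov = run_cdv()
print(f"hit {len(cov)} of {3 * len(CONSTRAINTS)} coverage buckets")
```

The point of the coverage set is the feedback loop: buckets that stay unhit tell you where to redirect the randomization – which is how CDV stumbles into issues nobody catalogued in advance.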
Should regulators ignore unexpected bugs?
Back to Hardi's presentation and the question of "what should regulators check", here are some thoughts:
It may seem OK for regulators to concentrate on expected bugs, and assume AV manufacturers will find unexpected bugs: In HW design, chip designers / manufacturers indeed take full responsibility for finding all unexpected bugs themselves. Sometimes there are "compliance tests" supplied by a standards body (those correspond to expected bugs), and the HW folks will spend some (usually much smaller) effort to make sure their chip also passes those.
However, the incentives are very different: Chip designers know from bitter experience that unexpected bugs could easily kill their market or cost a fortune (remember the FDIV bug?), so they have a huge verification budget, and established techniques for finding unexpected bugs.
AV manufacturers also run huge reputational (and legal) risks if they don’t verify “enough”, but the meaning of “enough” is much less clear: Some safety standards dictate a process which involves demonstrating safety cases and so on – clearly good stuff. But these things are hard to measure / compare (and rapid release schedules make them even harder).
Also, while any kind of AV accident will be big news (at least initially), I think unexpected bugs will grab more media attention: "You mean they did not even think about a cyclist during vehicle-to-vehicle communications?" So manufacturers (and the industry as a whole) may hope for a standard way to measure "how well you looked for unexpected bugs", even if this way is imperfect.
It will be interesting to see how this develops – these are still early days. I think that if a regulatory body came up with a standard-but-customizable scenario-based way to test for both expected and unexpected bugs – a non-trivial job – then AV manufacturers would welcome that.
CDV-for-AV-verification and Five AI
While simulations and scenarios are now accepted wisdom, the idea of CDV-style, massive, try-everything simulations seems to lag behind. I talked to several AV-simulation vendors (who will remain unnamed), and asked them how they would go about randomizing the topology, the scenarios and so on in a massive-but-controlled way: The most common reaction was “Why would you want to do that?”.
In sharp contrast, Five AI, a UK-based AV company, really gets it (perhaps because some of the founders come from the chip industry). Their presentation (“Test case synthesis and simulation for autonomous system validation”, by John Redford, VP architecture) followed a path which is pretty similar to what I have been advocating in this blog.
Essentially, John said AV safety is a system-level problem: You can build buggy systems out of perfect components, and similarly your AV can be fine even if e.g. your vision system misses some things – what you need to verify is the aggregate system in many potentially-dangerous situations. Because safety-critical situations are rare, the main tool for finding bugs should be constrained-random, CDV-style, massive SW-in-the-Loop (SiL) simulations.
So he suggested taking scenarios (hand-generated or from accident data), and randomizing the hell out of them, while tracking (functional) coverage. The kind of coverage he talked about is scenario-parameters coverage, but also (and I really like this), some internal coverage like “when did my vision system stop seeing that pedestrian”. Because their sensor fusion system outputs both what it senses and its certainty about that, they can cover things like: “was a person there” cross “did the system say there was a person there” cross “how certain was the system about that”. Similarly, they want to cross “where did this object go” with “where did the AV prediction SW think it would go” and so on. They are then hoping to tweak the simulation so as to get to all “corners” of such crosses (this is known as “coverage maximization”), perhaps using Reinforcement Learning.
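A small sketch of the kind of coverage cross John described – ground truth ("was a person there") crossed with the perception output ("did the system say so") crossed with the system's bucketed certainty. The sample records and bucket thresholds below are invented, purely to show the mechanics:

```python
from itertools import product

def certainty_bucket(c):
    # Invented thresholds for bucketing the system's reported certainty.
    return "low" if c < 0.4 else ("mid" if c < 0.8 else "high")

def cross_coverage(records):
    """Each record: (person_present, system_said_person, certainty)."""
    hit = set()
    for truth, said, certainty in records:
        hit.add((truth, said, certainty_bucket(certainty)))
    return hit

# All 2 x 2 x 3 = 12 bins of the cross.
ALL_BINS = set(product([True, False], [True, False], ["low", "mid", "high"]))

records = [
    (True,  True,  0.95),   # true positive, high certainty
    (True,  False, 0.30),   # missed detection, low certainty
    (False, True,  0.55),   # false positive, mid certainty
]
hit = cross_coverage(records)
print(f"covered {len(hit)} of {len(ALL_BINS)} cross bins")
print("holes:", sorted(ALL_BINS - hit))
```

The interesting bins are things like (True, False, "high") – the system confidently denied there was a person – and coverage maximization means steering simulations until such bins are hit (or shown to be unreachable).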
John did not imply that they already have all these pieces. Nevertheless, the direction was impressive enough that I noticed several people walking up to him and asking for verification advice.
Note: As background, see this post about using Machine Learning for coverage maximization, and this long post about the various kinds of coverage (including functional coverage) and how they relate to verification efficiency and maximization.
Expect the second part of this report soon. [Added 18-July-2017: Here it is]
I’d like to thank Gil Amid, Hardi Hungar, Amiram Yehudai, Brad Templeton and Thomas (Blake) French for commenting on earlier drafts of this post.
[Added 28-June-2017: The link to the vision-issues table had problems, so replaced it with a link to the presentation about testing according to this table. The paper is here (pdf) and the table itself is here]