Summary: This post looks further into ways to compute the residual risk of ADS. In particular, it looks into some recent research directions, and concludes that (1) they show promise, and (2) they are not there yet.
My previous post about ADS Residual Risk (RR) ended with the following tantalizing promise:
A subsequent post (coming real soon) will discuss what I found when I researched two fast-moving areas (both of which have interesting lessons for ADS verification):
- Techniques for “continuous-risk” optimizations (importance sampling, adaptive stress testing, ML techniques and so on)
- Techniques from (security-related) fuzzing research
Sneak preview: The answer to “can you estimate ADS RR automatically” is still “no”. However, there is lots of promising research with implications for the general ADS verification problem.
You are now looking at that subsequent post: I read some papers and reached out to researchers, and here is a summary of some promising techniques I found.
Expect yet another RR-related post “soon”. It will also talk about the following question: Given a set of test failures, can you estimate the risks associated with them before manually debugging them?
Continuous-risk optimization techniques
There are in fact pretty good techniques for handling areas A and B (see fig. 1 of that post), where the “risk function” is (mostly) continuous. A good recent paper is A Survey of Algorithms for Black-Box Safety Validation by a bunch of (mostly-Stanford) people.
The survey discusses several techniques: Importance Sampling, Adaptive Stress Testing (AST), ML-based optimization techniques and so on. All are pretty useful when you can apply them: Not only do they let you (sometimes) estimate residual risk, but they also help you accelerate the verification process (i.e. get to bugs faster).
The survey refers to papers about several “classical” scenarios where these techniques can be used: Scenarios related to lane following, intersections, lane changes and platooning. There are also some Pegasus-related papers which do similar stuff for cut-ins etc. on highways.
One interesting question is how much these techniques can be extended: They cannot (currently) work for the kinds of discrete SW / HW / spec bugs mentioned in the previous post. But there are various intermediate cases which perhaps (with some effort) could use those techniques – a very interesting research area.
Discussion with Mark Koren: I reached out to Mark, a PhD student at Stanford and one of the authors of that survey. He essentially confirmed my intuitions:
Importance sampling is indeed good for computing the residual risk. However, it is pretty limited: You must (somehow) know what all the relevant parameters are, and their joint distribution. Also, importance sampling is not the best technique for finding risks.
Many researchers are now turning to ML-based techniques. Unlike importance-sampling-based techniques, those do not give you residual risk estimates, but they are much more efficient for finding risks (and you don’t need to know the distributions).
This still only works for mostly-continuous risks, but he claims (from their experiments) that there are many of these – more than you would expect.
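To make the importance-sampling point concrete, here is a minimal, self-contained sketch. Everything in it is a toy invented for illustration (a 1-D “system” whose failure region is a simple threshold, not a real ADS model): the operational distribution is N(0, 1), failures are rare under it, and a proposal distribution centered on the failure region makes the estimate tractable — which is exactly why you must know the operational distribution up front.

```python
import math
import random

random.seed(0)

def failure(x):
    # Toy "risk function": the system fails when the disturbance exceeds 3.
    return x > 3.0

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_sampling_rr(n=100_000):
    # Operational distribution p = N(0, 1); proposal q = N(3, 1) pushes
    # samples toward the failure region so failures are actually observed.
    total = 0.0
    for _ in range(n):
        x = random.gauss(3.0, 1.0)                             # sample from q
        w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 3.0, 1.0)  # weight p(x)/q(x)
        if failure(x):
            total += w
    return total / n

est = importance_sampling_rr()
print(f"estimated failure probability: {est:.5f}")  # true value is 1 - Phi(3), about 0.00135
```

Plain Monte Carlo sampling from N(0, 1) would see a failure only about once per 740 samples; the proposal distribution hits the failure region in roughly half its samples, and the weights correct the bias.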
Discussion with Ritchie Lee: I had an email discussion with Ritchie, another author of the survey, who is a research scientist at NASA Ames. The following is from one of his emails:
AST doesn’t technically distinguish between the various types of failures, so it could be useful in all three categories. But I do understand the distinction you are making. For example, in the top left category, importance sampling can be used to estimate the risk, but AST can still be used to find the most likely failure scenario. In the bottom category, AST doesn’t work as well for discrete spaces, mostly due to the distance function. I’ve seen some work in hybrid systems that define hybrid distances that may help. We also previously talked about the idea of combining AST with formal verification approaches.
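To illustrate the AST idea Ritchie describes, here is a toy sketch in which every number and model is invented for illustration: a crude car-following simulation where the disturbance is the lead car’s acceleration, and a simple hill-climbing search — a stand-in for the RL/MCTS solvers used in real AST work — looks for the most likely disturbance sequence that causes a collision. The objective rewards likelihood of the disturbances, gives a large bonus on failure, and otherwise penalizes by the distance-to-failure (the minimum gap reached) — the distance function Ritchie mentions.

```python
import random

random.seed(1)

DT, HORIZON = 0.5, 20          # toy simulation: 10 seconds in 0.5 s steps
SIGMA = 1.0                    # disturbance model: lead accel ~ N(0, 1) m/s^2

def simulate(disturbances):
    """Toy car-following sim. Returns the minimum gap reached (<= 0 means collision)."""
    gap, ego_v, lead_v = 10.0, 20.0, 20.0
    min_gap = gap
    for a_lead in disturbances:
        lead_v = max(0.0, lead_v + a_lead * DT)
        # Crude ego controller: brake on closing speed and gap deficit, clamped.
        ego_a = max(-6.0, min(2.0, 0.5 * (gap - 10.0) + 1.0 * (lead_v - ego_v)))
        ego_v = max(0.0, ego_v + ego_a * DT)
        gap += (lead_v - ego_v) * DT
        min_gap = min(min_gap, gap)
    return min_gap

def score(disturbances):
    # AST objective: log-likelihood of the disturbance sequence, plus a large
    # bonus on failure; otherwise a penalty shaped by the distance to failure.
    log_lik = sum(-0.5 * (a / SIGMA) ** 2 for a in disturbances)
    min_gap = simulate(disturbances)
    if min_gap <= 0.0:
        return log_lik + 1000.0, min_gap
    return log_lik - 100.0 * min_gap, min_gap

# Hill climbing over disturbance sequences (real AST uses RL or MCTS here).
best = [random.gauss(0.0, SIGMA) for _ in range(HORIZON)]
best_score, best_gap = score(best)
for _ in range(3000):
    cand = [a + random.gauss(0.0, 0.3) for a in best]
    s, g = score(cand)
    if s > best_score:
        best, best_score, best_gap = cand, s, g

print(f"best min gap found: {best_gap:.2f} m (<= 0 means collision)")
```

Note how the continuous distance-to-failure term is what makes the search work: it gives the optimizer a gradient toward failure. For the discrete SW/HW/spec bugs discussed earlier there is no such natural distance, which is exactly the limitation Ritchie points at.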
Techniques from (security-related) fuzzing
In this section we are going to look at area C (discrete risks, unknown distributions, often HW/SW/spec bugs), to see whether it can be tackled in some other way. We are going to look for a solution in the fast-moving area of fuzzing-for-security. Sneak preview: No luck there either.
While (security-related) fuzzing is a different field, I find fuzzing research to be very relevant for ADS safety verification. As I explained here:
For better or worse, security is hot now. So it receives a lot of funding and mindshare, which means that we (verification folks) can gain by tracking what works for security, and trying to translate it into our domain.
For a good introduction to fuzzing take a look at the interactive fuzzing book. There is, of course, the ever-annoying issue of different terminologies: For instance, what the fuzzing folks call a “fuzzing campaign” may be more familiar to you as a “regression” or “the set of runs executed over a night / weekend”.
There are lots of relevant fuzzing papers, but in this post I’ll talk about:
The STADS framework: The STADS (Software Testing As Species Discovery) framework suggests a very useful way to estimate the residual risk remaining after a fuzzing campaign: As the name implies, it borrows from techniques ecologists use for estimating the number of animal species (in, say, a tropical rain forest). It uses these techniques to estimate the number of undiscovered vulnerabilities.
As such, it looks like an important step towards automating the ADS RR computation. However, it falls far short of solving it (though future research may change that, and the STADS paper, including its discussion of challenges and opportunities, is a good start).
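To get a feel for the species-discovery idea, here is a minimal sketch of the Chao1 species-richness estimator, one of the classical ecology estimators the STADS framework builds on. The intuition: if many discovered species were seen only once, you have probably not seen everything yet. The bug labels and counts below are hypothetical.

```python
from collections import Counter

def chao1(observations):
    """Chao1 lower-bound estimate of the total number of species.

    `observations` lists the species label of each sighting -- here, which
    distinct bug / coverage "species" each interesting test input mapped to.
    """
    counts = Counter(observations)
    s_obs = len(counts)                             # species observed so far
    f1 = sum(1 for c in counts.values() if c == 1)  # singletons (seen once)
    f2 = sum(1 for c in counts.values() if c == 2)  # doubletons (seen twice)
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2.0          # bias-corrected variant
    return s_obs + f1 * f1 / (2.0 * f2)

# Hypothetical campaign: each label is the bug class a failing input hit.
campaign = ["b1"] * 40 + ["b2"] * 12 + ["b3", "b3"] + ["b4"]
print(chao1(campaign))  # 4 species observed, f1 = 1, f2 = 1 -> estimate 4.5
```

The gap between the estimate and the observed count (here 0.5 of a species) is what STADS turns into a stopping / residual-risk signal for the fuzzing campaign.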
In what sense does STADS fall far short of automating ADS RR computation?
ADS RR needs a different interpretation of “residual risk”: The STADS paper defines residual risk more-or-less as the approximate number of undiscovered vulnerabilities in the tested system which could be discovered by the current fuzzer configuration, if the current “fuzzing campaign” were to continue forever (without user intervention).
This is obviously different from what we need for ADS RR estimation (RRE).
ADS RR needs the bug’s probability: Residual risk can be computed roughly as sum(probability * severity), for all remaining risks.
So for each distinct bug, we need to know what is the probability it will be encountered (in the “expected” or operational distribution, according to the expected Operational Design Domain). STADS does not have to worry about that (though the STADS paper references some discussion about expected distributions).
The main reason bug probability is much less important in security research (and practice) is the assumption of malicious intent: Once the bad guys find a vulnerability (e.g. by fuzzing), they will suddenly make it very probable (by starting to exploit it).
This is obviously not the case for ADS RRE, and currently the practical way to assess a bug’s probability is to manually debug it (though there are of course many techniques to partially-automate that). Then you can try to assess the probability of the combination of conditions which might cause it. This is hard to do, so typically people “just fix the bug”, but sometimes (when fixing it involves hard tradeoffs) people will perform this analysis.
More on that in a subsequent post.
ADS RR needs the bug’s severity: An ADS bug which typically causes death is different from an ADS bug which typically causes the ADS to double-park (though both involve risk).
Again, the practical way to assess a bug’s severity is to manually debug and understand it. If it cannot be completely removed (because it involves hard tradeoffs), then we may be able to use Importance Sampling etc. to estimate the residual risk from that problem.
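The rough sum(probability * severity) computation above can be sketched as follows. All bug names, probabilities and severities here are invented for illustration; in practice each number would come out of the manual debug and analysis just described.

```python
# Hypothetical bugs remaining after a campaign. "probability" is the chance of
# encountering the bug (per hour of driving, under the operational distribution
# implied by the ODD); "severity" is harm on an arbitrary 0..1 scale.
remaining_bugs = [
    {"name": "phantom-brake-on-shadow", "probability": 1e-6, "severity": 0.3},
    {"name": "missed-cyclist-at-dusk",  "probability": 1e-8, "severity": 1.0},
    {"name": "double-park-on-pickup",   "probability": 1e-4, "severity": 0.01},
]

# Residual risk ~ sum(probability * severity) over all remaining risks.
residual_risk = sum(b["probability"] * b["severity"] for b in remaining_bugs)
print(f"residual risk: {residual_risk:.3e}")
```

The formula is trivial; the whole difficulty (and the point of this post) is that both inputs per bug currently require manual debugging to obtain.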
ADS RR needs to take into account human intervention: Typically, you run a single “regression” over a period (night / weekend etc.) with no human intervention. That regression can be thought of as a “fuzzing campaign” using a set of manually-configured fuzzers, some of which are “adaptive” (i.e. they actively search for bugs during the regression based on information discovered earlier in the same regression).
Then engineers look at the results, try to understand the bugs, fix some, and adjust the fuzzers (or invent new ones) according to what they learned. The process is then repeated in the next period. See this post for a (longish) overview of this process.
This continues for months, and our ADS RR question is “what is the residual risk after all those months”.
ADS RR needs to handle gray areas: This may be viewed as an extension of the previous items: The automatic testing mechanism employed during the “fuzzing campaign” will have to mark many of the issues it finds as “gray”. They may be real bugs, or may be OK – only manual debugging can tell.
Following that manual debug, engineers may change the “testbench logic” so that in the future some of the gray area will now be designated “black” (a real error) and some “white” (not an error), but some will stay gray.
Discussion with Marcel Böhme: I had an email discussion with Marcel (author of the STADS paper), and he indeed said:
My definition of residual risk is the probability that the next generated test input exposes an error (when none has been found). This somewhat takes care of your first point. However, it does require extension to account for bug severity and human intervention (your two other points).
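Marcel’s definition can be illustrated with a Good-Turing style estimator, which this line of fuzzing research builds on: the fraction of “singleton” species (those hit by exactly one input so far) estimates the probability that the next input discovers something new. The species labels and counts below are hypothetical.

```python
def residual_risk_good_turing(species_counts, n_inputs):
    """Good-Turing estimate of the probability that the NEXT generated input
    exposes something not seen before (a proxy for Marcel's residual risk).

    `species_counts` maps each discovered species (bug / coverage class) to
    the number of inputs that hit it; `n_inputs` is the total inputs run."""
    singletons = sum(1 for c in species_counts.values() if c == 1)
    return singletons / n_inputs

# Hypothetical campaign: 10,000 inputs, three species seen exactly once.
counts = {"crash-A": 512, "hang-B": 1, "crash-C": 1, "assert-D": 1}
print(residual_risk_good_turing(counts, 10_000))  # 3 / 10000 = 0.0003
```

Note what this estimator does and does not capture: it is a probability per generated input, under the fuzzer’s own input distribution, with every discovery weighted equally – which is exactly why the severity, operational-distribution and human-intervention gaps listed above remain open for ADS RRE.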
My current intuition is that there is still no good automatic solution for computing ADS RR. But I am very interested in researching this further – comments are very welcome.
I’d like to (again) thank Rahul Razdan, Andreas Zeller, Marcel Böhme, Amiram Yehudai, Mark Koren, Ritchie Lee, Valentin Nikonov, Gil Amid, Ziv Binyamini, Justina Zander, Ahmed Nassar and Sharon Rosenberg for their help in researching this topic.