Summary: This post looks further into ways to compute the residual risk of ADS. In particular, it looks into some recent research directions, and concludes that (1) they show promise, and (2) they are not there yet.
My previous post about ADS Residual Risk (RR) ended with the following tantalizing promise:
A subsequent post (coming real soon) will discuss what I found when I researched two fast-moving areas (both of which have interesting lessons for ADS verification):
- Techniques for “continuous-risk” optimizations (importance sampling, adaptive stress testing, ML techniques and so on)
- Techniques from (security-related) fuzzing research
Sneak preview: The answer to “can you estimate ADS RR automatically” is still “no”. However, there is lots of promising research with implications for the general ADS verification problem.
You are now looking at that subsequent post: I read some papers and reached out to researchers, and here is a summary of some promising techniques I found.
Expect yet another RR-related post “soon”. It will also talk about the following question: Given a set of test failures, can you estimate the risks associated with them before manually debugging them?
Continuous-risk optimization techniques
There are in fact pretty good techniques for handling areas A and B (see fig. 1 of that post), where the “risk function” is (mostly) continuous. A good recent paper is A Survey of Algorithms for Black-Box Safety Validation by a bunch of (mostly-Stanford) people.
The survey discusses several techniques: Importance Sampling, Adaptive Stress Testing (AST), ML-based optimization techniques and so on. All are pretty useful when you can apply them: Not only do they let you (sometimes) estimate residual risk, but they also help you accelerate the verification process (i.e. get to bugs faster).
The survey refers to papers about several “classical” scenarios where these techniques can be used: Scenarios related to lane following, intersections, lane changes and platooning. There are also some Pegasus-related papers which do similar stuff for cut-ins etc. on highways.
One interesting question is how much these techniques can be extended: They cannot (currently) work for the kinds of discrete SW / HW / spec bugs mentioned in the previous post. But there are various intermediate cases which perhaps (with some effort) could use those techniques – a very interesting research area.
Discussion with Mark Koren: I reached out to Mark, a PhD student at Stanford and one of the authors of that survey. He essentially confirmed my intuitions:
Importance sampling is indeed good for computing the residual risk. However, it is pretty limited: You must (somehow) know what all the relevant parameters are, and their joint distribution. Also, importance sampling is not the best technique for finding risks.
Many researchers are now turning to ML-based techniques. Unlike importance-sampling-based techniques, those do not give you residual risk estimates, but they are much more efficient for finding risks (and you don’t need to know the distributions).
This still only works for mostly-continuous risks, but he claims (from their experiments) that there are many of these – more than you would expect.
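To make the importance-sampling point concrete, here is a minimal, self-contained sketch. Everything in it is a toy invented for illustration (a 1-D “system” whose failure region is a simple threshold, not a real ADS model): the operational distribution is N(0, 1), failures are rare under it, and a proposal distribution centered on the failure region makes the estimate tractable — which is exactly why you must know the operational distribution up front.

```python
import math
import random

random.seed(0)

def failure(x):
    # Toy "risk function": the system fails when the disturbance exceeds 3.
    return x > 3.0

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_sampling_rr(n=100_000):
    # Operational distribution p = N(0, 1); proposal q = N(3, 1) pushes
    # samples toward the failure region so failures are actually observed.
    total = 0.0
    for _ in range(n):
        x = random.gauss(3.0, 1.0)                             # sample from q
        w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 3.0, 1.0)  # weight p(x)/q(x)
        if failure(x):
            total += w
    return total / n

est = importance_sampling_rr()
print(f"estimated failure probability: {est:.5f}")  # true value is 1 - Phi(3), about 0.00135
```

Plain Monte Carlo sampling from N(0, 1) would see a failure only about once per 740 samples; the proposal distribution hits the failure region in roughly half its samples, and the weights correct the bias.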
Discussion with Ritchie Lee: I had an email discussion with Ritchie, another author of the survey, who is a research scientist at NASA Ames. The following is from one of his emails:
AST doesn’t technically distinguish between the various types of failures, so it could be useful in all three categories. But I do understand the distinction you are making. For example, in the top left category, importance sampling can be used to estimate the risk, but AST can still be used to find the most likely failure scenario. In the bottom category, AST doesn’t work as well for discrete spaces, mostly due to the distance function. I’ve seen some work in hybrid systems that define hybrid distances that may help. We also previously talked about the idea of combining AST with formal verification approaches.
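To illustrate the AST idea Ritchie describes, here is a toy sketch in which every number and model is invented for illustration: a crude car-following simulation where the disturbance is the lead car’s acceleration, and a simple hill-climbing search — a stand-in for the RL/MCTS solvers used in real AST work — looks for the most likely disturbance sequence that causes a collision. The objective rewards likelihood of the disturbances, gives a large bonus on failure, and otherwise penalizes by the distance-to-failure (the minimum gap reached) — the distance function Ritchie mentions.

```python
import random

random.seed(1)

DT, HORIZON = 0.5, 20          # toy simulation: 10 seconds in 0.5 s steps
SIGMA = 1.0                    # disturbance model: lead accel ~ N(0, 1) m/s^2

def simulate(disturbances):
    """Toy car-following sim. Returns the minimum gap reached (<= 0 means collision)."""
    gap, ego_v, lead_v = 10.0, 20.0, 20.0
    min_gap = gap
    for a_lead in disturbances:
        lead_v = max(0.0, lead_v + a_lead * DT)
        # Crude ego controller: brake on closing speed and gap deficit, clamped.
        ego_a = max(-6.0, min(2.0, 0.5 * (gap - 10.0) + 1.0 * (lead_v - ego_v)))
        ego_v = max(0.0, ego_v + ego_a * DT)
        gap += (lead_v - ego_v) * DT
        min_gap = min(min_gap, gap)
    return min_gap

def score(disturbances):
    # AST objective: log-likelihood of the disturbance sequence, plus a large
    # bonus on failure; otherwise a penalty shaped by the distance to failure.
    log_lik = sum(-0.5 * (a / SIGMA) ** 2 for a in disturbances)
    min_gap = simulate(disturbances)
    if min_gap <= 0.0:
        return log_lik + 1000.0, min_gap
    return log_lik - 100.0 * min_gap, min_gap

# Hill climbing over disturbance sequences (real AST uses RL or MCTS here).
best = [random.gauss(0.0, SIGMA) for _ in range(HORIZON)]
best_score, best_gap = score(best)
for _ in range(3000):
    cand = [a + random.gauss(0.0, 0.3) for a in best]
    s, g = score(cand)
    if s > best_score:
        best, best_score, best_gap = cand, s, g

print(f"best min gap found: {best_gap:.2f} m (<= 0 means collision)")
```

Note how the continuous distance-to-failure term is what makes the search work: it gives the optimizer a gradient toward failure. For the discrete SW/HW/spec bugs discussed earlier there is no such natural distance, which is exactly the limitation Ritchie points at.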
Techniques from (security-related) fuzzing
In this section we are going to look at area C (discrete risks, unknown distributions, often HW/SW/spec bugs), to see whether it can be tackled in some other way. We are going to look for a solution in the fast-moving area of fuzzing-for-security. Sneak preview: No luck there either.
While (security-related) fuzzing is a different field, I find fuzzing research to be very relevant for ADS safety verification. As I explained here:
For better or worse, security is hot now. So it receives a lot of funding and mindshare, which means that we (verification folks) can gain by tracking what works for security, and trying to translate it into our domain.
For a good introduction to fuzzing take a look at the interactive fuzzing book. There is, of course, the ever-annoying issue of different terminologies: For instance, what the fuzzing folks call a “fuzzing campaign” may be more familiar to you as a “regression” or “the set of runs executed over a night / weekend”.
There are lots of relevant fuzzing papers, but in this post I’ll talk about:
The STADS framework: The STADS (Software Testing As Species Discovery) framework suggests a very useful way to estimate the residual risk remaining after a fuzzing campaign: As the name implies, it borrows from techniques ecologists use for estimating the number of animal species (in, say, a tropical rain forest). It uses these techniques to estimate the number of undiscovered vulnerabilities.
As such, it looks like an important step towards automating the ADS RR computation. However, it falls far short of solving it (though future research may change that, and the STADS paper, including its discussion of challenges and opportunities, is a good start).
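To get a feel for the species-discovery idea, here is a minimal sketch of the Chao1 species-richness estimator, one of the classical ecology estimators the STADS framework builds on. The intuition: if many discovered species were seen only once, you have probably not seen everything yet. The bug labels and counts below are hypothetical.

```python
from collections import Counter

def chao1(observations):
    """Chao1 lower-bound estimate of the total number of species.

    `observations` lists the species label of each sighting -- here, which
    distinct bug / coverage "species" each interesting test input mapped to.
    """
    counts = Counter(observations)
    s_obs = len(counts)                             # species observed so far
    f1 = sum(1 for c in counts.values() if c == 1)  # singletons (seen once)
    f2 = sum(1 for c in counts.values() if c == 2)  # doubletons (seen twice)
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2.0          # bias-corrected variant
    return s_obs + f1 * f1 / (2.0 * f2)

# Hypothetical campaign: each label is the bug class a failing input hit.
campaign = ["b1"] * 40 + ["b2"] * 12 + ["b3", "b3"] + ["b4"]
print(chao1(campaign))  # 4 species observed, f1 = 1, f2 = 1 -> estimate 4.5
```

The gap between the estimate and the observed count (here 0.5 of a species) is what STADS turns into a stopping / residual-risk signal for the fuzzing campaign.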
In what sense does STADS fall far short of automating ADS RR computation?
ADS RR needs a different interpretation of “residual risk”: The STADS paper defines residual risk more-or-less as the approximate number of undiscovered vulnerabilities in the tested system which could be discovered by the current fuzzer configuration, if the current “fuzzing campaign” were to continue forever (without user intervention).
This is obviously different from what we need for ADS RR estimation (RRE).
ADS RR needs the bug’s probability: Residual risk can be computed roughly as sum(probability * severity), for all remaining risks.
So for each distinct bug, we need to know what is the probability it will be encountered (in the “expected” or operational distribution, according to the expected Operational Design Domain). STADS does not have to worry about that (though the STADS paper references some discussion about expected distributions).
The main reason bug probability is much less important in security research (and practice) is the assumption of malicious intent: Once the bad guys find a vulnerability (e.g. by fuzzing), they will suddenly make it very probable (by starting to exploit it).
This is obviously not the case for ADS RRE, and currently the practical way to assess a bug’s probability is to manually debug it (though there are of course many techniques to partially-automate that). Then you can try to assess the probability of the combination of conditions which might cause it. This is hard to do, so typically people “just fix the bug”, but sometimes (when fixing it involves hard tradeoffs) people will perform this analysis.
More on that in a subsequent post.
ADS RR needs the bug’s severity: An ADS bug which typically causes death is different from an ADS bug which typically causes the ADS to double-park (though both involve risk).
Again, the practical way to assess a bug’s severity is to manually debug and understand it. If it cannot be completely removed (because it involves hard tradeoffs), then we may be able to use Importance Sampling etc. to estimate the residual risk from that problem.
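The rough sum(probability * severity) computation above can be sketched as follows. All bug names, probabilities and severities here are invented for illustration; in practice each number would come out of the manual debug and analysis just described.

```python
# Hypothetical bugs remaining after a campaign. "probability" is the chance of
# encountering the bug (per hour of driving, under the operational distribution
# implied by the ODD); "severity" is harm on an arbitrary 0..1 scale.
remaining_bugs = [
    {"name": "phantom-brake-on-shadow", "probability": 1e-6, "severity": 0.3},
    {"name": "missed-cyclist-at-dusk",  "probability": 1e-8, "severity": 1.0},
    {"name": "double-park-on-pickup",   "probability": 1e-4, "severity": 0.01},
]

# Residual risk ~ sum(probability * severity) over all remaining risks.
residual_risk = sum(b["probability"] * b["severity"] for b in remaining_bugs)
print(f"residual risk: {residual_risk:.3e}")
```

The formula is trivial; the whole difficulty (and the point of this post) is that both inputs per bug currently require manual debugging to obtain.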
ADS RR needs to take into account human intervention: Typically, you run a single “regression” over a period (night / weekend etc.) with no human intervention. That regression can be thought of as a “fuzzing campaign” using a set of manually-configured fuzzers, some of which are “adaptive” (i.e. they actively search for bugs during the regression based on information discovered earlier in the same regression).
Then engineers look at the results, try to understand the bugs, fix some, and adjust the fuzzers (or invent new ones) according to what they learned. The process is then repeated in the next period. See this post for a (longish) overview of this process.
This continues for months, and our ADS RR question is “what is the residual risk after all those months”.
ADS RR needs to handle gray areas: This may be viewed as an extension of the previous items: The automatic testing mechanism employed during the “fuzzing campaign” will have to mark many of the issues it finds as “gray”. They may be real bugs, or may be OK – only manual debugging can tell.
Following that manual debug, engineers may change the “testbench logic” so that in the future some of the gray area will now be designated “black” (a real error) and some “white” (not an error), but some will stay gray.
Discussion with Marcel Böhme: I had an email discussion with Marcel (author of the STADS paper), and he indeed said:
My definition of residual risk is the probability that the next generated test input exposes an error (when none has been found). This somewhat takes care of your first point. However, it does require extension to account for bug severity and human intervention (your two other points).
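Marcel’s definition can be illustrated with a Good-Turing style estimator, which this line of fuzzing research builds on: the fraction of “singleton” species (those hit by exactly one input so far) estimates the probability that the next input discovers something new. The species labels and counts below are hypothetical.

```python
def residual_risk_good_turing(species_counts, n_inputs):
    """Good-Turing estimate of the probability that the NEXT generated input
    exposes something not seen before (a proxy for Marcel's residual risk).

    `species_counts` maps each discovered species (bug / coverage class) to
    the number of inputs that hit it; `n_inputs` is the total inputs run."""
    singletons = sum(1 for c in species_counts.values() if c == 1)
    return singletons / n_inputs

# Hypothetical campaign: 10,000 inputs, three species seen exactly once.
counts = {"crash-A": 512, "hang-B": 1, "crash-C": 1, "assert-D": 1}
print(residual_risk_good_turing(counts, 10_000))  # 3 / 10000 = 0.0003
```

Note what this estimator does and does not capture: it is a probability per generated input, under the fuzzer’s own input distribution, with every discovery weighted equally – which is exactly why the severity, operational-distribution and human-intervention gaps listed above remain open for ADS RRE.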
My current intuition is that there is still no good automatic solution for computing ADS RR. But I am very interested in researching this further – comments are very welcome.
I’d like to (again) thank Rahul Razdan, Andreas Zeller, Marcel Böhme, Amiram Yehudai, Mark Koren, Ritchie Lee, Valentin Nikonov, Gil Amid, Ziv Binyamini, Justina Zander, Ahmed Nassar and Sharon Rosenberg for their help in researching this topic.