Estimating the residual risk of ADAS / AV

Summary: This post discusses estimating the residual risk of an ADAS / AV design. It looks at the ADS total-risk map, divides it into several areas, and describes ways to estimate the residual risk of each.

People who deal with ADS (Autonomous Driving Systems, a term which includes both ADAS and Autonomous Vehicles) are naturally very interested in estimating the residual risk in their ADS (after having done a lot of testing-and-fixing). They are willing to settle for an approximate estimate (with some reasonable error bounds).

Also, some of them are hoping for a fully-automatic, push-button way to do it.

What do they mean by “estimating the residual risk”? A somewhat blunt (and slightly inaccurate) way to put it is: “For the current version of my ADS, estimate how much total damage (deaths, injuries, property damage etc.) it will cause per 1B km”. For instance, human-driven vehicles in the US cause about 7 deaths per 1B km, and clearly they want the ADS to be safer (so as to maintain a “positive risk balance”).

More on risks in general, and the various interpretations of “residual risk”, below.

I’ll try to show that it is possible to estimate ADS Residual Risk (RR) using a combination of techniques. Some of those require heuristics and human intervention (but automation and a good methodology can make it easier).

I’ll also try to clarify why RR estimation cannot be fully-automated: Residual risk really consists of two pieces: Residual risk from continuous, known issues with known input distributions (for which you can fully-automate RR estimation), and residual risk from everything else (for which you cannot).

Let me elaborate on that:

Digging a bit deeper

Residual risk from continuous, known issues with known input distributions: Consider automatic braking for pedestrians. However you design it, you’ll still have some risk: Braking later increases the risk of hitting the pedestrian, but braking earlier increases the risk of the vehicle behind hitting the ADS.

The risk function here is (mostly) continuous, and we (hopefully) know all the input parameters influencing the result. Further, we know the joint distribution of those parameters in our target Operational Design Domain.

For this category, there are efficient statistical techniques for simulation-based RR estimation: For instance, you can use Monte Carlo integration techniques (Importance Sampling, MCMC etc.).
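To make this concrete, here is a minimal importance-sampling sketch in Python (the numbers, the parameter choice and the toy “simulator” are all invented for illustration): instead of sampling a scenario parameter from its nominal ODD distribution (under which failures are rare), we sample from a proposal shifted toward the risky region, and re-weight each sample by the likelihood ratio:

```python
import numpy as np

# Minimal importance-sampling sketch for an "area A" style risk:
# the probability that automatic braking fails to avoid a pedestrian.
# The distributions, thresholds and toy "simulator" are all hypothetical.

rng = np.random.default_rng(0)

def collision(ttc_at_detection):
    # Stand-in for a full simulation run: declare a "collision" if the
    # time-to-collision at pedestrian detection is under 0.8 seconds.
    return ttc_at_detection < 0.8

nominal_mean, sigma = 4.0, 0.8   # assumed ODD distribution of TTC (seconds)
proposal_mean = 1.0              # proposal shifted toward the risky region

n = 100_000
x = rng.normal(proposal_mean, sigma, n)             # sample from proposal q
# Likelihood ratio p(x)/q(x) for two normals with equal sigma:
log_w = ((x - proposal_mean)**2 - (x - nominal_mean)**2) / (2 * sigma**2)
w = np.exp(log_w)
estimate = np.mean(w * collision(x))                # collision(x) vectorizes
print(f"Estimated collision probability per scenario: {estimate:.2e}")
```

With a good proposal, a few thousand runs can estimate a failure probability that naive sampling from the nominal distribution would almost never observe.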

The description above is terribly oversimplified: the assumptions listed (that we know the joint distribution of all relevant parameters, and so on) are often wrong. We also need to check that our simulator correctly represents real-world physics, that our statistical information is still relevant in this new ADS world, and so on. But let’s ignore those complications for now.

Residual risk from everything else: This includes, for instance, residual risks from unknown logical bugs, e.g. HW / SW / spec bugs. Here are some examples (taken from this post):

Suppose your AV did not stop fast enough for a cyclist, and after some debugging it turned out that this was caused by a SW bug which only happens if the cyclist appears within the first few seconds after switching from manual driving to AV driving. Or if the cyclist appears while the AV is receiving an urgent vehicle-to-vehicle message. Or suppose your sensor fusion ML system was never trained on cases where a cyclist is approaching while there is traffic below (say, under the bridge you are currently travelling on). Each of these is an unexpected bug (until it gets discovered, usually by testing edge cases). Spec bugs (e.g. you just never thought of a tsunami approaching the AV – see this post) are also unexpected bugs.

Those unexpected bugs mostly correspond to SOTIF’s “unknown unsafe conditions”. The most efficient way to find such bugs is Coverage Driven Verification (CDV) – more on this later. But how about estimating the related residual risk?

Estimating the residual risk from everything else: In this post I suggested a process for doing that:

  • Use an efficient bug-finding methodology (e.g. CDV) for a while, constantly adapting it as bugs are found
  • Wait until the bug-finding rate goes below a certain threshold
  • Now take (say) the last two weeks’ worth of bugs
  • For each such bug, estimate its risk (i.e. expected frequency * severity) if it were not removed
  • Aggregate the estimated risk over all those bugs, and multiply by some healthy factor
  • This will give you some reasonable estimate of the residual risk (if you stopped testing and fixing right now).

Note that this process involves human judgement (e.g. in estimating the frequency and severity of the bugs found), so it is not fully automatic. Still, it is the best we can do (and CDV + good tools make it easier).
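For illustration, here is the aggregation arithmetic of the last three steps as a tiny Python sketch (the bug names, frequencies and severities are invented; in practice they come from human triage):

```python
# Minimal sketch of the aggregation step above. The bug list, frequency
# and severity numbers are hypothetical; in practice they come from
# human judgement during debug/triage.

# (bug, expected occurrences per 1B km if left unfixed, expected harm per
#  occurrence in some common severity unit, e.g. equivalent fatalities)
recent_bugs = [
    ("late braking after manual-to-AV handoff", 0.5, 0.10),
    ("planner crash under V2V message burst",   2.0, 0.01),
    ("fusion miss: cyclist + traffic below",    0.2, 0.30),
]

safety_factor = 5  # "healthy factor" to cover bugs you have NOT yet found

rr_estimate = safety_factor * sum(freq * severity
                                  for _, freq, severity in recent_bugs)
print(f"Residual-risk estimate: {rr_estimate:.2f} harm units per 1B km")
```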

Of course, you could estimate RR fully-automatically by running enough fully-representative simulations, using the expected distribution of everything. But this is impractical: there is a very long tail of issues, and the unknown cases are extremely hard to find, which is why you need CDV.
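A back-of-the-envelope calculation (using the statistical “rule of three”, with illustrative target numbers) shows just how impractical:

```python
# Why "just run representative simulations" is impractical: the rule of
# three says that after n failure-free trials, the 95% upper confidence
# bound on the failure probability is roughly 3/n. The target number
# below is illustrative.

target_rate = 7e-9   # e.g. aiming below ~7 harmful events per 1B km,
                     # with one "trial" standing in for one km driven
runs_needed = 3 / target_rate
print(f"Failure-free km needed for 95% confidence: {runs_needed:.1e}")
# ~4.3e8 km -- and even that only bounds the *average* risk; it says
# nothing about which rare cases remain in the long tail.
```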

Some further notes:

  • The above is (again) oversimplified for the sake of brevity
  • Phil Koopman talks here about the related topic of surprise metrics
  • Bug seeding is one technique for assessing how thorough your verification is (but it has limitations – see the sketch below).
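For reference, the classic defect-seeding estimate is simple arithmetic (the numbers below are invented, and the estimate assumes seeded bugs are as hard to find as real ones – which is the main limitation):

```python
# Minimal sketch of the classic bug-seeding (defect-seeding) estimate.
# Assumes seeded bugs are as hard to find as real ones -- the main
# limitation mentioned above. Numbers are hypothetical.

seeded_total = 50    # bugs deliberately injected
seeded_found = 40    # injected bugs your verification caught
real_found   = 120   # real bugs caught by the same effort

detection_rate = seeded_found / seeded_total   # 0.8
est_real_total = real_found / detection_rate   # ~150
est_remaining  = est_real_total - real_found   # ~30 still hiding
print(f"Estimated remaining real bugs: {est_remaining:.0f}")
```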

The ADS total-risk map

Let us now look at the total-risk map. It actually consists of three main risk areas (because the “everything else” case further splits into continuous and discrete risks).

Figure 1 below tries to explain all three in one (slightly-dense) picture:

[Figure 1: The ADS total-risk map. Area A: continuous, known issues with known input distributions. Areas B and C: everything else, split into continuous issues (area B) and discrete issues such as logic / HW / SW / spec bugs (area C).]

Notes about areas B and C:

  • Residual risk (sum of probability * severity) cannot be estimated automatically for them
  • They start out mostly unknown (like SOTIF’s unknown unsafe conditions)
  • As you discover issues, you try to fix them or to move them to area A
  • They cannot be ignored

You cannot ignore areas B and C: At the risk of stating the obvious, let me clarify why:

Consider last year’s AAA findings regarding problems ADS have in handling cars stalled on the highway, parked cars and pedestrians in various situations: These all seem (at least at first glance) to reside in areas B or C.

Or consider the paper A Comprehensive Study of Autonomous Vehicle Bugs, which analyzed ~500 bugs reported for two open-source AV projects (Baidu Apollo and Autoware). Most of the (admittedly early-stage) bugs found were in area C: logic errors, bugs which crash the AV SW, and so on. Interestingly, the paper makes the point that many of the bugs (e.g. planning bugs, the biggest category) can be found in sensor-bypass (object-level) mode.

And so on: There is a very long tail of areas B and C issues.

Using Coverage-Driven Verification: The most pragmatic, efficient way to handle all three areas is Coverage Driven Verification (CDV). In a nutshell (see this post for more details, and the toy sketch after the list), CDV lets you:

  • Plan the verification job
  • Hunt efficiently for problems in huge coverage spaces
  • Use various optimization techniques
  • Track where you are relative to the plan
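As a toy illustration of what “coverage” means here (not any particular tool’s API; the scenario parameters are invented), consider:

```python
import itertools, random

# Minimal sketch of CDV-style coverage tracking. We define a small
# coverage space over scenario parameters, run randomized scenarios,
# and track progress against the plan.

actor      = ["pedestrian", "cyclist", "stalled car"]
ego_speed  = ["<30kph", "30-60kph", ">60kph"]
visibility = ["clear", "rain", "night"]

plan    = set(itertools.product(actor, ego_speed, visibility))  # 27 bins
covered = set()

random.seed(0)
for _ in range(60):                  # run randomized scenarios
    scenario = (random.choice(actor),
                random.choice(ego_speed),
                random.choice(visibility))
    # ... execute the scenario in simulation, run the checkers ...
    covered.add(scenario)

print(f"Coverage: {len(covered)}/{len(plan)} bins "
      f"({100 * len(covered) / len(plan):.0f}%)")
for gap in sorted(plan - covered):
    print("uncovered:", gap)        # holes to target with directed tests
```

Real coverage spaces are of course huge, which is why the “hunt efficiently” and “optimization” bullets above matter.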

Note: My company, Foretellix, uses CDV to efficiently handle all three areas. However, this post is not about Foretellix.

More on risks

The topic of risk is, of course, huge: you could write books about it (and indeed many people have). This section touches on just a few relevant sub-topics.

The meaning of “residual risk”: An annoying fact about verification is that people find it hard to agree on terminology. “Residual risk” is no exception: everybody agrees that it means “the risk which remains, after you already account for X”. However, there seems to be less agreement about what, exactly, X is.

I tend to go with Wikipedia, which defines “residual risk” simply as the actual risk in the released vehicle (after accounting for the various protections / mitigations).

Bugs vs. risks: I have heard some people (especially safety people) say “I don’t care about bugs – I care about risks”. They are nominally right, but the full picture is more nuanced:

Bug hunting (e.g. in CDV) is done by (smartly) running many scenarios while activating various checkers. A checker may issue an error message when some KPI exceeds its threshold, when some internal C++ assertion fires (say in the ADS planning module), or when some other “seemingly bad thing” happens.
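For instance, a KPI checker might look something like this toy Python sketch (the names, threshold and trace format are hypothetical):

```python
# Toy sketch of a KPI checker (names and thresholds are hypothetical).
# A real checker would subscribe to live simulation streams; here we
# just scan a recorded trace of minimum time-to-collision (TTC) values.

MIN_TTC_THRESHOLD = 1.5  # seconds; below this we flag a potential bug

def check_min_ttc(trace):
    """Yield an error message for every frame whose min TTC is too low."""
    for t, min_ttc in trace:
        if min_ttc < MIN_TTC_THRESHOLD:
            yield f"t={t:.1f}s: min TTC {min_ttc:.2f}s below threshold"

trace = [(0.0, 4.2), (0.5, 2.1), (1.0, 1.2), (1.5, 0.9)]
for msg in check_min_ttc(trace):
    print("CHECKER ERROR:", msg)   # a bug to debug -- not yet a proven risk
```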

Take that C++ assertion firing: It probably indicates an ADS bug (the planning module reached some illegal state). However, it does not necessarily indicate an immediate risk: Hopefully there is enough redundancy in the system, so that a bad state won’t do any real harm. But you can’t be sure until you debug the problem.

So bugs are not necessarily risks: they are risk factors which may result in a risk – which is why we look for them.

Not all risks are created equal: ADS companies are (rightfully) particularly concerned about ADS making mistakes which a human driver would never make. Even if these are rare, public opinion (and juries) are bound to view those very negatively.

Note that these mistakes mostly appear in area C of the map above.

Residual risk is just one thing to consider: How do you decide your ADS is ready to ship? Residual risk computation (which is the main topic of this post) is just one component of that.

People mostly use a safety case to judge the overall readiness of their ADS. While not everybody is convinced that safety cases actually “work”, they are the most common way to gather all the information / arguments for why your ADS is safe (see “The case for safety cases” in this post).

Second installment coming Real Soon Now

In this post I brought up various aspects of RR, and described a process for estimating it. I also claimed there is currently no fully-automated way to estimate RR. However (gasp) I could be wrong about that. So I decided to dig again into this area.

A subsequent post (coming real soon) will discuss what I found when I researched two fast-moving areas (both of which have interesting lessons for ADS verification):

  • Techniques for “continuous-risk” optimizations (importance sampling, adaptive stress testing, ML techniques and so on)
  • Techniques from (security-related) fuzzing research

Sneak preview: The answer to “can you estimate ADS RR automatically” is still “no”. However, there is lots of promising research with implications for the general ADS verification problem.

Notes

I’d like to thank Rahul Razdan, Andreas Zeller, Marcel Böhme, Amiram Yehudai, Mark Koren, Ritchie Lee, Valentin Nikonov, Gil Amid, Ziv Binyamini, Justina Zander, Ahmed Nassar and Sharon Rosenberg for their help in researching this topic.

 

