Summary: This post explains why Monte Carlo simulations (which use the expected distribution) will most likely not get you to safe Autonomous Vehicles. It also describes what I learned at the ASAM OpenSCENARIO workshop in Munich.
Several people I talked to lately assumed that AV verification should mostly be done using Monte-Carlo simulation. That term usually means doing random simulations using the typical, expected real-life distribution of everything: For AVs, that would mean using the expected distribution of speeds, maneuver types, road types, car densities, pedestrian walking patterns and so on.
I talked in the past about why using (mainly) the expected distribution is not a good idea. Let me try to convince you of that using a short summary and a picture.
Just to clarify: I am referring here to the common usage of the term “Monte-Carlo”. Some people use “Monte-Carlo” to mean “any simulation involving randomness”, and I certainly have nothing against that. Also, there are useful Monte-Carlo-based sampling and search techniques (like MCMC), and I am all for them. Finally (and before you ask) I have nothing against Monte Carlo itself – presumably a fine place with world-class casinos and a friendly populace of 3500 mostly-innocent souls.
OK – back to the issue at hand:
On the surface, using the expected distribution does make sense: Shouldn’t you mainly test the situations in which the AV will spend 95% of the time?
Well, no.
Longer answer: Initially you should indeed do that. When your AV logic is brand-new, naturally you should make sure it can smoothly do all the basic stuff: Turning, merging, obeying traffic lights, navigating obstacles and so on.
But once that’s achieved, you should shift gears and concentrate mostly on the remaining 5%. Because by then the remaining 5% will be a much more dangerous place than the initial 95%, and most of the mis-behaviors / bugs exhibited by your actual deployed AV fleet will probably occur there.
That’s because the remaining 5% holds a huge, diverse and very-lightly-tested behavior space. Things like:
- A woman walking a bike across a dark, do-not-cross stretch of a multi-lane highway
- Behavior during (and directly after) accidents
- A million other rare-but-dangerous situations
In other words, while the AV only spends about 5% of its time in that space, that space is full of potentially-critical situations, some of which the AV designers perhaps did not even specify.
And a big AV fleet is bound to encounter those situations on a regular basis. And situations which were not tested are bound to contain bugs.
Let me illustrate that with a picture:

[Figure: the small, heavily-tested “expected” 95% region vs. the huge, lightly-tested 5% space of rare-but-dangerous situations]
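To make that concrete, here is a toy back-of-the-envelope sketch in Python (the situation counts and probabilities are numbers I made up purely for illustration): even a million pure Monte-Carlo runs, sampled from the expected distribution, will leave most sufficiently-rare situations completely untested.

```python
import numpy as np

# Toy illustration (invented numbers): suppose the "remaining 5%" contains
# 10,000 distinct rare-but-dangerous situations, each showing up in roughly
# 1 of every 10 million runs under the expected real-life distribution.
rng = np.random.default_rng(0)
n_situations = 10_000
p_each = 1e-7        # per-run probability of hitting a given rare situation
n_runs = 1_000_000   # one million pure Monte-Carlo runs

# How many times does each rare situation get exercised?
hits = rng.binomial(n_runs, p_each, size=n_situations)

print(f"rare situations never exercised at all: {np.mean(hits == 0):.0%}")  # ~90%
print(f"average hits per rare situation:        {hits.mean():.2f}")         # ~0.1
```

Biasing the generation towards those situations (which is what the rest of this post is about) flips that picture.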
So overall, most of your simulation cycles should probably go to that 5%. To do it efficiently, you need the right tools (for creating and mixing scenarios, defining and collecting coverage and so on). But more importantly, you need the right methodology and frame of mind: You can’t just say “yes, we certainly plan to also do some corner-case testing”.
It’s a process
You should probably take a look at Coverage-Driven Verification (CDV). One of the things CDV tries to do is to distribute the testing budget (human resources, simulation cycles and so on) according to the relative “size” of the risk dimensions (and corresponding scenario spaces), while using various search techniques for efficiency.
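As a rough sketch of that budget-distribution idea (the risk dimensions, their weights and the simple proportional-allocation rule below are my own invented placeholders; a real CDV flow would refine them iteratively based on coverage feedback):

```python
# Minimal sketch, assuming we can attach a rough "size" and risk weight to each
# scenario space; all names and numbers here are invented for illustration.
scenario_spaces = {
    # name: (estimated relative size of the space, estimated risk weight)
    "nominal_driving":       (1.0, 0.2),
    "pedestrian_edge_cases": (4.0, 3.0),
    "sensor_failures":       (2.0, 2.5),
    "accident_aftermath":    (1.5, 2.0),
}
total_sim_hours = 10_000  # whatever simulation budget you actually have

weights = {name: size * risk for name, (size, risk) in scenario_spaces.items()}
total_weight = sum(weights.values())

# Allocate simulation hours proportionally to (size x risk).
for name, w in weights.items():
    print(f"{name:<22} {total_sim_hours * w / total_weight:8.0f} sim-hours")
```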
There is, of course, a process to this: Assuming you have a large library-of-dangerous-things (crazy_driver_crossing_in_red, sensor_failure, …), you should initially mix in just one or a few of those, keeping the other ingredients (speeds, roads etc.) mostly-typical. You then gradually turn on the heat (at a rate which is mainly determined by the capacity of your verification / engineering team to cluster, understand and debug the resulting runs and failures). And individual engineers may want to run complicated mixes early, to stress new features / bug fixes.
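Here is a minimal sketch of that “gradually turn on the heat” schedule (the extra ingredient names, parameter ranges and pacing are hypothetical; the real pacing should track your team’s debug capacity):

```python
import random

DANGEROUS_INGREDIENTS = [   # the first two are from the post; the rest are made up
    "crazy_driver_crossing_in_red",
    "sensor_failure",
    "pedestrian_jaywalking_at_night",
    "sudden_tire_blowout",
]

def generate_mix(round_number: int, rng: random.Random) -> dict:
    """Build one scenario mix for a given regression round.

    Early rounds inject a single dangerous ingredient and keep everything else
    (speeds, roads, densities) mostly typical; later rounds combine more
    ingredients and widen the other parameters.
    """
    n_ingredients = min(1 + round_number // 2, len(DANGEROUS_INGREDIENTS))
    return {
        "ingredients": rng.sample(DANGEROUS_INGREDIENTS, n_ingredients),
        "speed_kph": rng.gauss(50, 5 + 5 * round_number),  # widen the spread over time
        "road_type": rng.choice(["urban", "highway"] if round_number < 3
                                else ["urban", "highway", "construction_zone"]),
    }

rng = random.Random(0)
for rnd in range(5):
    print(rnd, generate_mix(rnd, rng))
```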
Note: All these things (including the jokes about the “last 5% taking another 95% of the work”) are old news to people who routinely do complex verification: People who design e.g. CPU chips simply assume it is always so (and they pre-allocate more than half the budget to verification).
Some final notes on this:
- Some people defend Monte-Carlo by saying that it is the only way to estimate the probability of failure. But for complex systems with lots of SW this is not true either – see the chapter “Estimating SW failure rate” here.
- Expected distributions are often useful for grading – i.e. computing how well the AV did in a particular scenario. That’s because the “right” behavior is often probabilistic (and thus the same AV behavior may get very different grades in Boston and in Bangalore – see the toy sketch after these notes).
- Regulators historically tend to prefer physical tests, which (naturally) don’t stray enough into the 5%. Creating a framework which is useful for both regulation and verification is tough – see my last post.
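For illustration, here is a hypothetical grading sketch along those lines (the gap distributions are invented; only the two city names come from the note above):

```python
# Hypothetical sketch: grade the same AV behavior (accepting a 2-second merging
# gap) against the locally-expected gap distribution, so the identical behavior
# scores differently in Boston and in Bangalore.  All numbers are invented.
LOCAL_GAP_DISTRIBUTIONS = {  # city: (mean, stddev) of typically-accepted gaps, seconds
    "Boston":    (2.5, 0.6),
    "Bangalore": (1.2, 0.4),
}

def grade_gap(gap_seconds: float, city: str) -> float:
    """How many standard deviations the AV's gap sits from the local norm."""
    mean, std = LOCAL_GAP_DISTRIBUTIONS[city]
    return (gap_seconds - mean) / std

av_gap = 2.0  # the AV accepted a 2-second gap before merging
for city in LOCAL_GAP_DISTRIBUTIONS:
    print(f"{city}: {grade_gap(av_gap, city):+.2f} sigma from the local norm")
```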
ASAM impressions – take 2
In that last post I also mentioned my impressions from the November ASAM OpenSCENARIO workshop. Well, I just attended the follow-on workshop (in Munich), and I am still cautiously optimistic (in fact, somewhat more optimistic) about that process.
There was a very diverse set of views, but the overall atmosphere was friendly and practical, and there was general agreement on the urgency of creating an improved scenario framework.
There were several good discussions about the need for a scenario language (perhaps in the spirit of our Scenario Description Language), and there seems to be some agreement that such a language could provide a good unifying structure for that project.
I’ll keep you posted on how it goes.
Notes
I’d like to thank Ziv Binyamini, Roberto Ponticelli, Zeyn Saigol, Gil Amid, Amiram Yehudai and Christian Gnandt for commenting on previous drafts of this post.