Summary: Sooner or later, people dealing with AV (Autonomous Vehicle) verification encounter the difficult topic of how coverage, KPIs (Key Performance Indicators) and performance grades relate to each other. Also, there is often confusion between “Did this scenario happen” and “Did the AV perform well in this scenario” (and the fact that “KPIs” are sometimes used in both contexts just makes it worse). This post tries hard to clarify all that, and to suggest some (hopefully useful) terminology.
When doing heavy-duty, scenario-based AV verification, people often want to assess “where we are”. This has two main parts:
- Coverage evaluation: Which part of the “scenario space” have we exercised our AV in? This is expressed via coverage and an overall coverage grade.
- Performance evaluation: How well did the AV behave in these scenarios? This is expressed via raw KPIs (Key Performance Indicators) and context-dependent performance grades.
Coverage and performance metrics are somewhat similar (and thus easy to confuse). Hence this post.
But first, let me put this in context:
The big picture
This post is about dynamic (mostly simulation-based) AV verification: This involves planning the verification job, running a massive number of scenarios (usually with the help of some constrained-random methodology), analyzing the results, modifying the AV and/or the verification environment accordingly, and repeating the process.
Note that there’s a lot more to AV safety (e.g. constructing safety cases – see the last chapter here). Simplifying greatly, safety people strive to come up with a single “residual risk” number (e.g. some weighted measure of miles / accident) – a very hard thing to do.
However, this post is not about that: It focuses on the day-to-day AV verification flow. Figure 1 tries to capture that flow in one (slightly dense) picture:
Note that each test (also called a test-case) specifies one or more scenarios to be run. A run is what happens when you execute a test (with some specific random seed) on some testing / execution platform (e.g. SW simulation, Hardware In the Loop, a test track, etc.).
The dashed back-arrow represents the human-directed (and sometimes partially-automated) feedback loop which drives the system towards higher coverage and more bugs.
With this big picture in mind, let’s get back to the coverage and performance metrics:
Coverage evaluation
Coverage (especially scenario functional coverage) measures “what we have tested so far”: Which scenarios were actually exercised, and for each such scenario what were the values of its various parameters.
Note that when you define coverage, you decide what the coverage items (parameters) are, into which buckets (groups of values) to split each item, which items to cross, and what weight to give each bucket. This is all captured in a top-down verification plan, which describes what needs to be tested.
Coverage is typically measured in hits-per-bucket, as compared to the desired-hits-per-bucket. This is aggregated over many runs, then aggregated bottom-up over a verification plan hierarchy to give you a total coverage grade between 0 (nothing covered) and 1 (completely covered).
For instance, we may have exercised our overtake scenario 2000 times, but only 500 of those runs hit the bucket “overtake_side == right”, only 90 hit “overtake_speed in [70..80]kph”, and none hit the cross-bucket of these two (an overtake scenario with side equal to right and speed in [70..80]kph). Depending on the weight we give these buckets, we may deem this to be not enough, so we need to fill that coverage hole (testing gap).
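To make the bucket mechanics concrete, here is a minimal Python sketch of how per-bucket hits, goals and weights could roll up into a coverage grade. The bucket names, goal counts, weights and the simple weighted-average formula are all made up for illustration (real coverage tools and methodologies differ):

```python
# Hypothetical bucket definitions: name -> (hits, desired_hits, weight)
buckets = {
    "overtake_side == right":        (500, 1000, 1.0),
    "overtake_speed in [70..80]kph": (90,  100,  1.0),
    "cross: right AND [70..80]kph":  (0,   50,   2.0),  # the coverage hole above
}

def coverage_grade(buckets):
    """Weighted average of per-bucket grades, each capped at 1.0."""
    total_weight = sum(w for _, _, w in buckets.values())
    reached = sum(min(hits / goal, 1.0) * w for hits, goal, w in buckets.values())
    return reached / total_weight

print(f"coverage grade: {coverage_grade(buckets):.2f}")  # 0.35 for the numbers above
print("holes:", [b for b, (hits, goal, _) in buckets.items() if hits < goal])
```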
Note further that there are input and output coverage items, but that distinction is context dependent. For instance, if it is some other car which overtakes our AV, then both overtake_side and overtake_speed are input items. However, if we are talking about our AV overtaking another car (and assuming we cannot control the AV), then these are output items (though we may be able to influence them in various ways).
Defining coverage (and especially defining coverage buckets) is a minor art form involving discipline and intuition. The goal is to maximize “verification” of the scenario space defined by our Operational Design Domain, given our (substantial but still finite) human and compute resources. It should be informed by our understanding of the ODD, our previous knowledge of “what can go wrong” and many other factors. If you have nothing better to do, read all about it in this post.
Performance evaluation
The purpose of performance evaluation is to see how well the AV performed in those runs (along multiple performance dimensions such as safety, comfort etc.).
We also often want a pass/fail indication: For runs which are “bad enough”, we want to emit an error message, indicating that the DUT (Device Under Test) may have an issue, and thus somebody needs to debug this run and decide whether something needs to be done about it.
KPIs, or Key Performance Indicators, are the raw metrics we want to measure to see how well the AV performed. There can be safety-related KPIs (like min-Time-To-Collision or min-TTC, measured in seconds), comfort-related KPIs (like max-deceleration, measured in m/s²), and so on.
Here is one of many surveys regarding various AV-related KPIs. Note that there are some terminology variants in this space (e.g. some people use the term “criticality measures” for safety-related KPIs).
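To make “raw KPI” concrete, here is a much-simplified sketch of computing min-TTC for a single run. It assumes we already have per-timestep (gap, closing-speed) samples for the most relevant other actor; that input format is an assumption for illustration, and real pipelines derive these from full trajectories and handle many more corner cases:

```python
# Simplified, illustrative min-TTC computation over logged samples.
# Each sample is (gap in meters, closing speed in m/s).

def min_ttc(samples):
    """Return the minimum time-to-collision (seconds) over a run,
    ignoring samples where the gap is opening (no collision course)."""
    ttcs = [gap / closing for gap, closing in samples if closing > 0]
    return min(ttcs) if ttcs else float("inf")

run_log = [(30.0, 5.0), (20.0, 8.0), (14.0, 10.0), (18.0, -2.0)]
print(f"min-TTC: {min_ttc(run_log):.2f} s")  # 14.0 / 10.0 -> 1.40 s
```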
Suppose that we ran some scenario 2000 times, and we got some distribution of min-TTC: The worst (shortest) min-TTC we got was 0.7 seconds, and this happened in 50 of those runs. Should those 50 runs produce a DUT error?
Not necessarily: This really depends on context. If our scenario exercised the AV behavior in some extremely challenging situations, perhaps a min-TTC of 0.7 seconds is actually pretty good. On the other hand, if this was a scenario with absolutely no challenges at all, then 0.7 seconds should probably cause a DUT error.
This takes us from raw KPIs to performance grades: A performance grade (also called a “normalized KPI”) is a context-dependent number between 0 (“really bad”) and 1 (“excellent”) which we assign to some aspect of the DUT behavior. This grade is computed using a grading formula, which converts one or more raw KPIs into a grade in a context-dependent way. This is hard work: Somebody must write this formula, taking into account the specifics of the verification task and the relevant “context parameters”.
We may want to grade a scenario run on min-TTC, on safety in general (some combination of the min-TTC grade and other safety grades), on max-deceleration, on comfort in general and so on. People often also like to give an overall performance grade, which is some combination of the specific grades.
And, for each of these grades, there is a (context-dependent) threshold, below which we issue a DUT error.
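To show what “context-dependent” might look like in practice, here is a toy grading formula that maps the raw min-TTC KPI to a grade between 0 and 1, using a per-scenario difficulty label as the context, plus a threshold below which a DUT error would be flagged. The difficulty labels, anchor values, linear ramp and threshold are all invented for illustration:

```python
# Toy, hypothetical grading formula: raw min-TTC -> grade in [0, 1].
ANCHORS = {
    # difficulty: (min-TTC that gets grade 0.0, min-TTC that gets grade 1.0)
    "easy":        (2.0, 4.0),
    "challenging": (0.3, 1.0),
}

def ttc_grade(min_ttc_s, difficulty):
    bad, excellent = ANCHORS[difficulty]
    # Linear ramp between the two anchors, clamped to [0, 1]
    return max(0.0, min(1.0, (min_ttc_s - bad) / (excellent - bad)))

THRESHOLD = 0.3  # itself context-dependent in a real flow

for difficulty in ("easy", "challenging"):
    grade = ttc_grade(0.7, difficulty)
    verdict = "DUT error" if grade < THRESHOLD else "pass"
    print(f"min-TTC of 0.7 s, {difficulty} scenario: grade {grade:.2f} ({verdict})")
```

With these made-up anchors, the same 0.7-second min-TTC fails in the easy scenario but passes in the challenging one, which is exactly the context-dependence described above.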
Performance grading depends on many things:
- The scenario in which the KPIs were measured (as described above)
- The purpose of testing (e.g. if this is developer testing of some SW changes, we may want to get an error message on any degradation which seems related to those changes)
- The maturity of the system under test: If we have 10k failing runs per night, we may want to lower the threshold (so that only the very worst runs are flagged) and deal with those “worst offenders” first
- The relative weight of safety vs. legality vs. comfort etc.
- The country we are in
- Can we even grade a single run (or do we need to integrate over many runs, using MCMC-like techniques)?
- And so on
AV performance evaluation is really hard: It may involve scenario-independent incident analyzers which try to determine whether some critical situations are avoidable. It has to contend with the inherent difficulties of checking probabilistic components (see this post). But somebody’s got to do it.
Coverage and performance analysis
As we have seen, coverage and performance metrics are different but also similar. For each run, we need to store detailed information about (a minimal per-run record is sketched after this short list):
- Which coverage buckets were hit?
- What were the values of the various performance parameters (both raw and normalized)?
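For illustration only (this is not any particular tool’s schema), a minimal per-run record holding both kinds of information might look something like this:

```python
# Hypothetical per-run record combining coverage and performance data.
# Field names and values are invented for illustration only.
run_record = {
    "test": "overtake_test",
    "seed": 12345,
    "platform": "sw_simulation",
    "coverage_hits": {
        "overtake_side == right": 1,
        "overtake_speed in [70..80]kph": 0,
    },
    "kpis": {"min_ttc_s": 0.7, "max_decel_mps2": 4.1},        # raw KPIs
    "grades": {"safety": 0.55, "comfort": 0.80, "overall": 0.62},  # normalized
    "dut_error": False,
}
```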
Finally (as shown in Fig. 1) we need some multi-run analysis tool to give us an aggregate view of the metrics gathered over, say, the 1M runs of the last weekend (a toy aggregation sketch follows the list below):
- Coverage:
- What’s the current coverage grade (overall and specifically for scenario S)?
- What are the main coverage holes for scenario S? How do they cluster (i.e. are there big uncovered areas)?
- Performance:
- What were the values for the min-TTC KPI (overall and specifically for scenario S)? Do those cluster in some interesting way?
- How are we doing on min-TTC grades (overall and for scenario S)? Where is this worse / better than the previous SW release?
- How many runs actually failed with a DUT error (i.e. with a grade below the threshold)? How do they cluster?
- What’s the trend in all of these (relative to the previous week)? Which metrics improved and which degraded?
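Continuing the invented per-run record above, a toy aggregation over many such records might look roughly like this; a real multi-run analysis tool obviously does far more (clustering, trends, comparisons across releases etc.):

```python
# Toy multi-run aggregation over records shaped like run_record above
# (invented schema, for illustration only).
from collections import Counter

def aggregate(runs):
    bucket_hits = Counter()
    min_ttcs = []
    failing = 0
    for r in runs:
        bucket_hits.update(r["coverage_hits"])   # sum hits per coverage bucket
        min_ttcs.append(r["kpis"]["min_ttc_s"])
        failing += r["dut_error"]
    return {
        "total_runs": len(runs),
        "failing_runs": failing,
        "worst_min_ttc_s": min(min_ttcs),
        "bucket_hits": dict(bucket_hits),
    }

runs = [
    {"coverage_hits": {"overtake_side == right": 1},
     "kpis": {"min_ttc_s": 0.7}, "dut_error": True},
    {"coverage_hits": {"overtake_side == left": 1},
     "kpis": {"min_ttc_s": 2.3}, "dut_error": False},
]
print(aggregate(runs))
```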
There’s a lot more to say about coverage and performance metrics, of course – this post mainly tried to clarify terminology. Comments are very welcome.
Notes
I’d like to thank Gil Amid, Stefan Birman, Ziv Binyamini, Kerstin Eder, Justina Zander, Yaron Kashai, Roberto Ponticelli, Ahmed Nassar, Thomas (Blake) French and Rahul Razdan for commenting on earlier drafts of this post.