Bridging AV verification and AV regulation

Summary: In this post I’ll describe my impressions from the ASAM OpenSCENARIO workshop. I’ll then use that as an excuse to discuss a related topic: Many people agree that scenarios are a good way to check Autonomous Vehicles (AVs) for safety. Some of these people have thorough verification in mind, while others have regulation in mind. Can we bridge these needs?

I attended the ASAM OpenSCENARIO kickoff workshop on 13-Nov, and found it really interesting. Below, I’ll talk about:

What’s OpenSCENARIO
What I learned at the workshop
Some terminology issues
Verification vs. regulation
What regulators want
How many scenario runs are needed for regulation
Why the push for deterministic regulatory tests is problematic
Why scenario definitions should have a dual interpretation
Why regulation should start with a safety case

This is a somewhat longer post than usual (and also mentions my company, Foretellix, more than I usually do in this blog). I tried to convey the general structure of this complex space, where the topics of verification, regulation, standards, terminology and language are sort of tied together.

OK, here we go:

Some OpenSCENARIO background

I previously discussed OpenSCENARIO (OSC for short) in this post, and said:

One way to represent a concrete scenario is via OpenSCENARIO – an XML-based format proposed by Vires … it was mainly suggested as an intermediate format, which could be interpreted by the various execution platforms (SW in the loop simulators, HW in the loop simulators, test track setups and so on). This sounds like a good idea.

In that post I then talked about what I liked about OSC, and what could still be improved (things like support for a clear, user-oriented language, systematic coverage-driven verification, aggressive scenario mixing and on-the-fly generation).

Well, Vires (creators of OSC) have since donated it to ASAM (a standards body), and ASAM talked to some industry experts, and published an initial list of requirements (summary here). So I decided to go to that kickoff workshop (along with about 130 other people in-person and on-line – there was quite a bit of interest).

What I learned at the workshop

It was a pretty good meeting. Here are some of the things I found interesting (note that all the presentations are on the workshop’s website):

Marius Dupuis, godfather of OpenSCENARIO (and OpenDRIVE, and OpenCRG, since you asked) kicked off the workshop with an inspirational presentation. He called for vision, but prefaced this by quoting Helmut Schmidt saying “People with visions should go to the doctor”.
Folks from Daimler and BMW talked about the need for a high-level language, for clarity and for object orientation.
There were several interesting presentations about interfacing to various languages at runtime, about bridging the static and the dynamic, about turning recordings into scenarios and so on.

I also gave a presentation, talking just about the need for a concise, composable, measurable and portable language (even with just that, my presentation may have been too packed).

Regarding portability, I made the point (on slide 10) that the same scenario definition should be portable across

Execution platforms: Specific simulators, specific test tracks, …
Test configurations: Sensor-bypass vs. full sensor simulation, …
ODDs: Adapt to operating conditions, country rules and conventions, …
Stakeholders: OEMs, subsystem creators, regulators, …
Usage modes: Fully-random vs. deterministic, …

Subsequent discussions highlighted the fact that people consider portability to be highly-desirable but pretty hard to achieve (more on that below).

Anyway, I had many discussions with the various stakeholders (both during the workshop and before/after), and I am now cautiously optimistic about this process. Here is why (note that this is my own intuition, and I could be wrong):

There seems to be a rough consensus (though by no means a unanimous view) that OSC 2.0 abstract scenarios should be written in a true, object-oriented Domain Specific Language (from which you can still derive concrete scenarios in the OSC 1.0 XML format).

And many of the requirements seem to match the requirements that we (Foretellix) are targeting with our Scenario Description Language (SDL) – references to those requirements are marked in green in my presentation. Note that OSC 2.0 has lots of deep requirements which are not directly language-related (but which an object-oriented framework could help modularize).

Also, from the little I have seen, the process seems fairly efficient and ego-free. Yes, this could still be bogged down in politics or take too long (or be torn by dilemmas like regulation-vs.-verification – see below), but current indications are that we can escape this fate.

So we (as in Foretellix) decided to put some skin in the game. We’ll participate in the “proposal workshop” (currently set for 17..18-January in Munich) and see how we can contribute to that process.

Much is still uncertain. We (Foretellix) clearly need to move as quickly as possible with customers – this is the best way to get a better understanding of the needs and improve SDL. But we are also determined to make SDL open, and in that context we are going to invest some serious work in that OSC 2.0 process. Wish us luck.

Note that there are several AV-scenarios-related initiatives which we are currently tracking – it is just that right now OSC is the most, well, open. The discussion below is probably relevant to any format / language / methodology for specifying AV scenarios.

Local variations

Next, I’d like to talk about how all this relates to verification and regulation. AV regulation may play out differently in different countries, so let me first segue into the lighter topic of differences in terminology.

I wrote before about how verification-related terminology differs between the various verification “tribes” (with sometimes-amusing consequences). I am trying to keep this terminology page updated as a counter-measure.

Well, it turns out that there are lots of variations even within the AV-verification tribe (probably because it is fairly new). E.g. AVs are also called SDCs (Self Driving Cars) and other names.

The thing being tested, which other domains call the DUT (or DUV, or SUT – see that link), is often called the VUT here (for Vehicle Under Test), but many people call it the “Ego” or “Ego vehicle”. I think this started out as a mainly-European thing, but is now spreading. I got raised eyebrows once in the US when I said “and in this scenario the ego has to decide whether to turn left or right” (as in “I thought you were talking technology, not psychology”). It is also simply called “the AV” or “the SDC” (or “host” or even “hero”).

There is also no fixed terminology to describe the agents around the VUT (which we control so as to cause a scenario to happen). Some call the other cars “targets” (this may be a residue of simpler times when most testing was “don’t hit that single car in your lane”, or may be military-speak for “the target we are sensing with our Radar”). Other people call them “NPCs” (Non-Player Characters – the computer-gaming term for the characters in the game which are not controlled by a human player). We use “NPCs” or “agents” internally.

Speaking of computer games, the SF Bay Area seems a bit more into using the Unreal gaming engine for simulating the environment around the AV, while the rest of the world is a bit more into Unity.

Let’s now turn to the more serious differences in the culture of regulation. Historically, the US is more into self-regulation, while Europe is more into “here is exactly how you should do it”. And then there are China, Japan and all the others – you can’t really generalize much about “what regulators want”.

Nevertheless, in the next chapters I’ll ignore all that, and talk about:

Verification vs. regulation

Suppose you are the person responsible for ensuring safety in an AV company. You clearly have a tough job. Two of your main concerns are:

Verification: How to use your substantial (but still limited) human, compute and physical (e.g. car fleet) resources to find bugs and improve safety as quickly as possible
Regulation: How to do the right thing so as to pass the regulation hurdles facing your company (in your target geographies), now and in the future

The needs of verification and regulation are somewhat-similar, but they are clearly not the same. And both are extremely important. BTW, OSC is presumably meant for both.

Let me start with verification, which is probably the bigger part. Also, I feel on safer ground there (and indeed this is the main topic of this blog). I am fairly sure it should be done using something like the coverage-driven flow in Fig. 1:

[Fig. 1: The coverage-driven verification flow]

Essentially (clockwise from the top left corner) one should:

Create a verification plan corresponding to the various risk dimensions
Write abstract scenarios for each such risk dimension
Use those abstract scenarios (and their mixes) to generate many different concrete scenarios
Run those concrete scenarios on the various execution platforms while monitoring the results and collecting coverage (mainly scenario functional coverage)
Analyze the coverage from all those runs (and also how well the VUT performed)
Repeat
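The loop above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the cut-in scenario, its parameter ranges, the coverage bins and the run_on_platform stand-in are all hypothetical (they are not SDL or OSC constructs), and a real flow would dispatch runs to actual simulators or test tracks and grade the VUT's behavior.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ConcreteCutIn:
    """One concrete scenario: the abstract cut-in with every parameter fixed."""
    npc_speed_kph: int
    gap_m: int
    side: str


# Hypothetical abstract scenario: the legal range of each parameter.
ABSTRACT_CUT_IN = {
    "npc_speed_kph": range(30, 130),
    "gap_m": range(5, 50),
    "side": ("left", "right"),
}

# Hypothetical functional-coverage goal: every (speed, gap, side) combination.
ALL_BINS = {
    (speed, gap, side)
    for speed in ("slow", "medium", "fast")
    for gap in ("tight", "normal")
    for side in ("left", "right")
}


def generate(rng):
    """Derive one concrete scenario from the abstract one."""
    return ConcreteCutIn(
        npc_speed_kph=rng.choice(ABSTRACT_CUT_IN["npc_speed_kph"]),
        gap_m=rng.choice(ABSTRACT_CUT_IN["gap_m"]),
        side=rng.choice(ABSTRACT_CUT_IN["side"]),
    )


def coverage_bin(scenario):
    """Map a concrete scenario to the coverage bin it exercises."""
    if scenario.npc_speed_kph < 60:
        speed = "slow"
    elif scenario.npc_speed_kph < 100:
        speed = "medium"
    else:
        speed = "fast"
    gap = "tight" if scenario.gap_m < 15 else "normal"
    return (speed, gap, scenario.side)


def run_on_platform(scenario):
    """Stand-in for a simulator or test-track run; it always passes here."""
    return True


def cdv_loop(seed, max_runs=10_000):
    """Generate and run scenarios until every coverage bin has been hit."""
    rng = random.Random(seed)  # seeded, so a failing run can be reproduced
    covered, failures, runs = set(), [], 0
    while covered != ALL_BINS and runs < max_runs:
        scenario = generate(rng)
        if not run_on_platform(scenario):
            failures.append(scenario)
        covered.add(coverage_bin(scenario))
        runs += 1
    return runs, covered, failures
```

Calling cdv_loop(seed=1) returns the number of runs it took, the bins that were hit and any failing scenarios. The "Repeat" step then corresponds to looking at what is still uncovered (or failing) and refining the plan and the abstract scenarios accordingly.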

Of course, there is much more to coverage-driven verification, e.g.:

How to define good coverage, and how to push test runs so as to reach it (see e.g. this post)
How to check (grade) the behavior of the VUT during any specific scenario. This is a really tough topic, and I promise to return to it in a future post.
How to maintain a productive pipeline of creating, executing, analyzing and debugging test runs at scale. When you want to run many interesting, different runs in parallel (Waymo currently has about 20k running at any given time), issues like repeatability and stability start to bite hard.

I am not going to dive into these topics now, but for thorough verification they need to be handled.

Let’s turn now to:

What regulators want (and should they want more)

What does one need for regulation/certification? People have very different answers for that, ranging from “run just 50..100 very deterministic, Euro-NCAP-style physical tests” to “do full verification (as described above) and show it to the regulator”.

It is not surprising that there are many views in this new, complex space. Position papers on how to ensure safety seem to pop up every second week now (e.g. see this, this, this, this and this). Some of the big regulation-related questions in the air are:

What should be done in simulation vs. test tracks vs. street driving
What scenarios should be tested
What metrics should be used (for checking that the VUT did the right thing)
How many scenario runs should be required
How can we guarantee that all tested organizations interpret the tests identically

These are all big, interesting questions (and not just for regulators: an OEM trying to compare multiple AV-stack offerings may ask similar questions).

I’ll address below just the last two:

How many scenario runs should be required for regulation? For verification we probably need, say, 1B scenario runs (several of those could happen in a single test run). But how about for regulation? I hear two main answers for this (and as you may imagine, I prefer answer 2):

Answer 1: Run, say, 100 specific scenarios in a physical test track, and 10K specific scenarios in simulation (don’t take the exact numbers too seriously)
Answer 2: Like the above, but also show the regulators the coverage from the 1B scenario runs done for verification (and let them audit the verification plan, the coverage reports, and even randomly-selected specific runs)

Small numbers of scenario runs are bad, for all the usual reasons (they don’t get to edge cases, people may optimize for them, etc.). I talked a bit about that here, where I described “expected” and “unexpected” bugs, and said:

A further complication is that there is no fully-agreed-upon dividing line between verification (which tries to find all bugs) and certification-by-regulators. Some people express the view that it is OK for regulators to only look for expected bugs, assuming that AV manufacturers will deal with “unexpected bugs” like SW errors (using different techniques).

However, perhaps a better direction might be:

Have a general verification framework, capable in principle of finding the full range of bugs
Have AV manufacturers use it to catch the full range of bugs
Have certification authorities use it to perform/assess a subset of the verification
Have certification authorities (or an independent body or some other group) at least review the more complete verification done by the AV manufacturers

I still think this direction (“one holistic framework for both verification and regulation”) is better, but it is non-trivial to achieve, and involves subtleties. For instance, verification often calls for a flexible way to add white-box (VUT-internal) coverage, but regulators will normally want to ignore that.

How to guarantee consistency between tested organizations? This is tougher than you may think, and came up quite a bit in that ASAM workshop (under the guise of “reference implementation” and “determinism”). Let me expand on that:

The promise (and peril) of determinism

Suppose you have a set of N specific concrete scenarios (i.e. scenario definitions + parameter values), and you want multiple testing organizations to run them in exactly the same way against multiple VUTs. In other words, you want to expose the VUTs to exactly the same challenges, so you can compare the results.

If N is small, and all the tests are simple Euro-NCAP-style deterministic tests, then you can (barely) do it using current methods, where you specify precisely the setup and the movements of the (few) agents ahead of time. But those methods will not work when N is 1B (or even 10K). For them to work:

The scenario creation tools should all behave identically
The runtime behavior of all agents should be identical
The simulators should all behave identically

These requirements are not really practical (except for those very simple, sterile tests).

First, the VUT itself is not really controllable (as far as the testing system is concerned). That’s why we call it autonomous, after all. Given the same set of inputs, one VUT may decide to continue as-is, while another may decide to drive slower (or turn left), ruining your well-prepared, deterministic test setup.

Also, the “world simulators” are all different, and that is actually a good thing: For instance, some of them use Machine Learning (or other complicated control algorithms) to emulate “natural” NPC behavior, so you can’t just tell an NPC “turn left now” – at best you can tell it “try to turn left ASAP”. Yes, you could request that they have a “simple, sterile mode” for running your more deterministic tests, but if you use that mode a lot, you lose much of their value.

And then there are all the problems related to repeatability and behavior stability (described here). Indeed, at that OSC workshop the topic of a “reference implementation” seemed like a hot potato that nobody really wanted to own.
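To make the first point above concrete (a VUT that drives slower ruins a time-scripted setup), here is a minimal sketch. All the numbers and function names are hypothetical: an NPC in the adjacent lane starts 40 m ahead of the VUT at 15 m/s and is supposed to cut in. The "scripted" NPC changes lanes at a fixed time, while the "reactive" NPC changes lanes once the gap to the VUT shrinks to a target value.

```python
def gap_at(t, vut_speed, npc_speed=15.0, initial_gap=40.0):
    """Longitudinal gap (m) between NPC and VUT at time t, constant speeds."""
    return initial_gap + (npc_speed - vut_speed) * t


def scripted_cut_in_gap(vut_speed, cut_in_time=5.0):
    """Open-loop NPC: changes lane at a fixed time, whatever the VUT does."""
    return gap_at(cut_in_time, vut_speed)


def reactive_cut_in_gap(vut_speed, target_gap=20.0, dt=0.5, horizon=60.0):
    """Closed-loop NPC: changes lane at the first step where the gap is
    at or below target_gap. Returns None if that never happens."""
    steps = int(horizon / dt)
    for i in range(steps + 1):
        gap = gap_at(i * dt, vut_speed)
        if gap <= target_gap:
            return gap
    return None


# Two hypothetical VUTs facing the "same" scenario definition.
for vut_speed in (20.0, 17.0):
    print(vut_speed, scripted_cut_in_gap(vut_speed), reactive_cut_in_gap(vut_speed))
```

With these numbers, the scripted NPC cuts in at a 15 m gap in front of the faster VUT but at a 30 m gap in front of the slower one, so the two VUTs do not face the same challenge. The reactive NPC cuts in at 20 m in both cases, but at different times (and it never cuts in at all if the VUT is slower than the NPC), so the execution sequences are no longer identical.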

Why scenario definitions should have a dual interpretation

So, if you want to compare AV implementations (e.g. to see that they meet the same set of regulations), and you are not willing to settle for just a few deterministic tests, what can you do? Here is what I would suggest (and this is indeed what we are doing in our SDL):

The language describing the scenarios should have a dual interpretation:

Active: How to (try to) cause that scenario to happen
Passive: How to monitor and see that this scenario indeed happened

The language should have precise semantics, corresponding to the passive (monitoring) definition: The determination that scenario S just happened with parameter values P should be uniform, and execution-platform independent. It is, however, OK if the same scenario definition causes somewhat-different execution sequences when run against different simulators / VUTs (especially for complex scenarios).
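Here is a minimal Python sketch of what such a dual interpretation could look like for a simple cut-in. It is not SDL: the class, the field names and the trace format are all assumptions made up for this illustration. The point is that plan() is only a best-effort request to the execution platform, while detect() is a precise rule that depends only on the recorded trace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One time step of a recorded trace (from any simulator or a street log)."""
    t: float         # seconds
    npc_lane: int    # lane index of the other car
    vut_lane: int    # lane index of the vehicle under test
    gap_m: float     # longitudinal distance from VUT to NPC (positive: NPC ahead)


@dataclass(frozen=True)
class CutIn:
    """Hypothetical cut-in scenario with one parameter."""
    max_gap_m: float

    def plan(self):
        """Active: a best-effort request for the platform to make it happen."""
        return {"npc_goal": "change_into_vut_lane",
                "when_gap_below_m": self.max_gap_m}

    def detect(self, trace):
        """Passive: return the gap at which a cut-in happened, or None.

        A cut-in is the first step where the NPC enters the VUT's lane
        while ahead of it and within max_gap_m.
        """
        for prev, cur in zip(trace, trace[1:]):
            entered = (prev.npc_lane != prev.vut_lane
                       and cur.npc_lane == cur.vut_lane)
            if entered and 0 < cur.gap_m <= self.max_gap_m:
                return cur.gap_m
        return None


cut_in = CutIn(max_gap_m=15.0)

# Traces from two hypothetical platforms: different timing, same scenario.
trace_a = [Sample(0.0, 2, 1, 30.0), Sample(1.0, 2, 1, 18.0), Sample(2.0, 1, 1, 12.0)]
trace_b = [Sample(0.0, 2, 1, 30.0), Sample(0.5, 2, 1, 25.0),
           Sample(1.0, 2, 1, 20.0), Sample(1.5, 1, 1, 14.5)]
# A run where the NPC never managed to change lanes.
trace_c = [Sample(0.0, 2, 1, 30.0), Sample(1.0, 2, 1, 18.0)]

print(cut_in.detect(trace_a), cut_in.detect(trace_b), cut_in.detect(trace_c))
```

The two platforms produce different execution sequences, yet detect() reports a cut-in in both (at gaps of 12.0 and 14.5 m, which is what coverage would record), and reports None for the run where the scenario never materialized.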

Another upside of having precise monitoring semantics is that you should be able to monitor / cover / check even scenarios created by other means (e.g. by interpreting a recording / log of actual street driving).

I hope I was able to convince you that this dual interpretation of scenarios is a pretty important point. Yes, scenario creation tools should also have a “deterministic mode” (the SDL tool does), but this is simply not enough except maybe in the simplest cases.

Let me end this post on a lighter (?) note, discussing another often-misunderstood regulation-related topic:

The case for safety cases

AV regulation is complicated. I wrote elsewhere that it will probably include ISO-style “process” standards (such as ISO 26262 and SOTIF), and perhaps formal components like RSS.

One important point about regulation is that the top level is not the verification plan: It is the safety case. A safety case is “a structured argument, supported by evidence, intended to justify that a system is acceptably safe for a specific application in a specific operating environment”. There are several safety-case notations (the most popular of which seems to be GSN).

Now, if you come from the verification/testing side of the family (like me), and you are somewhat impatient (like me), you may be tempted to say “well, ‘safety case’ seems to be just a fancy / bureaucratic name for the verification plan, which we use anyway to define (and track) the various scenarios to be run”.

You would be wrong, though. The safety case sits above the verification plan, and some of the things it talks about are outside the scope of the verification plan. Let me clarify that via an example:

Consider the (substantial) risks resulting from variability in AV maintenance. Your verification plan should clearly contain a (pretty prominent) section devoted to that risk dimension, and that section should have scenarios which test how the AV behaves even when (say) a camera is mounted slightly off, within some defined tolerance. And you should probably run many random variants of these scenarios (hopefully mixed with lots of other scenarios like unusual-human-behavior, low-tire-pressure, severe-weather etc.).

But the corresponding safety case section should not just point at that section in the verification plan: It should also say things like “Here is the contract with the company maintaining our AVs, here are the contractual / procedural measures we are taking to make sure cameras are always re-mounted within those defined tolerances, and here is why we think this is enough.”
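One way to picture the difference is as a data structure. The sketch below is a loose illustration only: it is not GSN (or any other real safety-case notation), and the class names, the "verification" / "process" labels and the example wording are my own assumptions. It just shows a claim whose supporting evidence includes, but is not limited to, a section of the verification plan.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    kind: str          # hypothetical labels: "verification" or "process"
    description: str


@dataclass
class Claim:
    statement: str
    evidence: list = field(default_factory=list)
    subclaims: list = field(default_factory=list)

    def evidence_kinds(self):
        """Collect the kinds of evidence supporting this claim and its subclaims."""
        kinds = {e.kind for e in self.evidence}
        for sub in self.subclaims:
            kinds |= sub.evidence_kinds()
        return kinds


maintenance = Claim(
    statement="Variability in AV maintenance does not make the AV unsafe",
    subclaims=[
        Claim(
            statement="The AV behaves safely with a camera mounted "
                      "slightly off, within the defined tolerance",
            evidence=[Evidence("verification",
                               "Verification plan section on maintenance "
                               "variability, with coverage from random runs")],
        ),
        Claim(
            statement="Cameras are always re-mounted within the defined tolerance",
            evidence=[Evidence("process",
                               "Maintenance contract and procedural measures")],
        ),
    ],
)

print(sorted(maintenance.evidence_kinds()))
```

The verification plan supplies only the "verification" evidence; the "process" evidence has to come from elsewhere, which is why the safety case sits above the plan.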

This is just to highlight that AV verification and AV regulation are not the same. This post advocates aligning them (and the related languages, methodologies etc.) as much as possible, but this requires careful thought.

Notes

This ended up a longer post than usual. I hope I have been able to clarify some of the considerations involved in creating a framework / language intended for both verification and regulation.

Comments are very welcome.

I’d like to thank Kerstin Eder, Ziv Binyamini, Daniel Meltz, Roberto Ponticelli, Steve Glaser, Thomas (Blake) French, Zeyn Saigol, Amiram Yehudai, Eran Sandhaus and Christian Gnandt for commenting on earlier versions of this post.


The Foretellix CTO Blog – AI safety

Now focusing on AI safety (autonomy-related posts go to the company blog)