How to write AV scenarios (and some notes about Pegasus)

Summary: There are several approaches for verifying that Autonomous Vehicles are safe enough. The Pegasus project is one interesting, thoughtful attempt to do that (focusing initially on highly automated driving, not full AVs). In this post I’ll summarize a recent Pegasus symposium, and describe what I like about the approach and what is perhaps still missing. I’ll then use an example scenario to talk about spec bugs (and “unexpected” bugs in general), and their implications for verification frameworks.

What’s the right way to verify / certify AVs? And who should decide? These are open questions with no clear answers yet. So any meeting where regulators, AV manufacturers, simulator vendors and other stakeholders try to come up with a common vision for AV verification should be interesting to watch. And the Pegasus symposium (held in Aachen, Germany last November) was just that (though it centered on the simpler case of highly automated driving, e.g. “highway chauffeur”).

Let me start by confessing that I did not attend it. Still, it looked like an important event, so I went through the slides (and asked around a bit), and came up with this summary. Note that my main topic here is “how to verify and certify AVs” – I am just using the Pegasus approach as a good, well-documented example.

Please take this with an even bigger grain of salt than usual: While I have been tracking the European / German AV verification scene for a while (see e.g. here), I do not have an automotive background, and (like I said) I was not even there. Any comments / corrections are very welcome.

The Pegasus symposium

I wrote about Pegasus here: It is a German effort for developing “a generally accepted and standardized procedure, for the testing and approval of automated driving functions” (here is an overview of the project).

The symposium included various German participants (such as OEMs, Tier 1 companies, simulator suppliers, academics and TÜV SÜD – a certification authority), as well as people from several other countries. There seems to be high interest in this topic.

The emphasis was on Highly Automated Driving (HAD) certification – which is obviously much easier than full AV verification (defined loosely as “the process of finding bugs”). Some companies may already have HAD features, but they don’t turn them on until they can be certified by the proper authority.

I am interested in how various stakeholders (e.g. AV creators and regulators) should verify and certify AVs, and I think Pegasus is also viewing its work as a precursor to that. Clearly HAD-verification is a subset of AV-verification, and certification-by-regulators includes a subset of verification-by-AV-creators (more on this later), but ideally we should have a general framework which works for the whole range.

The symposium had a series of “stands”, going through the rationale of the approach, the challenges, test generation concepts and so on. I encourage you to take a look: This post is already long enough so I will not summarize it all here, but it’s pretty good.

Describing scenarios: According to the Scenario Description presentation, scenarios have a functional view (described in free text), a logical view (with a set of ranges for the “interesting variables”), and a concrete view (with all these variables given concrete values).

One way to represent a concrete scenario is via OpenScenario – an XML-based format proposed by Vires (and by Pegasus in general) as a standard. In the context of this symposium, it was mainly suggested as an intermediate format, which could be interpreted by the various execution platforms (SW in the loop simulators, HW in the loop simulators, test track setups and so on). This sounds like a good idea.

Another suggested standard is the Open Simulator Interface (OSI) – a standard way to define data structures for weather, signs, sensor inputs and so on. The idea is to use OSI to connect together various simulated artifacts produced by different companies (e.g. a simulated model of sensors, the sensor fusion, the planner, the simulated world around the car etc.).

So, for instance, if you are a Radar supplier, you may create an OSI-connected simulation of your Radar if you want an OEM to be able to evaluate it in their full-system simulation. OSI is defined using the efficient, language-neutral protocol buffers format. This sounds like a good, practical idea – take a look at the current OSI definitions to see how they define weather, signs and so on.
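To make the idea concrete, here is a toy Python sketch of how a standardized, language-neutral message format lets independently developed components (a supplier’s sensor model, an OEM’s consumer) talk to each other. This is not the actual OSI schema – the real definitions are protocol buffer files – and all class and field names below are hypothetical:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical message types, loosely inspired by the OSI idea of
# standardized data structures (the real OSI uses protocol buffers).
@dataclass
class Detection:
    distance_m: float   # range to the detected object
    azimuth_deg: float  # bearing relative to the sensor

@dataclass
class SensorData:
    sensor_id: str
    detections: List[Detection]

def radar_model(ground_truth_objects):
    """Toy 'Radar supplier' model: turns simulated ground truth into
    a standard SensorData message any consumer can parse."""
    return SensorData(
        sensor_id="radar_front",
        detections=[Detection(d, a) for (d, a) in ground_truth_objects],
    )

def fusion_consumer(msg: SensorData):
    """Toy 'OEM-side' consumer: depends only on the shared message
    format, not on the supplier's internals."""
    return min((det.distance_m for det in msg.detections), default=None)

msg = radar_model([(42.0, -3.0), (15.5, 1.2)])
print(fusion_consumer(msg))  # nearest detection: 15.5
```

The point of the sketch is the decoupling: as long as both sides agree on the message format, either component can be swapped out without touching the other.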

Is the Pegasus approach enough?

From what I could understand, it is actually quite good. However, I believe it should be extended to address additional verification aspects.

The main issue is the need to look for both “expected” and “unexpected” bugs: It seems Pegasus is mainly looking for “expected” bugs (like sensor fusion too slow or braking not efficient enough), and less for “unexpected” bugs (like SW issues and spec bugs – hazards which cannot be enumerated in advance).

I explained expected vs. unexpected bugs here, saying:

Suppose your AV did not stop fast enough for a cyclist, because of some combination of: vision system too weak, sensor fusion too slow, braking not efficient enough and so on. This is an expected bug: You knew all of these could happen in principle – the bug was that under some circumstances the combined performance was less than you signed for. And to fix it you will have to equip your AV with a better camera, or something.

Note that “expected bugs” is a slight misnomer: The fact that the bug happened in a specific design under some specific conditions was not itself expected – there was just an ongoing suspicion that it might happen, because of the underlying, known issues (like “sensor fusion too slow”). Perhaps “bugs resulting from combinations of cataloged, suspected issues” would have been a better name, but I took pity on you, gentle reader.

Historically, most automotive testing falls under what I call “looking for expected bugs”. But as we move from regular cars to HAD to AVs, more and more of the decisions are going to be automated, and that automation needs to handle all the special cases (which a competent driver encounters rarely, but an AV fleet will encounter many times per day). This should raise the relative importance of unexpected bugs (more on this in the example below).

The Pegasus folks are well aware of this issue: Indeed, this 2016 Pegasus presentation warns of the unknown “frequency of new critical situations generated by automated driving and the capability to control them safely”.

However, the symposium mainly emphasized expected bugs. For instance, this presentation models the possible causes of a collision as a fault tree, i.e. an and/or tree of “obstacle not detected”, “wrong object classification”, “braking fails” and so on (search for “fault tree” in the text). As I described here, fault tree analysis is good at what it does, but does not really extend to SW / spec bugs.

A further complication is that there is no fully-agreed-upon dividing line between verification (which tries to find all bugs) and certification-by-regulators. Some people express the view that it is OK for regulators to only look for expected bugs, assuming that AV manufacturers will deal with “unexpected bugs” like SW errors (using different techniques).

However, perhaps a better direction might be:

  • Have a general verification framework, capable in principle of finding the full range of bugs
  • Have AV manufacturers use it to catch the full range of bugs
  • Have certification authorities use it to perform/assess a subset of the verification
  • Have certification authorities (or an independent body or some other group) at least review the more complete verification done by the AV manufacturers

Ideally, that body should also maintain a growing, public catalog of parameterized AV scenarios which test for all bugs (including spec bugs) known or imagined so far. For instance (as I suggest here) once somebody asks “How should an AV react when it notices a Tsunami advancing towards it”, we should add a generalized rare_advancing_natural_phenomena scenario to that catalog.

Next, I’ll dive deeper into unexpected bugs, using an example scenario. I’ll then discuss some possible enhancements to the current Pegasus direction which might help in finding unexpected bugs (e.g. aggressive scenario mixing and on-the-fly input generation).

Example scenario: Approaching a yellow-stoplight junction

Consider the parameterized yellow_light scenario in Figure 1:

[Figure 1: the yellow_light scenario]

Scenario description: Our AV approaches a stoplight as the light turns yellow, with another car (the “behind car” or Bcar) close behind it. Note that the AV is the Device Under Test (DUT, also called “ego vehicle”), and thus is not under (direct) control of our testing system, but everything else is. Thus, we can change the behavior of the other cars, the stoplight and so on, to match the scenario we want to achieve (within reason).

The AV will either stop before the stoplight or continue and cross the junction in yellow. If it crosses, another car (the “cross car” or Ccar) will accelerate into the junction across the AV’s path, perhaps hitting it.

The main possible outcomes are “AV crosses safely”, “AV hit by Bcar”, “AV hit by Ccar” and “other bad side-effect”. Note that a hit outcome does not necessarily mean a DUT error: We are going to run this scenario many times with different parameters, and in some it is impossible for the DUT not to get hit.

Scenario parameters: There can be many possible parameters to this scenario, such as:

  • AV: Speed and distance from junction when light turned yellow
  • Junction: Dimensions, time between light changes
  • Bcar: Speed, distance and subsequent behavior
  • Ccar: Initial acceleration, how soon after / before green
  • Global conditions: Weather, road conditions, lighting
  • Country rules and conventions: Yellow light conventions, drive on left/right
  • Other things going on: Other cars and people, objects in the junction, etc.
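Tying this back to the functional / logical / concrete views mentioned earlier: the logical view of yellow_light can be sketched as a set of parameter ranges, and a concrete scenario is one sample from those ranges. A minimal Python sketch (all parameter names and ranges are invented for illustration, not taken from Pegasus):

```python
import random

# Hypothetical logical view of yellow_light: each parameter is a range.
# Sampling every range yields one concrete scenario.
YELLOW_LIGHT_LOGICAL = {
    "av_speed_kmh":          (30.0, 90.0),   # AV speed when light turns yellow
    "av_dist_to_junction_m": (5.0, 120.0),
    "bcar_gap_m":            (2.0, 40.0),    # distance of the car behind
    "bcar_speed_kmh":        (30.0, 110.0),
    "ccar_accel_mps2":       (0.0, 4.0),
    "yellow_duration_s":     (3.0, 5.0),
}

def concretize(logical, rng=random):
    """Turn a logical scenario (ranges) into a concrete one (values)."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in logical.items()}

concrete = concretize(YELLOW_LIGHT_LOGICAL)
```

Each call to `concretize` produces one concrete run; the art is in how you distribute those samples (more on this below).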

Coverage: We probably want to run this scenario many times, collecting implementation coverage (e.g. code coverage), and more importantly functional coverage. For instance, we may decide that for the yellow_light scenario, we are going to collect the following functional coverage items:

  • The speed of the AV when the light turned yellow (split into, say, 5 speed “buckets”)
  • The speed of the Bcar relative to the AV at that time (say 3 buckets)
  • The outcome (4 buckets: Crosses-safely, hit-by-Bcar, hit-by-Ccar, other-side-effect)
  • The cross of all of these (5 * 3 * 4 = 60 buckets)
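The bucket arithmetic above can be sketched directly. The toy coverage collector below (names are mine, not from any real tool) shows how the 5 * 3 * 4 cross yields 60 buckets, and how each simulation run fills one of them:

```python
from itertools import product

# Toy functional-coverage model for yellow_light (bucket names invented).
AV_SPEED_BUCKETS  = ["0-20", "20-40", "40-60", "60-80", "80+"]      # 5
REL_SPEED_BUCKETS = ["slower", "similar", "faster"]                  # 3
OUTCOME_BUCKETS   = ["crosses-safely", "hit-by-bcar",
                     "hit-by-ccar", "other-side-effect"]             # 4

# The cross of the three items: 5 * 3 * 4 = 60 buckets, all empty.
cross = {combo: 0 for combo in
         product(AV_SPEED_BUCKETS, REL_SPEED_BUCKETS, OUTCOME_BUCKETS)}

def record_run(av_speed, rel_speed, outcome):
    """Called once per simulation run to update the coverage model."""
    cross[(av_speed, rel_speed, outcome)] += 1

record_run("40-60", "faster", "crosses-safely")
holes = [b for b, hits in cross.items() if hits == 0]
print(len(holes))  # 59 buckets still unfilled after one run
```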

In reality, the list of potential functional items is huge – choosing the right subset and crosses is part of the art of Coverage Driven Verification (more on this below). Here are some of the candidates for functional coverage:

  • The full list of scenario parameters above
  • The “results”: Did the AV stop or go? Was it hit or not?
  • Whitebox coverage taken from the DUT: When did it notice the Ccar moving, etc.
  • Other kinds of coverage such as “criticality coverage” (how close we got to an accident)
  • Which execution platform was used for the run: SIL, HIL, test track, etc.
  • Scenario mixing: Did it run during/just-before/just-after other scenarios like ambulance_arriving, sensor_failure etc.
  • Some crosses of these coverage items

After enough runs have been performed, we may further want to target the simulations so as to fill the remaining “coverage holes” (i.e. the coverage buckets which never/seldom appeared in any of the runs, but which can, in principle, happen).
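One simple way to target the remaining holes is to bias the next batch of runs toward the least-hit buckets. A minimal sketch of such a selection policy (the policy itself is my own illustration; real coverage-maximization tools are considerably more sophisticated):

```python
import random

def pick_target_bucket(coverage, rng=random):
    """Pick a least-hit bucket, so the next generated run can be
    constrained toward it; ties are broken at random."""
    min_hits = min(coverage.values())
    candidates = [b for b, hits in coverage.items() if hits == min_hits]
    return rng.choice(candidates)

# Toy coverage state after some runs: one bucket was never reached.
coverage = {("fast", "hit-by-ccar"): 0,
            ("fast", "crosses-safely"): 7,
            ("slow", "crosses-safely"): 3}
target = pick_target_bucket(coverage)
print(target)  # the only zero-hit bucket: ("fast", "hit-by-ccar")
```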

Checking: Note that a hit outcome does not necessarily mean a DUT error: The AV may have stopped correctly and yet the Bcar failed to stop in time, or the AV may have crossed correctly and the Ccar accelerated into the junction too early.

In general, we want to check that the car behaved “correctly”: That it chose “reasonably” among the behavior options, executed them well, honked if it entered the junction very late (to warn other cars), and so on.

Expected and unexpected bugs: Bad behavior could result from either “expected” bugs (sensor fusion too slow, planner misjudging road wetness etc.) or “unexpected” bugs.

Unexpected bugs are either plain SW bugs, or spec bugs. A SW bug (say an off-by-one or NULL-pointer bug) could cause, say, the braking planner to misbehave, but only when some buffer is full and an urgent car-to-car message comes in.

Spec bugs: A spec bug can result when the AV spec writers / programmers simply fail to take into account the full range of considerations and events that could impact the scenario. Here are just some examples of things that could be overlooked and thus cause a spec bug (though obviously competent design / verification teams are unlikely to miss them):

  1. Before stopping at a yellow light, you should “look back”
  2. Yellow-light laws and conventions are different between countries
  3. You may need to enter a junction even in a red light, if the Bcar is very close and the junction is empty
  4. A close Bcar + bad brakes could cause you to come to a stop in the middle of the junction, in which case you should get out even in a red light
  5. When crossing a junction too late, you should probably honk
  6. When crossing a junction too late, the danger pattern is different in left-driving countries, in countries with unprotected left turn etc.
  7. If the front video cameras stopped working just before the junction and you need to quickly find a safe place to stop, remember that Lidar does not see stoplight colors

Any specific spec bug is highly unlikely, but the potential list is huge. And it grows even bigger as you start extending the yellow_light scenario: See for example the post Verifying how AVs behave during accidents.

Consider also that country-specific rules and regulations may interact with scenarios in complex ways: Say the AV is closely followed by the Bcar, the light turns red and the junction is almost empty: It may be safer to cross on red (1% chance of an accident) rather than stop (50% chance), but if that 1% does materialize, what will the country’s rules say?

Finally, there are all the complex considerations regarding human behavior, which are often locale-specific (see Verifying interactions between AVs and people).

My point here is that even a fairly compact scenario has a huge cloud of potential considerations around it – enough for a whole herd of spec bugs to hide in.

Implications for writing AV scenarios

I hope I was able to convince you that we should target AV verification at both expected and unexpected bugs (with an emphasis on spec bugs). I’ll try to suggest below how scenario-based AV verification can be extended to do that.

Spec bugs really are a big deal. In a sense, they are worse than SW bugs (though both are unpredictable). SW bugs are, at least, on people’s minds, and there are established methodologies (e.g. redundancy and fault containment regions) for coping with them (though those also have issues – see here).

But (and let me slip into bold here) if nobody thought about what a stationary AV should do when a car backs into it, then all that triple-redundancy and ASIL-D SW certification are not going to help much. The AV will (very reliably) do the wrong thing.

See also It’s the spec bugs that kill you.

Finding spec bugs is hard. If we could enumerate all the considerations which can cause spec bugs (like I enumerated the seven items above), then we could simply add the corresponding variants to the yellow_light scenario (with appropriate checks).

But we can’t, of course – that’s the whole point about spec bugs. Nor is there a sure way to find all of them – we should unfortunately assume that the process of looking for them will continue after deployment: That catalog-of-scenarios-for-all-imagined-dangers I mentioned earlier will keep growing.

However, we should do our absolute best to find those bugs as soon as possible. Here are some helpful techniques, and how the current Pegasus approach may be extended to accommodate them (note again that this is based on my possibly-naïve understanding of what Pegasus currently does).

Systematic Coverage Driven Verification: One technique which is clearly helpful is using Coverage Driven Verification with an emphasis on corner cases and automatic coverage filling – see Verification, coverage and maximization – the big picture. Note that this involves writing very general abstract scenarios which can be instantiated into many different concrete scenarios.

Pegasus currently seems to advocate Monte-Carlo randomization using expected distributions, which are good for estimating the frequency of failures. However, to find unexpected bugs you need bug-finding distributions, which are good at reaching edge cases (but not so good for estimating failure frequencies – see related discussion here). In other words, you want distributions that make rare events happen more often, in the hope of triggering unexpected bugs more frequently than they otherwise would.
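The difference between the two kinds of distributions fits in a few lines. In this sketch, a realistic (expected) distribution essentially never samples a tailgating Bcar, while a bug-finding distribution deliberately over-weights that edge case. All the numbers and weights are invented for illustration:

```python
import random

def bcar_gap_realistic(rng):
    """Expected-distribution sampling: gaps roughly as measured on
    real roads (illustrative numbers)."""
    return rng.uniform(10.0, 40.0)  # tailgating (< 5 m) never sampled

def bcar_gap_bug_finding(rng):
    """Bug-finding sampling: spend, say, half the runs on the rare
    tailgating edge case, to provoke unexpected bugs."""
    if rng.random() < 0.5:
        return rng.uniform(1.0, 5.0)    # edge case: Bcar very close
    return rng.uniform(5.0, 40.0)       # the rest of the range

rng = random.Random(0)
tailgating = sum(bcar_gap_bug_finding(rng) < 5.0 for _ in range(1000))
print(tailgating)  # roughly half the runs hit the edge case
```

Note that runs sampled this way are no longer usable for estimating real-world failure frequencies without re-weighting – that is exactly the trade-off between the two kinds of distributions.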

Aggressive scenario mixing: Another technique is aggressive scenario mixing: Pegasus (and many similar approaches) currently emphasize “sterile”, stand-alone scenarios (albeit with many parameters). Those are indeed easier to understand and debug initially, but to find unexpected bugs you really need to run aggressive, messy mixes of scenarios.

For instance, if both the yellow_light and the sensor_failure scenarios are in the mix, there is at least a chance of discovering spec bugs related to item 7 above (“If the front video cameras stopped working just before the junction and you need to quickly find a safe place to stop, remember that Lidar does not see stoplight colors”).
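The kind of mix described here can be sketched as a toy scheduler: pick several scenarios from a pool and give each a random start time, so they overlap within one run. The scenario names come from the post; everything else is invented:

```python
import random

# Pool of parameterized scenarios available for mixing.
SCENARIO_POOL = ["yellow_light", "sensor_failure",
                 "ambulance_arriving", "pedestrian_crossing"]

def mix_scenarios(rng, max_overlap=3, run_length_s=60.0):
    """Pick several scenarios and give each a random start time, so
    they can run during / just-before / just-after each other."""
    chosen = rng.sample(SCENARIO_POOL, rng.randint(2, max_overlap))
    return sorted((rng.uniform(0.0, run_length_s), name)
                  for name in chosen)

rng = random.Random(1)
schedule = mix_scenarios(rng)
# e.g. sensor_failure may now start seconds before yellow_light,
# exposing spec bugs neither sterile scenario would hit alone.
```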

On-the-fly generation: Yet another helpful technique is on-the-fly test generation: In complex situations with lots of mixed scenarios it is really impossible to completely plan in advance, and thus test generation needs to be opportunistic and depend on the current state of the world. Pegasus currently seems to favor the pre-generation approach (generate a test, write it as an OpenScenario file, then run it).
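The contrast between pre-generation and on-the-fly generation can be shown with a toy Ccar controller: the pre-generated version fixes the Ccar’s launch condition when the test file is written, while the on-the-fly version decides by observing what the DUT actually did. All state names are hypothetical:

```python
def plan_ccar_launch_pregenerated(launch_time_s):
    """Pre-generation style: the Ccar launch time was fixed when the
    (OpenScenario-like) test file was written, before the run started."""
    return lambda world: world["time_s"] >= launch_time_s

def plan_ccar_launch_on_the_fly(world):
    """On-the-fly style: decide opportunistically, based on the current
    state of the world - launch the Ccar only if the AV chose to cross."""
    return world["av_committed_to_cross"] and world["av_in_junction"]

# In this run the AV braked: a pre-generated launch still fires blindly
# (producing a meaningless run), while the on-the-fly generator adapts.
world = {"time_s": 12.0, "av_committed_to_cross": False,
         "av_in_junction": False}
print(plan_ccar_launch_pregenerated(10.0)(world))  # True (fires anyway)
print(plan_ccar_launch_on_the_fly(world))          # False (adapts)
```

In messy, mixed-scenario runs, the DUT’s behavior cannot be fully predicted in advance, which is why the adaptive version becomes important.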

Will this help? Constraint-based Coverage Driven Verification (CDV) is currently the main technique for finding sub-system HW bugs (note that HW folks do not usually stress the difference between implementation and spec bugs). Scenario-based CDV is now becoming established in SoC / embedded SW verification. And fuzzing (a somewhat-similar technique with many variants) is fairly successful in finding security bugs.

AV verification is different from all of these, but I think these techniques point in the right direction (for a fuller discussion, see that spec-bugs post).

Note: Foretellix (my company) is building a scenario-based AV verification system.

To summarize: Verifying and certifying AVs (to a satisfactory level) is going to be hard, no matter what. Perhaps, as a somewhat-pessimistic Wired article recently suggested, “the last 1 percent is harder than the first 99 percent”. Thus, we should make sure that the approach we take is applicable to the entire job.

Notes

I’d like to thank Gil Amid, Sandeep Desai, Ziv Binyamini, Amiram Yehudai, Thomas (Blake) French, Kerstin Eder, Zohar Zisapel, Sankalpo Ghose and Eric Chan for commenting on an earlier draft of this post.

 

