Summary: This is another one of those “misc. stuff” posts, with no unifying theme other than “Interesting inputs regarding Autonomous Vehicles verification”. It will discuss: What I learned regarding the ASAM OSC standardization effort, DeepMind’s “Rigorous Agent Evaluation” paper, Tesla’s “400,000-car regression farm” idea, some good papers by Philip Koopman, and the upcoming Stuttgart symposium.
ASAM impressions, take 3
Gil Amid, who is (among other things) the Foretellix guy for regulation and standardization, organized the ASAM OSC 2.0 informal discussion group over the last few months. Take a look at the presentations there: I found them to be a pretty good (and diverse) source of requirements for what a scenario language / framework should look like, representing the views of OEMs, regulators, AV newcomers and others.
Last week Gil and yours truly attended the kickoff of the ASAM OSC 2.0 “concept project”. Overall, it seemed to go in the right direction, but there is certainly a lot of work to do, starting with terminology.
I talked before about how the various “verification tribes” use slightly-different terminology for things, and described how:
Any specific speaker uses his (her) own terminology, secure in his knowledge that it is the only one possible. This leads to the dreaded assumed terminology trap, where he is happily carrying on, while halfway through his presentation you are still doing pattern recognition on what he said so far, to understand what he is talking about.
One sometimes wishes to be able to click on the speaker to pause him, then click on his last words to jump to his definition of them. Barring that (and yes, you can take that excellent startup idea and run with it), perhaps a common terminology file could help.
And indeed I maintain a terminology page to keep track of “what-tribe-X-calls-thing-Y”.
Sadly, not everybody in the known universe studies this page as much as they should. For instance, in that ASAM meeting I said “I suppose we can all agree that a ‘test’ is (represented by) a file specifying the configuration + top scenario, and a ‘run’ is what you get when you actually run that test with a specific random seed”. Well, apparently we cannot: Some people use the term ‘testcase’ for what I call a ‘run’, while some use the term ‘test’ to specify what I call a ‘check’ (as in “you subject a run to multiple tests to see if it ran well”).
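To make the test / run / check distinction concrete, here is a minimal Python sketch of the terminology as I use it. All names (`ScenarioTest`, `Run`, `execute`, `apply_checks`, the config file name and the two example checks) are hypothetical, and the "simulation" is just a seeded random number generator standing in for a real one.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ScenarioTest:
    """A 'test': configuration + top scenario, with no seed chosen yet."""
    config: str
    top_scenario: str


@dataclass(frozen=True)
class Run:
    """A 'run': what you get from executing one test with one specific seed."""
    test: ScenarioTest
    seed: int
    gaps_m: Tuple[float, ...]  # hypothetical recorded gap to the lead car, in meters


# A 'check' judges a finished run; many checks can be applied to the same run.
Check = Callable[[Run], bool]


def execute(test: ScenarioTest, seed: int) -> Run:
    """Stand-in for a simulation: the seed alone decides the random choices."""
    rng = random.Random(seed)
    gaps = tuple(rng.uniform(1.0, 30.0) for _ in range(5))
    return Run(test, seed, gaps)


def apply_checks(run: Run, checks: Dict[str, Check]) -> Dict[str, bool]:
    """Apply every named check to one run and collect the verdicts."""
    return {name: check(run) for name, check in checks.items()}


cut_in = ScenarioTest(config="highway.cfg", top_scenario="cut_in")
checks = {
    "kept_min_gap": lambda run: min(run.gaps_m) >= 2.0,
    "recorded_all_steps": lambda run: len(run.gaps_m) == 5,
}
# One test, three seeds, hence three different runs, each subjected to all checks.
results = {seed: apply_checks(execute(cut_in, seed), checks) for seed in (1, 2, 3)}
```

In this vocabulary, re-running the same test with the same seed reproduces the same run, which is what makes a failing run debuggable.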
Nevertheless, I am still fairly optimistic about the process: While the terminology issue is non-trivial (and will need to be solved in coordination with other groups / standards), we did move forward to surface several important topics, and the atmosphere was friendly and helpful.
One of the more technical topics we discussed was the general block diagram / interconnect architecture of the scenario-based framework (see Fig. 1 below, slightly adapted from my presentation there):
Note that Fig. 1 is not drawn “to scale”: The various simulators / models at the bottom may be much bigger (code-wise) than the scenario engine in the middle.
Indeed, I claimed that one of the main challenges for creating a standard scenario-based framework is that organizations already have a lot of verification-related assets (simulators, models, libraries etc.) which they use productively. Those may represent a huge investment which cannot be thrown away. Thus, much of that discussion centered on how to simplify the reuse of existing assets in the new framework (hint: see those blue adapters in the picture above).
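The adapter idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `LegacyTrafficSim` stands in for some existing in-house simulator, and the `SimulatorAdapter` interface (`advance`, `add_actor`, `now`) is invented for this example; it is not taken from any ASAM document.

```python
from typing import List, Protocol, Tuple


class LegacyTrafficSim:
    """Stand-in for an existing simulator with its own API (hypothetical)."""

    def __init__(self) -> None:
        self.t_ms = 0
        self.spawned: List[Tuple[str, float, float]] = []

    def tick_ms(self, ms: int) -> None:
        self.t_ms += ms

    def spawn(self, kind: str, x: float, y: float) -> None:
        self.spawned.append((kind, x, y))


class SimulatorAdapter(Protocol):
    """The (assumed) interface that the scenario engine talks to."""

    def advance(self, seconds: float) -> None: ...
    def add_actor(self, actor_type: str, position: Tuple[float, float]) -> None: ...
    def now(self) -> float: ...


class LegacyTrafficSimAdapter:
    """Thin translation layer: the legacy simulator itself is left untouched."""

    def __init__(self, sim: LegacyTrafficSim) -> None:
        self._sim = sim

    def advance(self, seconds: float) -> None:
        self._sim.tick_ms(round(seconds * 1000))

    def add_actor(self, actor_type: str, position: Tuple[float, float]) -> None:
        self._sim.spawn(actor_type, position[0], position[1])

    def now(self) -> float:
        return self._sim.t_ms / 1000.0


def run_cut_in(sim: SimulatorAdapter) -> None:
    """A tiny 'scenario' written only against the adapter interface."""
    sim.add_actor("car", (0.0, 3.5))
    sim.advance(0.5)
    sim.add_actor("truck", (20.0, 0.0))
    sim.advance(1.5)


legacy = LegacyTrafficSim()
adapter = LegacyTrafficSimAdapter(legacy)
run_cut_in(adapter)
```

The scenario code never mentions milliseconds or `spawn`, so swapping in a different simulator means writing another small adapter rather than rewriting scenarios.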
This was just one of several good discussions. We’ll be back in Munich 6..7-June for the next face-to-face meeting (Gil is now the elected leader of that concept project), and I promise to report from that event.
DeepMind’s “Rigorous Agent Evaluation” paper
DeepMind has an interesting late-2018 paper talking about doing bug hunting and risk estimation of Machine-Learning-based systems.
Like many others, they build a second, “adversary” ML agent which learns how to find failures in the original, tested ML agent. However:
The classical approach does not work well for our setup, because failures are rare and the failure signal is binary. For example, in the humanoid domain, the agent of interest fails once every 110k episodes, after it was trained for 300k episodes. To learn a classifier f we would need to see many failure cases, so we would need significantly more than 300k episodes to learn a reasonable f.
One technique they employ to solve this problem is to use previous versions of the original ML agent (i.e. before it was fully trained). Those (naturally) have more failures, and that apparently lets the adversary agent learn the kind of failures that would persist even after the training. They then use this adversary to achieve a pretty good speedup over “Vanilla Monte Carlo”, for both failure hunting and risk estimation.
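To illustrate the general idea (and only the general idea: this is a toy under my own assumptions, not a reproduction of the paper's algorithm), here is a Python sketch. A failure predictor `f` is estimated from a weaker agent that fails often, and is then used as a sampling proposal for the final agent, with importance weights keeping the risk estimate unbiased. The two toy agents, the 1000 discrete initial conditions and all the probabilities are invented for the example.

```python
import random

N_CONDITIONS = 1000  # hypothetical discrete set of initial conditions


def final_agent_fails(x, rng):
    """Toy fully trained agent: fails rarely, and only in a small region."""
    return x < 10 and rng.random() < 0.02


def weak_agent_fails(x, rng):
    """Toy earlier checkpoint: fails often, in a wider, overlapping region."""
    p = 0.5 if x < 20 else 0.01
    return rng.random() < p


# Known only because this is a toy: 1% of conditions, each failing 2% of the time.
TRUE_RISK = (10 / N_CONDITIONS) * 0.02


def learn_failure_predictor(rng, trials_per_condition=50):
    """Estimate a per-condition failure rate from the weak agent only."""
    f = []
    for x in range(N_CONDITIONS):
        k = sum(weak_agent_fails(x, rng) for _ in range(trials_per_condition))
        f.append((k + 1) / (trials_per_condition + 2))  # smoothing keeps f > 0
    return f


def vanilla_monte_carlo(n, rng):
    """Sample conditions uniformly; return (risk estimate, failures seen)."""
    failures = sum(
        final_agent_fails(rng.randrange(N_CONDITIONS), rng) for _ in range(n)
    )
    return failures / n, failures


def importance_sampled_risk(f, n, rng):
    """Sample conditions in proportion to f, then reweight each failure."""
    total = sum(f)
    q = [fx / total for fx in f]   # proposal distribution over conditions
    p0 = 1.0 / N_CONDITIONS        # the real (uniform) distribution
    xs = rng.choices(range(N_CONDITIONS), weights=q, k=n)
    weighted, failures = 0.0, 0
    for x in xs:
        if final_agent_fails(x, rng):
            failures += 1
            weighted += p0 / q[x]  # importance weight undoes the sampling bias
    return weighted / n, failures


rng = random.Random(0)
predictor = learn_failure_predictor(rng)
is_estimate, is_failures = importance_sampled_risk(predictor, 50_000, rng)
mc_estimate, mc_failures = vanilla_monte_carlo(50_000, rng)
```

With the same episode budget, the guided sampler sees many more failures of the final agent than uniform sampling does, which helps both the bug hunt and the tightness of the risk estimate. Note that the predictor only has to be roughly right: the importance weights correct for its errors, as long as it never assigns zero probability to a condition where the final agent can fail.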
Yes, this is only geared towards verifying ML-based components, and the example components don’t have very many input variables, but I feel that combining such techniques with “normal” scenario-based verification can be pretty promising. Some of that has to do with adding logic “on top of” existing scenarios (e.g. in that green “Management” rectangle at the top of fig. 1).
BTW, I hope to blog soon about how various techniques for bug hunting and risk estimation (and other safety-related activities) interact with Coverage Driven Verification.
Another interesting line of research on ML-based bug-finding is the work on Adaptive Stress Testing from Stanford and others. See also the first chapter of this post for a summary of where ML can help verification.
Templeton on Tesla’s “400,000-car regression farm” idea
Brad Templeton has a good post on Tesla’s “Shadow” driving and how they test Autopilot. Here is his description of “shadow testing”:
In shadow testing, a car is being driven by a human or a human with autopilot. A new revision of the autopilot software is also present on the vehicle, receiving data from the sensors but not taking control of the car in any way. Rather, it makes decisions about how to drive based on the sensors, and those decisions can be compared to the decisions of a human driver or the older version of the autopilot.
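As a rough illustration of the comparison step described in that quote, here is a minimal Python sketch. It is not based on any knowledge of Tesla's implementation: the `Decision` fields, the normalization scales and the threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Decision:
    steering_deg: float
    accel_mps2: float


# Assumed scales: how big a difference counts as "one unit" of disagreement.
STEER_SCALE_DEG = 5.0
ACCEL_SCALE_MPS2 = 1.0


def disagreement(active: Decision, shadow: Decision) -> float:
    """Largest normalized difference between the two decisions."""
    return max(
        abs(active.steering_deg - shadow.steering_deg) / STEER_SCALE_DEG,
        abs(active.accel_mps2 - shadow.accel_mps2) / ACCEL_SCALE_MPS2,
    )


def flag_frames(
    active_log: Sequence[Decision],
    shadow_log: Sequence[Decision],
    threshold: float = 1.0,
) -> List[int]:
    """Indices of frames where the shadow SW would have acted differently."""
    return [
        i
        for i, (a, s) in enumerate(zip(active_log, shadow_log))
        if disagreement(a, s) > threshold
    ]


active = [Decision(0.0, 0.0), Decision(0.0, 0.0), Decision(2.0, -1.0)]
shadow = [Decision(2.0, 0.5), Decision(10.0, 0.0), Decision(2.0, -3.0)]
flagged = flag_frames(active, shadow)  # frames 1 and 2 exceed the threshold
```

A flagged frame only says that the two disagreed at that moment. Since the shadow software never controls the car, the log cannot say what would have happened next had its decision been applied.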
His opinion on the topic is similar to mine: This is an excellent idea but it is not enough. He says:
You’ll never see everything strange that you need to see even in millions of miles of shadow testing, so simulation remains important. Tesla has a simulator, and uses it, but takes a dim view of simulation compared to shadow testing.
This deserves a longer discussion, but briefly, here are some of my reasons for thinking simulation is indispensable even in that setup (note that some of them are mentioned in Brad’s post, and Tesla seems to be aware of many of them):
- If the new version of the SW makes a different decision from the old one, you cannot check the consequences of that decision (except in simulation). Assuming that only “big differences” are important may be problematic.
- This does not give you a measurable way to track functional coverage of scenarios between releases, nor is it repeatable.
- It is limited to the ODDs / locations / weathers in which Tesla has lots of cars during the comparison period. Thus, a constantly-expanding fleet will encounter many new edge cases after deployment.
- Even for a specific ODD (say the SF bay area), if the final version of the new SW was tested for, say, 3 months, it may still not have encountered enough edge cases (because perhaps it did not see any rain, because 3 months is a short time etc.)
Again, this does not take away from Tesla’s idea. I just claim that simulation (when combined with techniques like Coverage Driven Verification) is a much more efficient way to drum up edge cases and measure where you are, and thus both should be used.
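For readers unfamiliar with the "measure where you are" part, here is a minimal sketch of functional coverage over scenario parameters, in Python. The two coverage dimensions (weather and a speed bucket), the bucket boundaries and the run records are all hypothetical; a real verification plan would have many more dimensions.

```python
from collections import Counter
from itertools import product

WEATHER = ("clear", "rain", "fog")
SPEED_BINS = ("low", "mid", "high")


def speed_bin(speed_kph):
    """Map a raw speed to a coverage bucket (assumed boundaries)."""
    if speed_kph < 40:
        return "low"
    if speed_kph < 90:
        return "mid"
    return "high"


def coverage(runs):
    """Return (fraction of buckets hit, list of buckets never hit)."""
    hits = Counter((r["weather"], speed_bin(r["speed_kph"])) for r in runs)
    all_bins = list(product(WEATHER, SPEED_BINS))
    holes = [b for b in all_bins if hits[b] == 0]
    return (len(all_bins) - len(holes)) / len(all_bins), holes


runs = [
    {"weather": "clear", "speed_kph": 30},
    {"weather": "clear", "speed_kph": 100},
    {"weather": "rain", "speed_kph": 50},
]
fraction, holes = coverage(runs)
```

The list of holes is the useful output: it says which combinations were never exercised, so that new runs can be steered toward them, and the same measurement can be repeated for the next release.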
Some good Koopman papers
Philip Koopman et al. have had a bunch of good papers lately. If you are looking for a good starting point for enumerating risk dimensions in your verification plan, you could do worse than looking at these three papers:
- How Many Operational Design Domains, Objects, and Events?
- Credible Autonomy Safety Argumentation
- Toward a Framework for Highly Automated Vehicle Safety Validation
They also have a detailed paper on constructing a safety case (though it is limited to the safety case for “doing road testing with a safety driver”). I described how the safety case relates to other safety-related artifacts (like the verification plan) in the last chapter of this post.
News flash: Philip also has a new Autonocast podcast – I am just 20% through it, and I find it pretty interesting.
It’s that Stuttgart time again: The Stuttgart Autonomous Vehicles Test and Development Symposium is 21..23-May. It is always an interesting event, and a good way to feel the pulse of the industry – see my coverage of the 2018, 2017, 2016 and 2015 incarnations.
Quite a few of us (Foretellix) will be there. I’ll be giving a presentation (on the first day), and we also have a booth. Send an email to firstname.lastname@example.org if you want to meet.