Summary: This post discusses the influence of the Uber accident on Autonomous Vehicle (AV) deployment. It claims that AVs should eventually be deployed, and yet that we should expect many fatal AV accidents. It then suggests that a comprehensive, transparent verification system could help solve this inevitable tension.
That tragic Uber accident has brought AV safety into sharp focus. Brad Templeton wrote an excellent summary of what is known so far, and others have weighed in on how people are more afraid of things they can’t control, on the need for third-party testing, on the insurance implications of all this and so on.
As I write this, there are still open questions. What is clear is that the victim, a homeless woman pushing a bicycle, crossed several lanes in an unlit, no-crossing section of the road before being fatally hit by the AV. There was probably enough time for a human driver to stop. Further, the AV’s lidar (a laser-based ranging sensor), and probably its other sensors, should have given enough advance warning for the AV to stop in time. Why did that not happen?
One possibility is some weird, un-verified corner case (e.g. a person wearing dark clothes walking a bike with lots of bags) – I mentioned before that Machine Learning (ML) based systems (often used for scene understanding) are especially hard to verify comprehensively. But perhaps this was not a bug at all: one rumor says the lidar was turned off for testing purposes, or perhaps it was something else altogether.
Regardless of the specifics of this incident, this post will look at the bigger picture of AV deployment, safety and verification, and expand on the following claims:
- Beyond a certain safety threshold, AVs should be deployed
- While safer than human drivers, AVs will continue to have many fatal accidents
- AV manufacturers and regulators should employ a well-thought-out, comprehensive, continuously-improving, multi-execution-platform, transparent verification system
Let me elaborate on all that.
Beyond a certain safety threshold, AVs should be deployed
Consider figure 1 below: It assumes (oversimplifying somewhat) that human-driver safety stays constant, while AV safety keeps improving. It also splits AV fatalities into “unavoidable accidents” and “AV bugs” (note that as AVs improve, some previously-unavoidable accidents will become avoidable).
Where is AV safety right now on that yellow curve? That’s hard to say. In the US, human drivers have about one fatality every 100M miles, so Uber (with one fatality after just ~2M miles) looks pretty bad. However, the best AVs (say Waymo’s) are probably much safer, and constantly improving.
The accident has certainly lowered people’s perception of AV safety. Uber, in particular, may have taken the phrase “Move fast and break things” a bit too literally. In the above post, Brad says:
I suspect it may be a long time — perhaps years — before Uber can restart giving rides to the public in their self-driving cars. It may also slow down the plans of Waymo, Cruise and others to do that this year and next.
But I think many will agree that beyond a certain level of AV safety (assuming it can be demonstrated convincingly), society has a moral duty to deploy AVs (so as to prevent all those deaths between the black and the yellow curves). That “should deploy” date will be different for different domains (e.g. limited areas of Phoenix in good weather, all of Boston in any weather, all of Bangalore in any weather etc.), and I am not going to guess the dates here, but they will come.
There are also, of course, huge commercial incentives for deployment, but let’s stick for now with “what’s the right thing to do”. And it seems that the right thing to do is:
- Find a reasonable way to determine the correct deployment date
- Do whatever we can to improve AV safety as quickly as possible (both before and after that date)
That first bullet is tricky, though, for various reasons. Here is one:
There will probably be many fatal AV accidents
Consider a future date (say in the US), where AVs are clearly 10 times safer than humans (just how we determine that will be discussed below). At that point, we are probably well into “should deploy” territory. Assume further that by that date, a full 10% of driven miles are driven by AVs: This is clearly good, since each of these miles is 10 times safer than a human-driven mile.
But even then, we should expect about one fatal AV accident per day. There are about 100 fatal car accidents per day in the US, and 100 × 10% (the AV share of miles) ÷ 10 (the safety improvement factor) is 1.
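The arithmetic above can be written out explicitly. All numbers are the illustrative assumptions from the text, not real statistics:

```python
# Back-of-the-envelope estimate of expected fatal AV accidents per day.
# All inputs are the illustrative assumptions from the text above.

human_fatal_accidents_per_day = 100   # rough US figure
av_share_of_miles = 0.10              # assumed 10% of miles are AV-driven
av_safety_factor = 10                 # assumed AVs are 10x safer per mile

expected_av_fatal_accidents_per_day = (
    human_fatal_accidents_per_day * av_share_of_miles / av_safety_factor
)
print(expected_av_fatal_accidents_per_day)  # roughly one per day
```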
Each of these daily accidents will (usually) be less news-worthy than that Uber accident, but will still be scrutinized much more than a human-caused accident. Was it unavoidable? Was it a bug? How well was that scenario verified by the AV manufacturer?
So there is a need for the various stakeholders (the public, lawmakers, regulators, AV manufacturers etc.) to agree on some general framework for handling these accidents (and the whole deployment process). That framework should ensure, among other things, that:
- Not every accident results in a lengthy, billion-dollar lawsuit
- Negligent AV manufacturers do get punished
- Everybody (the public, the press, judges, lawmakers, regulators etc.) has an understandable way to scrutinize the safety of various AVs, both in general and as it relates to a specific accident scenario
This is going to be a non-trivial framework: It will surely have legal and regulatory components. It will probably include ISO-style “process” standards, such as ISO 26262, the SOTIF follow-on, and the expected “SOTIF-for-AVs” follow-on to that. It may contain a formal component and more.
But I think the central component (tying all others together) is going to be a verification system. Let me try to convince you of that:
The need for a well-thought-out verification system
I think this framework should be based on a verification system, which lets you:
- Define a comprehensive, continuously-updated library of parameterized scenarios
- Run variations of each scenario many times against the AV-in-question, using a proper mix of execution platforms (such as simulation, test tracks etc.)
- Evaluate the aggregate of all these scenarios / runs (and any requested subset of it), to transparently understand what was verified (this is called “coverage”) and what “grade” it got
Such a coverage-driven verification system (enhanced by ML-based techniques) is probably our best bet. It will also be a crucial component for improving safety as quickly as possible (see this post for more details about coverage driven verification).
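To make the idea concrete, here is a minimal, purely illustrative Python sketch of such a coverage-driven loop: run many randomized variations of a parameterized scenario, record which coverage "bins" were exercised, and report a grade per bin. All names (`SCENARIO_PARAMS`, `run_on_platform`) and the grading logic are made up for illustration; they are not any real tool's API:

```python
import random
from collections import defaultdict

# Illustrative parameterized scenario: a pedestrian crossing in front of the AV.
SCENARIO_PARAMS = {
    "pedestrian_clothing": ["bright", "dark"],
    "lighting": ["day", "dusk", "night"],
    "pedestrian_speed_mps": (0.5, 2.5),   # continuous range
}

def random_variation():
    """Pick one concrete variation of the parameterized scenario."""
    return {
        "pedestrian_clothing": random.choice(SCENARIO_PARAMS["pedestrian_clothing"]),
        "lighting": random.choice(SCENARIO_PARAMS["lighting"]),
        "pedestrian_speed_mps": random.uniform(*SCENARIO_PARAMS["pedestrian_speed_mps"]),
    }

def run_on_platform(variation):
    """Stand-in for executing the scenario on a simulator or test track.
    Returns a fake grade in [0, 1]; a real system would measure AV behavior."""
    return 1.0 if variation["lighting"] != "night" else 0.7

coverage = defaultdict(list)
for _ in range(1000):
    v = random_variation()
    grade = run_on_platform(v)
    # Coverage bin: the (clothing, lighting) combination exercised.
    coverage[(v["pedestrian_clothing"], v["lighting"])].append(grade)

# Transparent report: what was verified, and how well it did.
for bin_, grades in sorted(coverage.items()):
    print(bin_, f"runs={len(grades)}, worst_grade={min(grades):.1f}")
```

Even this toy version shows the two halves of the approach: randomized generation finds combinations nobody enumerated, and the coverage report makes visible which combinations were (and were not) exercised.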
Here are some of the main attributes of such a verification system:
It should be comprehensive: The scenario library should be comprehensive along many dimensions. Here are some examples:
- It should cover both avoiding accidents, and behaving during / after accidents
- It should pay specific attention to interactions between AVs and people, and grade for both accidents and annoyances
- It should emphasize the verification of ML-based algorithms
- It should use “normal”, expected-frequency scenarios, but also (and primarily) corner-case-directed, bug-finding scenarios. Similarly, it should use both digitized topologies of actual cities and made-up topologies.
- It should look for both “expected” and “unexpected” bugs (the example bug in figure 1 is an “unexpected” bug)
- It should cover all the scenarios mentioned in the appendix of the Waymo safety report, all the scenarios allegedly not covered by Cruise (paywalled), all the Pegasus scenarios and many more
- It should be relatively easy to modify per locale / domain / driving restrictions etc.
A big challenge is making the library comprehensive without making it unwieldy. For instance, a huge (and exponentially growing) spreadsheet / database of test cases will simply not scale: Beyond a certain size it will be neither transparent nor maintainable.
Much of the heavy lifting will have to be done by the Scenario Description Language (SDL) in which the library will be written (and by the related tools). The SDL should be constraint-and-coverage-based, and extensible.
In particular, the system should constantly try to mix various scenarios, parameter values, topologies, weather conditions etc. (so as to find new bugs), without the user having to specify, or even think of, every combination. For instance, if that Uber accident was indeed caused, say, by mis-classification of a person wearing dark clothes walking a bike with lots of bags, then we would expect such a system to find this bug (with good probability) without anybody having to think of this exact combination in advance.
On the other hand, a user should be able to define any such combination C (however specific or general) using functional coverage definitions. The user should then be able to ask the system: “Show me all instances of C in last week’s runs”, “Show me the instances of C which got the worst grade”, “Show me a graph of how we did on C over the last few releases”, or even “During tomorrow’s runs, tweak the various parameters so as to get many more instances of C”.
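As a toy illustration of such functional-coverage queries (with made-up runs, field names and grades; a real system would query a large database of recorded runs):

```python
# Each recorded run is a dict of scenario parameters plus a grade.
runs = [
    {"week": 1, "clothing": "dark",   "walking_bike": True,  "grade": 0.6},
    {"week": 1, "clothing": "bright", "walking_bike": False, "grade": 1.0},
    {"week": 2, "clothing": "dark",   "walking_bike": True,  "grade": 0.9},
]

def C(run):
    """A user-defined combination: person in dark clothes walking a bike."""
    return run["clothing"] == "dark" and run["walking_bike"]

# "Show me all instances of C in last week's runs"
last_week = [r for r in runs if r["week"] == 2 and C(r)]

# "Show me the instance of C which got the worst grade"
worst = min((r for r in runs if C(r)), key=lambda r: r["grade"])

print(len(last_week), worst["grade"])  # → 1 0.6
```

The point of the sketch is that C is defined once, declaratively, and every query (and even future test generation) can then be expressed in terms of it.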
It should be continuously-improving: Even with the best verification system, it is impossible to consider everything ahead of time. As I described in The Tesla crash, Tsunamis and spec errors, some scenarios (like what an AV should do when it encounters a Tsunami) may only occur to people at a later stage.
We may realize (e.g. following a specific accident) that we are missing some scenarios, or that our coverage “mesh” is not tight enough, or that our grading function is incomplete. We may also want to automatically extract scenarios from recordings (of dashcams, static cameras, AV recordings etc.) – see “Extracting scenarios from recordings” in this post for some of the considerations.
Thus there will be a need to continuously update / enhance that scenario library, in a collaborative and safe way. If and when some form of an open, standard scenario library emerges (complete with coverage and grading definitions), there will be e.g. the 2021 standard, the more-comprehensive 2023 standard, and so on.
It should run on many execution platforms: It was already becoming clear (as I reported here) that most verification should be done using simulations, because this lets you explore many more unconsidered scenarios in a scalable way. The Uber accident will probably accelerate this trend, because simulations let you do many dangerous things without the risk of causing damage or actually killing people.
However, verification should also be performed using other execution platforms (such as vehicle-in-the-loop, automated test tracks, street driving and so on). These execution platforms, and their sub-configurations (e.g. see this post), have various tradeoffs (e.g. realism vs. cost vs. speed), and trustworthy verification should involve judicious use of several of them. Ideally, you should be able to use the same scenario definitions for driving / measuring scenarios on all these platforms.
It should be transparent: Verification transparency is key here: Many stakeholders need to see how well various scenarios were verified, without having to go into deep technical discussions (of SW, ML etc.).
For instance, the public would like to check that a specific accident scenario (say an unprotected left turn during fog) was indeed verified for that AV. People may also want to see how related areas were verified (e.g. unprotected left turns in any bad weather, or other challenging driving scenarios during fog).
The ability to scrutinize this using commonly-understood terms will be especially welcome for accidents which do go into litigation.
Regulators will need a common, clear way for specifying and tracking verification compliance, using terminology they can relate to (and perhaps dictate). Note that the verification system should also be able to give a rough estimate of the projected fatalities-per-mile under various conditions, though this is pretty hard to do.
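As a hedged sketch of what such a rough estimate might look like (a simple Poisson assumption with a normal approximation, and purely illustrative numbers; a real estimate would need far more careful statistics and far more data):

```python
import math

# Rough sketch: estimating a fatalities-per-mile rate from observed data,
# with an approximate 95% confidence interval (Poisson assumption).
# Both inputs are purely illustrative.

fatal_events = 4
miles_driven = 500_000_000  # 500M miles

rate = fatal_events / miles_driven
# Normal approximation to the Poisson uncertainty on the event count:
half_width = 1.96 * math.sqrt(fatal_events) / miles_driven

print(f"estimated rate: {rate:.2e} fatalities/mile")
print(f"approx 95% CI: [{max(0.0, rate - half_width):.2e}, {rate + half_width:.2e}]")
```

Note how wide the interval is with so few events; this is exactly why demonstrating a fatality rate from street driving alone takes an enormous number of miles, and why simulation-based coverage matters.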
Note that regulators and lawmakers (at least in the US) stress the need to evaluate safety (rather than technology), saying things like:
We’re looking at ways to evaluate outcomes. Instead of a regulation that says, ‘Machine must have A, B and C in a vehicle’, we hope to look at how safe a vehicle is at the other end.
Finally, the AV providers themselves will probably welcome a clear, transparent set of verification requirements: As somebody who is familiar with several OEMs recently told me, once they get clear requirements, they’ll make sure to exceed them by 20% – they just don’t have them yet.
To summarize: I hope I have been able to suggest why AV deployment should continue despite future accidents, and why a well-thought-out verification system could really help. Note that Foretellix is working on such a system, but this is something that no single company or organization can do all by itself.
I’d like to thank Ziv Binyamini (who BTW just joined Foretellix as CEO), Gil Amid, Sandy Hefftz, Moshe Gavrielov, Brad Templeton, Thomas (Blake) French, Yaron Kashai, Amiram Yehudai, Ohad Schwarzberg, Sankalpo Ghose and Kerstin Eder for commenting on earlier drafts of this post.
A note about the Foretellix blog: This will remain an ideas / technical blog. For updates about my company, Foretellix (team, advisory board etc.) please see the Foretellix web site.
This post leaves a lot of open questions, some of which will be discussed in future posts. Comments are very welcome.