Finding bugs in autonomous vehicles: My impressions from the second Stuttgart symposium

Summary: This trip report describes my impressions from the Second Autonomous Vehicles Test & Design Symposium in Stuttgart. Short version: The design side is moving fast. The verification side has improved, but has a long way to go.

Note: In my last post I talked about what automation will do to this planet over the next 20 years, and promised a follow-up about verification, coming soon. This post is not it, but it is closely related: Autonomous Vehicles (AVs) are arguably the next big wave of (physical) automation, with autonomous drones / robots / etc. coming somewhat later.

Old-time readers of this blog may remember my report about last year’s Stuttgart symposium, in which I basically said that I was not very impressed with the state of AV verification. Throughout the year I kept hearing that AV design is progressing quickly, and thus AVs are going to be here sooner than originally expected. So last week I attended (most of) this second Stuttgart AV symposium, to see whether the verification side of things is catching up.

So is it? Well, things have certainly improved quite a bit (and it is a pretty hard problem to solve), but AV verification clearly has a long way to go: The whole process seems too disjoint and inefficient, virtual verification is way-underutilized, there is no accepted methodology (or practice) for large-scale definition of scenarios and functional coverage, there is no principled way to decide which subset of that coverage to run using Hardware In the Loop / Driver In the Loop, and so on.

Note that the usual caveats apply: I am an outsider to this industry, so some of the strong words I am using could be way off. And AV verification is more complex and interdisciplinary than the chip verification I am coming from (an industry which seems to have solved many of these problems about 20 years ago using Coverage Driven Verification). And quite possibly some companies are doing better stuff and not talking about it, and so on. Still, I think I am not way off, and I got confirmation for that from several other attendees – see below. I do invite you, kind reader, to comment.

A final note: Please use the terminology page if you are coming from outside the automotive industry and don’t know what HIL or NCAP mean (or if you are coming from outside the chip verification industry and don’t know what CDV or functional coverage mean).

OK, with this out of the way, and before I get into the details, let me explain why I think there is going to be a strong push to fix these problems fairly soon:

Expect a strong push to “fix” AV verification

Let me start with a weird thing I noticed in casual conversations during the symposium: Whenever I said “let’s consider the implications of the first few AV-related fatalities”, most people reacted with “Oh, we hope that does not happen. This could really set back the whole field”.

But this is clearly an unreasonable thing to say, right? The US currently has around 30,000 vehicle-related fatalities/year. Say we replaced 5% of those cars with AVs, and say AVs are 5 times safer. That’s 300 AV-related fatalities/year. Scale that down further if you like: That still gives you several times/month when a lawyer could stand in front of a jury and say something like “My client’s child was killed because this AV manufacturer was negligent”.
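Here is that back-of-the-envelope arithmetic as a tiny sketch. The 5% share and the 5X safety factor are just the illustrative assumptions from the paragraph above (not forecasts), the model is deliberately linear, and the function name is made up for this example.

```python
def av_fatalities_per_year(total_fatalities, av_share, safety_factor):
    """Expected yearly AV-related fatalities under a deliberately crude model.

    Assumes fatalities scale linearly with the share of cars replaced, and
    that an AV is `safety_factor` times safer than the car it replaces.
    """
    return total_fatalities * av_share / safety_factor


# Numbers from the text: ~30,000 US fatalities/year, 5% AVs, 5X safer.
per_year = av_fatalities_per_year(30_000, 0.05, 5)
per_month = per_year / 12
print(f"{per_year:.0f} per year, about {per_month:.0f} per month")
```

Even if only a fraction of those cases ever reach a courtroom, that is still several potential trials per month.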

That’s going to happen, right? And it is going to be front-page news. And an endless parade of experts will be called to the witness stand. But AVs will not be banned, because 5X fewer fatalities is still an incredible thing. And by the time it is all over, we’ll have requirements for stricter verification. The AV companies will help push that process, because they need to know where they stand. And because this stricter verification will be a lot of work, many of the inefficiencies in the current process will be removed.

All that should not be viewed as some unexpected terrible thing: I think we can consider this the (implicit) plan of record. And it will clearly give a big push to better verification. Other things might push it even earlier: NCAP etc. might add a significant component of “virtual testing” (more on that below), and some AV manufacturers might go out of their way to set a good example. In general, there seems to be a lot of action and energy in AV verification right now, so perhaps it will improve soon.

The pull of the physical

Perhaps the biggest issue which I think needs fixing is the over-reliance on “physical” testing, in which regulators / judges / AV manufacturers rely almost exclusively on physical testing of actual cars in actual testing grounds.

The history of NCAP helps explain this (very understandable) over-reliance: Initially some car manufacturers claimed there was no need for all that crash testing: Their internal high-quality standards and top-notch engineering should be enough. NCAP started crashing cars anyway, and in 1997 just one family car achieved 4-star performance. Now almost all do, and some of the credit for this improvement goes to NCAP’s attitude of “we love all those engineering discussions but if you don’t mind we are going to physically crash-test your car anyway”.

I heard that German judges almost invariably demand physical evidence in accident trials, probably for similar reasons.

This “show me the real thing” attitude has served those stakeholders well so far, but it does not scale when moving to autonomous vehicles: there are simply too many cases to consider.

To show that, let me turn back to NCAP: We had two presentations about their plans for adding AV-related tests over the next few years. For instance, by 2018 they plan to add 5 tests to see how well Autonomous Emergency Braking (AEB) helps avoid hitting cyclists. The 5 tests are for a cyclist coming from the left, from the right, from the right but hidden by a building, driving ahead of the car, and ahead-and-to-the-right. All that will be done at predefined speed and lighting conditions, using standard cyclist dummies.

Your first reaction is probably “great – this will save quite a few lives”. Your second reaction might be “but how about multiple velocities, day/night conditions, rain/no-rain, multiple angles of approach, multiple-colored dummies, etc.” (BTW, there is already a running joke about people optimizing their sensors for the standard NCAP black-shirt-blue-trousers pedestrian dummy).

No wonder, then, that several people (including yours truly) asked the NCAP guys about virtual testing. I talked to Nicholas Clay (who presented the Euro-NCAP roadmap) about that, and got the following picture: NCAP is a fairly small organization, and this all takes planning and execution. However, they are considering virtual testing, details to be determined (I plan to continue tracking this).

It will probably be in the spirit of something they already did: Namely, they define a matrix of test conditions, the manufacturers take that matrix and report back their results, and then NCAP spot-checks some (randomly-chosen) cells of the matrix to verify. Note that Nicholas called this process “virtual testing”, but also did not rule out the possibility that the manufacturers will do (most of) their testing of the matrix in simulation (the original meaning of “virtual testing”).

This looks like a very reasonable next step. But consider that this 5x2x2x… matrix tests just one component of autonomous vehicles (AEB), avoiding one kind of accident (hitting a cyclist). So we are talking about a much bigger space to test.
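To make the matrix-plus-spot-check idea concrete, here is a minimal sketch. Only the five cyclist scenarios come from the NCAP plan described above; the other axes, their values, and all the names are my own illustrative assumptions, not anything NCAP has published.

```python
import itertools
import random

# Hypothetical matrix axes. Only the "scenario" values follow the text.
MATRIX_AXES = {
    "scenario": ["from_left", "from_right", "from_right_hidden",
                 "ahead", "ahead_and_right"],
    "lighting": ["day", "night"],
    "weather": ["dry", "rain"],
    "car_speed_kph": [20, 40, 60],
}


def all_cells(axes):
    """Enumerate every cell (one combination of values) of the matrix."""
    names = list(axes)
    return [dict(zip(names, values))
            for values in itertools.product(*axes.values())]


def spot_check(cells, k, seed):
    """Pick k distinct cells at random; a fixed seed makes it repeatable."""
    return random.Random(seed).sample(cells, k)


cells = all_cells(MATRIX_AXES)
print(len(cells), "cells in total")
for cell in spot_check(cells, k=3, seed=2016):
    print("spot-check:", cell)
```

Note how quickly the cell count grows: every new axis multiplies it, which is exactly why checking all cells physically does not scale.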

And then we need to talk about sequential complexity: Most current AV testing looks at a single performance parameter of the system, e.g. time-from-cyclist-observed-to-car-stopped. But many of the issues may only appear when a particular sequence happens: e.g. a long turn in foggy conditions, followed by a badly-marked dividing line, followed by that cyclist.

One can look at single-performance-parameter testing as a form of sub-system testing: even though it involves many physical sub-systems, it tests just one sub-functionality. We all know that sub-system testing is necessary-but-not-sufficient for verifying full-system functionality: Perfect components can be composed to create a buggy system.

This single-performance-parameter mentality often goes together with concepts like Fault Tree Analysis and MTBF (Mean Time Between Failures). These are pretty useful concepts, but they do not quite work for SW bugs (though people have certainly tried).

For instance, what was the MTBF of the Ariane 5? If you must answer, then the right answer is probably “40 seconds”: That’s how long it took the SW bug (which was guaranteed to be activated on that maiden flight) to trigger self-destruct.

AVs, with their many millions of code lines, are bound to have bugs just sitting there waiting to occur under very special circumstances. Concentrating only on single performance parameters like time-from-cyclist-observed-to-car-stopped is like those Ariane 5 engineers checking for the N’th time that the engines have enough thrust – important, but clearly not enough.

A further complication is the need to not only try to avoid failures, but also to check what happens after some initial, non-fatal failure. For instance, the Fukushima reactor designers not only miscomputed the needed height of the sea wall, but also (separately) neglected to protect the backup diesel generators, which were flooded by the tsunami, thus dooming the reactor.

Enough with the problems – let’s talk solutions

OK, so I explained why there is a need to check many more scenarios (and parameters within each scenario) than is done currently. How will that be done?

Here is the gist of my suggestion – the next section discusses possible objections to this vision:

  • Use CDV to run many repeatable tests with many random seeds
    • Do mainly virtual testing of all those scenarios and their parameters
    • Define functional coverage and track it
    • Integrate the coverage notation with fault tree / criticality notations
    • Use various techniques to maximize (auto-fill) coverage of all kinds
  • A subset of those tests should also be run in the various physical platforms (HIL, Driver In the Loop etc.)
    • Define a principled, repeatable methodology for what-runs-where
    • When the test cannot be run as-is on the physical platform, create technology and methodology to transfer at least the “spirit of the test”
    • Similarly, define a principled methodology for combining sub-system verification and full-system verification, and share that with sub-system providers
  • Make the results of the verification transparent (in addition to the ISO 26262 requirement of process transparency)
    • I.e. have an easy way to display the scenarios / coverage / platform-run-decisions etc. to all stakeholders
    • Ideally, define an industry-wide minimal catalog of scenarios and coverage definitions, and an industry-wide convention of what-should-run-where
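For readers who have never seen CDV, the first bullet above can be sketched roughly as follows. This is a toy under loud assumptions: the scenario parameters and their bins are made up, and `run_virtual_test` is a placeholder for a real simulator run, not any actual tool's API.

```python
import random
from collections import Counter

# Hypothetical scenario parameters and their coverage bins.
BINS = {
    "lighting": ["day", "dusk", "night"],
    "weather": ["dry", "rain", "fog"],
    "cyclist_from": ["left", "right", "ahead"],
}


def generate_scenario(seed):
    """Draw one random scenario; the same seed always yields the same one."""
    rng = random.Random(seed)
    return {name: rng.choice(values) for name, values in BINS.items()}


def run_virtual_test(scenario):
    """Placeholder for a real simulation run; always passes in this sketch."""
    return True


def run_regression(seeds):
    """Run one test per seed, recording coverage hits and failing seeds."""
    coverage = Counter()
    failing_seeds = []
    for seed in seeds:
        scenario = generate_scenario(seed)
        if not run_virtual_test(scenario):
            failing_seeds.append(seed)  # rerun with this seed to reproduce
        for name, value in scenario.items():
            coverage[(name, value)] += 1
    return coverage, failing_seeds


def coverage_holes(coverage):
    """List the (parameter, value) bins that no test has hit yet."""
    return [(name, value) for name, values in BINS.items()
            for value in values if coverage[(name, value)] == 0]


coverage, failing = run_regression(range(100))
total_bins = sum(len(values) for values in BINS.values())
print(f"{total_bins - len(coverage_holes(coverage))}/{total_bins} bins hit")
```

The two properties that matter are both visible here: any failing run can be reproduced from its seed alone, and the coverage holes tell you where to aim the next batch of runs.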

Objections, your honor

Here are some possible objections to my proposed direction, and my answers to them. Not all have easy answers, but this is important enough, so I think eventually some answers will be found:

There are too many cases – we clearly can’t check them all: Indeed. But there are standard ways to define scenarios and functional coverage to handle this. In a nutshell, your coverage space should reflect your fears: For instance, if you think there are many potential behaviors (and possible bugs) in the interactions of some parameters, you should achieve cross-coverage of all of these parameters; otherwise it is enough to cover each parameter independently.

Obviously, deciding what scenarios to define, and which parameters to use for each, is domain-specific and non-trivial. Note also that since you will run your random tests many times with different random seeds, you will get unanticipated scenario and coverage mixes, even if you did not plan them (or even measure them).
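The difference between covering parameters independently and cross-covering them is easiest to see by counting bins. A minimal sketch, with parameters and bins that are purely my own illustration:

```python
import itertools

# Hypothetical parameters and bins, purely for illustration.
PARAMS = {
    "car_speed": ["low", "mid", "high"],
    "lighting": ["day", "night"],
    "weather": ["dry", "rain", "fog"],
    "lane_marking": ["clear", "faded"],
}


def independent_bins(params):
    """One bin per (parameter, value): each value must be seen at least once."""
    return [(name, value) for name, values in params.items()
            for value in values]


def cross_bins(params, names):
    """One bin per combination of values of the named parameters."""
    return list(itertools.product(*(params[name] for name in names)))


def coverage_plan(params, feared):
    """Cross the parameters whose interactions you fear, cover the rest singly."""
    rest = {name: values for name, values in params.items()
            if name not in feared}
    return cross_bins(params, feared) + independent_bins(rest)


print(len(independent_bins(PARAMS)))                         # 3 + 2 + 3 + 2 = 10
print(len(cross_bins(PARAMS, list(PARAMS))))                 # 3 * 2 * 3 * 2 = 36
print(len(coverage_plan(PARAMS, ["weather", "lane_marking"])))  # 6 + 3 + 2 = 11
```

Crossing everything grows multiplicatively, so the practical choice is to cross only the few parameters whose interaction actually worries you.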

Note: At this point, my readers from the chip-verification side of the house are smiling under their mustaches – this is all basic stuff in their world. Don’t look so smug, folks: You did not even know what HIL was 5 minutes ago.

There are too many cases to test – only formal verification will save us: This is another common-but-wrong idea. Yes, formal verification can help, but it can’t really verify end-to-end behavior of millions of SW lines + HW + machine learning + human variability. See also the post “FV has much better PR than CDV”.

Simulation models are missing / inexact: That’s a big problem, exacerbated by the fact that companies don’t like to give other companies their models for fear of leaking IP. This problem will not be solved without a strong regulatory push, and the understanding of the value of simulation. Also, I hope simulation models will be written such that they “cover the behavior from above”, i.e. such that each possible behavior will also appear in the model. When a “possible bug” appears during a test, we can then take that run to some physical testing platform to see whether this bug can indeed occur.

How can you even think of all the dangerous scenarios and their consequences: Some of the dangerous scenarios will simply appear by virtue of CDV / coverage filling, but some will not. Finding spec bugs is indeed a big problem – see the post “It’s the spec bugs that kill you” for a discussion of the problem and suggested, tentative solutions.

It is pretty hard to create realistic, completely-synthetic inputs: This is indeed a big problem: For instance, creating a synthetic, believable LIDAR stream with the matching video stream is pretty hard. So (at least currently) you may need to use pre-recorded streams, and e.g. superimpose people-and-animals-jumping-in.

As I have described here, this is what I think Google is doing. They talk here about doing “three million miles of testing in our simulators every single day”, which (assuming 1:1 simulation speed) translates to a few thousand cores constantly running simulations.
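The “few thousand cores” estimate is simple arithmetic, sketched below. The average simulated speeds are my own assumption (the quote gives only the miles per day), and I assume one simulated car per core running around the clock.

```python
def cores_needed(miles_per_day, avg_speed_mph, sim_speedup=1.0):
    """Cores required if each core simulates one car, 24 hours a day.

    `sim_speedup` is simulated time divided by wall-clock time (1.0 = 1:1).
    """
    miles_per_core_per_day = avg_speed_mph * 24 * sim_speedup
    return miles_per_day / miles_per_core_per_day


# Assumed average speeds; the result stays in the low thousands either way.
for mph in (25, 40, 60):
    cores = cores_needed(3_000_000, mph)
    print(f"{mph} mph average -> about {cores:,.0f} cores")
```

Faster-than-real-time simulation shrinks the number proportionally, and slower (e.g. sensor-accurate) simulation inflates it.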

Machine-Learning components (a growing part of AVs) are really hard to verify: For a description of this problem and possible solutions see this post.

Some interesting things I saw

Finally, here are some of the interesting presentations I saw (and discussions I had) during the symposium, described in the context of the previous sections. I hope to keep in touch with some of these people.

Carina Björnsson of Volvo kicked off the symposium with an inspiring description of Volvo’s Drive Me project, which starts next year by letting 100 customers drive autonomous Volvo cars around Gothenburg, Sweden. Volvo has set the somewhat-crazy goal of eliminating all deaths in new Volvo cars by 2020. I don’t know if they’ll achieve that, but (talking to Carina) it seems they have thought of many of the important verification issues, like virtual testing and which of the virtual tests should be moved to which kind of physical testing.

Mugur Tatar of Qtronic gave a good talk about “Automated search for critical situations for advanced driving assistance functions”. He described Qtronic’s TestWeaver tool, which seems to do many of the right things: Defining coverage (code, state, requirement), doing many runs, and trying to maximize both coverage and criticality measures.

Criticality measures are a measure of how close we are to a potential problem (e.g. to a collision, using some simplistic assumptions). TestWeaver can rerun with the same seed while tweaking (mostly-continuous) fields so as to try and cause even-higher criticality measures. A nice touch.
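To illustrate the general idea (and only the idea: this is a generic random local search I made up, not TestWeaver's actual algorithm), here is a toy sketch. The criticality measure, the field names and the bounds are all hypothetical, and the “simulation” is a closed-form calculation in which nobody brakes.

```python
import random

# Hypothetical continuous scenario fields and their allowed ranges.
BOUNDS = {
    "distance_m": (20.0, 100.0),        # car's distance to the crossing point
    "car_speed_mps": (5.0, 20.0),
    "cyclist_offset_m": (2.0, 30.0),    # cyclist's distance to the same point
    "cyclist_speed_mps": (2.0, 8.0),
}


def criticality(fields):
    """Toy measure in (0, 1]: 1.0 means car and cyclist reach the crossing
    point at the same instant (assuming nobody brakes or swerves)."""
    t_car = fields["distance_m"] / fields["car_speed_mps"]
    t_cyclist = fields["cyclist_offset_m"] / fields["cyclist_speed_mps"]
    return 1.0 / (1.0 + abs(t_car - t_cyclist))


def tweak(fields, rng, scale=0.05):
    """Nudge each field by a small random step, clamped to its bounds."""
    new_fields = {}
    for name, value in fields.items():
        low, high = BOUNDS[name]
        step = rng.gauss(0.0, scale * (high - low))
        new_fields[name] = min(high, max(low, value + step))
    return new_fields


def maximize_criticality(start, seed, iterations=200):
    """Keep any tweak that raises criticality; same seed, same result."""
    rng = random.Random(seed)
    best, best_score = start, criticality(start)
    for _ in range(iterations):
        candidate = tweak(best, rng)
        score = criticality(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


start = {"distance_m": 60.0, "car_speed_mps": 10.0,
         "cyclist_offset_m": 10.0, "cyclist_speed_mps": 5.0}
best, score = maximize_criticality(start, seed=1)
print(f"criticality: {criticality(start):.2f} -> {score:.2f}")
```

A real tool would of course compute criticality from a full simulation run rather than from a formula, but the search loop has the same shape.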

They have their own simulation backbone (Silver) and can connect to Simulink and various other automotive-related simulators.

We had a long discussion bemoaning the fact that people don’t do enough complex-scenario virtual simulations (in fact, Qtronic sells many more copies of Silver than they do of TestWeaver). Among other things he brought up the fact that there is a great shortage of good, agreed-upon mechanical simulations of various things.

Willibald Krenn of AIT gave another good talk about smart CDV-like testing. AIT is interested in both AV design (e.g. AVs for construction sites) and AV verification.

Their verification work is pretty advanced: He talked about model-based test generation, using the same system for both verification and mutation-based VE quality assessment, coverage collection (of scenarios, risks and error guessing), a DSL for writing analog properties, FPGA-based run-time assertion monitors for diagnostics, and so on.

They also have a list of more than 1000 criticality measures (similar to Qtronic’s), which they track during simulation. Overall, pretty good.

And he also worried (in an offline discussion) that AV verification engineers were slow to pick up on all this. We discussed various ideas to fix that (this whole blog post is, hopefully, one small step in that direction). He mentioned that Dieselgate might, perhaps, make AV manufacturers more open about publishing the results of tests.

Gunwant Dhadyalla of Warwick U (in the UK) described their 3xD simulator – essentially a setup in which you can test any (physical) autonomous car, by subjecting it to video/Lidar streams, emulated (in-building) GPS, emulated radar (your car sends a radar pulse and they fake a reply), faked physical movement, vehicle-to-anything faked communication (including faked communication errors), and so on.

Some of this is in very early stages, and there are lots of open issues, but I like where this is going: Being able to inflict a CDV-style environment on a physical car, with a lot of control over everything, including error generation.

I met a bunch of people from the Nanyang Technological U and the Land Transport Authority, both of Singapore. Singapore seems to believe in autonomous vehicles (especially for first/last mile transport), and verification is understood to be an important part. They said they had not yet thought through all the verification issues (perhaps they were just being modest), and also mentioned that AV verification requirements could be quite different between countries – for instance Singapore has more motorcycles zipping about than does the US. That issue of per-country (sometimes per-state) differences adds another interesting layer of verification complexity.

Chris Clark of Synopsys talked about finding security vulnerabilities using several methods. Their main weapon is fuzzing, which I think is the right way to go. We had an interesting discussion about how their more-targeted fuzzer compares to more-generic, no-setup fuzzers like AFL (summary: each has its place).

While I have you here, let me say some (potentially-controversial) words about security-vs-safety: I think security research is important (I have written about it e.g. here), but in the context of AV verification, safety is much more important. Yes, somebody will be able to remotely take over a car and drive it off a cliff. That’s willful murder, and I bet willful murder accounts for a tiny percentage of those 30,000 US vehicle-related fatalities/year. As the inimitable James Mickens said of a (different) cyber threat, it “can be used against the 0.002% of the population that has both a pacemaker and bitter enemies in the electronics hobbyist community”.

And then there is terrorism. Believe me, I can invent movie-plot scenarios with the best of them, and in fact I think some bad guys will eventually perform remote terrorism-via-cars. So AV security is important and I am happy some people are working on it. It is just that I expect many, many more people to be hurt due to safety bugs than due to security bugs, and I worry that the theater-like character of terrorism may cause research efforts to be misaligned with that fact.

Several people (including Adela Béres of AdasWorks and David LaRue of FEV) mentioned that level 3 (“conditional automation”) may never happen. This is where the car is autonomous until it needs to interrupt the driver, at which time the driver has 10 seconds to take control. I had a similar prediction in my original Stuttgart report (thinking it would be too stressful for the user), but the main reasoning they gave was verification: The car still has to do the right thing during these 10 seconds, and that’s pretty hard to verify. I have touched on similar issues in the last section of this post, where I compared the verification of mostly-autonomous and fully-autonomous systems.

Finally, Alexander Noack of B-plus gave a good presentation on “Famous pitfalls and limitations during the validation of autonomous systems”. This was about pitfalls encountered during physical testing: Things like unstructured cabling (which slows down setup changes), lack of good time-stamping for parallel streams, faulty data acquisition (due e.g. to wrong SW version or missing setup checks), and so on.

This made me realize (duh) that just like virtual testing, physical testing has its own set of best practices, and you have to be really professional to do it right.

That’s all I have, folks. We live in interesting times. Your comments are very welcome.


I’d like to thank Amiram Yehudai, Yaron Kashai and Sandeep Desai for reading earlier versions of this post.
