Stuttgart impressions: Scenarios and problems

Summary: This post talks about scenarios as the main tool for serious Autonomous Vehicle verification, but mostly about the more “mechanical” obstacles standing in the way of industrial-scale usage of scenarios – those related to repeatability, HIL issues and behavior stability.

As promised, here is my (first) report from this year’s Stuttgart Autonomous Vehicle (AV) Test and Development symposium and related expo. Expect follow-on posts – there was too much stuff for one. Most of this, as always, is based on discussions with people.

Here are some of my top-level observations:

This is moving even faster: There was a feeling that initial deployment of AVs may happen somewhat earlier than people assumed previously (and thus verification efforts are somewhat more urgent). A new article in the Economist (“The market for driverless cars will head towards monopoly”) captures this sentiment well.

Scenarios have arrived: In last year’s Stuttgart report I said “Simulations are now pretty much the accepted way to find most AV bugs (though clearly they are not enough)”. This year, the term “scenarios” was also in the air (and in many of the presentations’ titles). Just what people mean by scenarios varies – more on that in future posts.

You need to run many, many scenarios: I talked here about the need for many scenario runs, and about the need to mix scenarios from multiple risk domains (dimensions). There were many discussions about this, and about the contents of those scenarios: How to cover all the relevant aspects, how to get to interesting corner cases and so on. This is what Foretellix does, and I’ll talk about some of these high-level aspects in some future posts.

But even if all of that is perfect, your verification effort may still be derailed by more “mechanical” problems:

You need a massive number of runs, but several “mechanical” problems stand in the way: I had several discussions with people about the mechanics of executing, checking and debugging the massive number of runs needed to cover all those scenario variations (perhaps many millions of scenario instances per version release). I mostly knew about these problems, but the discussions made them clearer.

So this post is devoted to what I consider to be the three main “mechanical” issues: Repeatability, HIL issues and behavior stability across versions. For each of them, I’ll describe the problem, explain why it may hurt badly, and discuss some possible solutions. If you have other thoughts about the severity of the issues, or other solutions, please share them.

A typical simulation setup

Take a brief look at figure 1 below, which I’ll use to explain all three problems:

Figure 1 shows a typical (but much-simplified) AV simulation setup:

At the top is the “world” (or “environment”) simulator, with all the cars, people, roads etc. (note: just how these are orchestrated via scenarios will be discussed in some other post – for now just assume they are)
At the bottom is the “ego simulator”, which simulates the device-under-test – our AV. The Ego simulator contains the AV’s HW, SW, and ML blocks. It also contains (a model of) the physical pieces: Sensors, engine, brakes and so on.
The ego simulator gets a stream of sensor inputs from the world simulator, and returns the current location of the ego car.
There is also a fairly popular technique called “sensor bypass” (represented by the red circle in the middle), where you bypass the sensors and perception blocks, and feed the actual objects from the world simulator (with some errors inserted) directly into the planner. One reason it is popular (though obviously not enough) is that creating accurate sensor inputs is hard.
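To make sensor bypass concrete, here is a minimal sketch. It assumes a hypothetical `WorldObject` record and made-up noise parameters (`pos_sigma`, `drop_prob`); none of these names come from a real simulator. Ground-truth objects from the world simulator are perturbed with seeded position noise and random dropouts, and the result is what the planner would receive instead of the perception output.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class WorldObject:
    """Ground-truth object as the world simulator knows it (hypothetical)."""
    obj_id: int
    x: float  # metres
    y: float  # metres

def bypass_sensors(objects, seed, pos_sigma=0.2, drop_prob=0.05):
    """Return the object list the planner would see in sensor-bypass mode.

    Perception errors are imitated by Gaussian position noise and by
    randomly dropping objects. All randomness comes from `seed`.
    """
    rng = random.Random(seed)
    seen = []
    for obj in objects:
        if rng.random() < drop_prob:
            continue  # simulated missed detection
        seen.append(WorldObject(obj.obj_id,
                                obj.x + rng.gauss(0.0, pos_sigma),
                                obj.y + rng.gauss(0.0, pos_sigma)))
    return seen

truth = [WorldObject(1, 10.0, 0.0), WorldObject(2, 25.0, -3.5)]
perceived = bypass_sensors(truth, seed=42)
```

Because the inserted errors are driven by an explicit seed, the same seed always yields the same "perceived" world, which matters for the repeatability issues discussed in this post.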

Now that we know what AV simulation looks like, let’s go through these potential issues.

The horrors of non-repeatability

Some test execution platforms (e.g. automated test tracks) cannot be completely repeatable (reproducible). But assuming most verification runs will be done in simulation, one would hope that at least pure-simulation platforms would be repeatable.

Well, no. Many simulation setups (e.g. ROS-based ones) are non-repeatable. I mentioned this several times before, but I am not sure I have been able to fully convey the horrors of non-repeatability when doing real verification (as opposed to “occasional go-no-go testing”). To get the right perspective, consider how things are done in the chip industry, where the cost of bugs is pretty high (though perhaps not as high as in AVs):

Say you manage a major chip design project. You do serious verification by running perhaps hundreds of thousands of simulations every weekend, looking for bugs and corner cases. Some of those runs (say 1%) fail with a checker error message. Then an automatic clustering algorithm tries to cluster those 1K failing runs into “probably the same bug” clusters. It may even re-execute the shortest representative run of each cluster (with all debug / trace options turned on).
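As a rough illustration of that clustering step, here is a sketch that groups failing runs by a normalized checker message and picks the shortest run in each group. Real clustering tools are far more elaborate; the `FailedRun` record, the normalization rule and the sample messages are all assumptions made up for this example.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class FailedRun:
    test: str
    seed: int
    sim_time: int   # simulated time at which the checker fired
    message: str    # checker error message

def signature(message):
    """Normalize a message so runs differing only in numbers cluster together."""
    return re.sub(r"\d+", "<N>", message)

def cluster_failures(runs):
    """Group failing runs by signature; return {signature: shortest run}."""
    clusters = defaultdict(list)
    for run in runs:
        clusters[signature(run.message)].append(run)
    return {sig: min(group, key=lambda r: r.sim_time)
            for sig, group in clusters.items()}

runs = [
    FailedRun("dma_burst", 17, 9000, "FIFO overflow at addr 4096"),
    FailedRun("dma_burst", 99, 1200, "FIFO overflow at addr 512"),
    FailedRun("irq_storm", 5, 3000, "IRQ 7 lost"),
]
representatives = cluster_failures(runs)
```

Here the two FIFO failures land in one cluster, and the run with seed 99 (the shorter one) becomes the representative to re-execute with full tracing. Note that re-executing it only makes sense if the test + seed combination reproduces the same failure.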

Then, some of your best, most-expensive engineers spend the beginning of the working week analyzing what happened, trying to debug the representative runs. Once bugs are understood and fixed, there is a separate effort to create tests which check the fixes. And so on.

And this is what you do, week by week (interspersed with the inevitable design changes) until your chip is good enough to “tape out” (into silicon). You have no choice: Bugs are a major risk, and you don’t last long in a very competitive industry unless you adopt best practices for dealing with them.

Now imagine somebody came to you and suggested that you should still do all of that, except that from now on simulation runs will stop being repeatable. They will still be somewhat repeatable (he hastens to say): Some of those 1K failing runs (i.e. test + random seed combinations) will still fail, but some will not, or not in the same way. Also (he might say) this is actually good for you: Since every run is slightly different, you get to cover even more cases.

Well, chances are you will (respectfully) suggest that he goes back to the madhouse he escaped from. There is no way to do serious verification under these conditions and still be competitive: The effort required to reproduce and debug failures is simply so much higher.

Also, everything becomes much harder and less certain: Those five bugs we fixed last week? Two of them were perhaps not really fixed – it is just that the test + seed combinations that were supposed to demonstrate the bugs ran slightly differently. And don’t even think about doing all the other useful things which depend on repeatability (like re-running a test which has reached an “interesting” point, but tweaking it to explore around that point). And so on.
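One simple way to find out whether a setup is repeatable is to re-run the same test + seed combination a few times and compare a fingerprint of the resulting traces. The sketch below shows the idea; `run_simulation` is a toy stand-in for a real simulator (an assumption for illustration), and the trace format is invented.

```python
import hashlib
import random

def run_simulation(test, seed, steps=100):
    """Toy stand-in for a simulator: all randomness derives from (test, seed)."""
    rng = random.Random(f"{test}:{seed}")
    speed, trace = 0.0, []
    for t in range(steps):
        speed = max(0.0, speed + rng.uniform(-1.0, 1.5))
        trace.append(f"{t} speed={speed:.3f}")
    return trace

def trace_digest(trace):
    """Compact fingerprint of a run; equal digests mean identical traces."""
    return hashlib.sha256("\n".join(trace).encode()).hexdigest()

def is_repeatable(test, seed, reruns=3):
    """Re-run one test + seed combination and compare fingerprints."""
    digests = {trace_digest(run_simulation(test, seed)) for _ in range(reruns)}
    return len(digests) == 1
```

The toy simulator passes this check because nothing in it depends on wall-clock time or OS scheduling. A setup where such dependencies leak in would produce differing digests for the same test and seed.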

And yet in AV verification, some people try to do just that. This only makes sense if AV verification is much easier than chip verification (or if the stakes are much smaller), but I seriously doubt that.

Here is one example (skip if this is getting too technical for your taste): The SW part of the ego simulator (see fig. 1 again) is often constructed using ROS (the “Robot Operating System”). This is not a bad choice – ROS is a useful, open-source robotics framework, and it can run unmodified on the real hardware and in simulation-only mode. However, even in simulation-only mode ROS is not repeatable, as I explained here:

Turns out that currently, runs executed with ROS+Gazebo are not repeatable, mainly because nodes are run as separate Linux processes and thus can shift in time relative to each other, at the whim of the Linux scheduler and HW events.

And if part of your simulation is non-repeatable, the whole thing becomes non-repeatable.

BTW, that specific problem (the non-repeatability of simulation-only ROS) can be fixed: Use a repeatable, seed-based scheduler, have all random calls use an externally-supplied seed, and so on. I discussed that with some of the ROS people, and they acknowledged the issue but had higher priorities at the time. If you know somebody who is willing to be that hero (fix the problem and put the result back in the public domain) have them send us a note (info@foretellix.com) – we would be happy to support them with technical advice and even a small grant if necessary.
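Here is a minimal sketch of what a repeatable, seed-based scheduler could look like. This is not ROS code and does not use any ROS API: the "nodes" are plain Python generators standing in for real nodes, and all names are hypothetical. The point is only that a single seeded random generator decides which node steps next.

```python
import random

def node(name, n_msgs, log):
    """A toy 'node': each step publishes one message, then yields control."""
    for i in range(n_msgs):
        log.append(f"{name}:{i}")
        yield

def run_schedule(seed, n_msgs=3):
    """Interleave nodes in an order chosen only by `seed`.

    Only one node runs at a time, so OS scheduling and the number of
    cores cannot affect the resulting event order.
    """
    rng = random.Random(seed)
    log = []
    runnable = [node(name, n_msgs, log)
                for name in ("perception", "planner", "control")]
    while runnable:
        current = rng.choice(runnable)   # the seed decides who runs next
        try:
            next(current)
        except StopIteration:
            runnable.remove(current)     # node finished
    return log
```

Calling `run_schedule` twice with the same seed returns the same interleaving, while different seeds generally explore different interleavings. So you keep the "cover more cases" benefit without giving up repeatability.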

Note that chip people sometimes must work with non-repeatable environments and false errors (e.g. when things fail in the lab). So they know all those useful tricks like record-all-the-inputs-to-each-module-and-then-do-module-by-module-debug. But this is much more tedious (and does not solve the other problems caused by non-repeatability mentioned above). So chip people would not dream of calling that “best practice” for their main verification flow.
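For readers who have not seen the record-and-replay trick, here is a small sketch. The `Recorder` class and the two toy modules are invented for illustration: every module call is logged with its inputs and output, and later any single module can be re-run alone on its recorded inputs.

```python
import json

class Recorder:
    """Wraps module functions and logs every input (and output) they see."""
    def __init__(self):
        self.log = []  # list of {"module", "inputs", "output"} records

    def wrap(self, name, func):
        def wrapped(*args):
            out = func(*args)
            self.log.append({"module": name, "inputs": list(args), "output": out})
            return out
        return wrapped

def replay(log, name, func):
    """Re-run one module alone on its recorded inputs; return mismatching records."""
    return [rec for rec in log
            if rec["module"] == name and func(*rec["inputs"]) != rec["output"]]

# Toy modules standing in for real pipeline stages.
def perceive(distance_m):
    return distance_m < 30.0           # "obstacle ahead?"

def plan(obstacle, speed):
    return 0.0 if obstacle else speed  # target speed

rec = Recorder()
perceive_r, plan_r = rec.wrap("perceive", perceive), rec.wrap("plan", plan)
for dist in (80.0, 40.0, 20.0):
    plan_r(perceive_r(dist), 15.0)

saved = json.loads(json.dumps(rec.log))  # stands in for writing the log to disk
```

`replay(saved, "plan", plan)` returns an empty list, while replaying a modified planner against the same log shows exactly which recorded inputs now produce a different output. This debugs one module at a time, but it does not make the full run repeatable.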

The potential horrors of HIL

How bad are HIL (Hardware In the Loop) AV setups? I don’t have first-hand knowledge of this, but my tentative answer (based on several discussions) is “often pretty bad”. These setups are often hard to maintain, have to run in real time, and are non-repeatable. I am sure not all setups are like this, but apparently quite a few are.

The “Rest of car” block in figure 1 is made from many components (e.g. ECUs – Electronic Control Units). Those often do not have accurate simulation models (sometimes because the suppliers of these components want to protect their intellectual property). Thus, they must be HIL-simulated.

This (combined with the fact that any specific component may come in several, slightly-different variants from several suppliers, used interchangeably in the same AV) can make this whole rest-of-car simulation a real pain, supposedly causing some teams to delay full-AV verification.

If this is indeed the case, perhaps one should consider also using a high-level model of the “Rest-of-car” in the full-AV simulation. This is of course not enough (you also need to simulate with the full, accurate models, or their HIL equivalent), but it may be the best way to catch unexpected bugs before they cause real trouble.

The HIL situation may actually be getting worse now, because so much of the AV is constructed using Machine Learning. For instance, in Fig. 1, the perception module and parts of the planner are often ML-based.

Why is ML a problem? Well, those ML blocks are usually executed by ML chips (made by Nvidia, Mobileye and their many new competitors). And it turns out most of those chips do not have a fast, accurate simulation model. In other words, you must use the actual chip in every simulation.

This has some non-trivial implications. It means that you have no alternative to HIL: You need to run all your millions of simulations on it. If you want to run 1000 simulations in parallel, you have to buy 1000 chips / boards, maintain them in a simulation setup, and replace them whenever the chip version changes. Also, this setup may be non-repeatable, getting us back to the horrors described in the previous chapter.
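A back-of-the-envelope calculation shows how the number of boards bounds throughput when every run must execute in real time. The figures below (one million runs, 60 seconds per scenario) are purely illustrative assumptions, not measurements from any real setup.

```python
def hil_wall_clock_hours(n_runs, run_seconds, n_boards):
    """Wall-clock hours to finish n_runs when each must run in real time
    on one of n_boards physical boards (setup time and failures ignored)."""
    batches = -(-n_runs // n_boards)   # ceiling division
    return batches * run_seconds / 3600.0

hours = hil_wall_clock_hours(n_runs=1_000_000, run_seconds=60, n_boards=1000)
```

Under these assumptions the regression takes roughly 16.7 hours on 1000 boards, and the only ways to shorten it are more boards or shorter scenarios, since real-time hardware cannot be sped up the way a pure simulation sometimes can.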

Note that even if you run in “sensor bypass” mode, you may still need to run in HIL mode if you use ML in your planner.

Making HIL setups repeatable is possible – chip design people are familiar with emulation setups where somebody has taken care of all the issues (cabling, jitter, rate adaptation etc.). These setups are absolutely, coffee-break repeatable (defined here to mean “for any possible bug encountered during a run, you can break anywhere in the code, take a coffee break, come back and hit ‘continue’, and you will still encounter the same bug”). However, these setups are cumbersome, expensive and sometimes slow.

Again, I feel I am on shakier ground in this chapter: Perhaps many of these vendors do supply a good simulated version of their chips, or they supply convenient, repeatable HIL setups, or there is some reasonable FPGA-based solution, or people are willing to take the trained ML system and just run it on stock CPUs during verification. My general impression was that the answers (reading from left to right) are “no, no, no and no”, but I could be wrong, so comments are very welcome.

The lesser horrors of unstable behavior across versions

Even if you take care of repeatability and HIL issues, you may still face the problem of unstable-behavior-across-versions. This is a smaller problem, but it is also harder to fix completely. Let me explain:

Say your total AV pipeline (the lower part of fig. 1) is very complex and has a lot of ML in it. This may make it very hard to have a bunch of go / no-go tests. Here’s why: Suppose that you changed some planning parameters, or re-trained your ML systems. This will often cause hard-to-predict changes in overall behavior: The result may still be legal, but subtly different. You certainly cannot compare the logs to the previous behavior (a problematic practice in any case).

All SW systems have this issue to some extent, but it is worse in AV-like systems, because small changes cause many unpredictable behavioral changes, making the changes harder to check. Probabilistic systems like AVs are harder to check in general, as I described here.

Solving this problem is hard. It probably involves a small set of tightly-controlled go-no-go tests, combined with a set of runs covering a “space”, and a set of probabilistic checks verifying that they still reside within the “correct” area. I’ll leave the complex issue of checking / grading runs for another post.
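To give a flavor of what such a probabilistic check might look like, here is a sketch. Instead of comparing logs, it collects one metric per run (here a hypothetical "minimum gap to another object") over a set of runs and checks that the new version stays inside an accepted envelope. The metric, thresholds and synthetic data are all assumptions chosen for illustration.

```python
import random
import statistics

def violation_rate(min_gaps_m, safe_gap_m=2.0):
    """Fraction of runs whose minimum gap to another object fell below the limit."""
    return sum(g < safe_gap_m for g in min_gaps_m) / len(min_gaps_m)

def within_envelope(baseline, candidate, max_rate=0.01, max_mean_shift_m=0.5):
    """Probabilistic check: the candidate may differ run by run, but its
    violation rate and mean gap must stay within the accepted area."""
    rate_ok = violation_rate(candidate) <= max_rate
    shift = abs(statistics.mean(candidate) - statistics.mean(baseline))
    return rate_ok and shift <= max_mean_shift_m

# Synthetic per-run metrics standing in for two software versions.
rng = random.Random(0)
baseline = [rng.uniform(3.0, 8.0) for _ in range(1000)]
retrained = [g + rng.uniform(-0.3, 0.3) for g in baseline]  # subtly different
regressed = [g - 2.0 for g in baseline]                      # clearly worse
```

The "retrained" data differs from the baseline in every single run yet passes the check, while the "regressed" data fails it. A real grading scheme would need many such metrics and a more careful statistical treatment.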

To summarize: Even if you have a perfect system for defining, mixing and running scenarios, you still need to pay careful attention to the above problems. Most AV verification people probably know about them in principle, but I think it is easy to underestimate their total effect.

Notes

I’d like to thank Yaron Kashai, Ziv Binyamini, Sandeep Desai, Kerstin Eder, Amiram Yehudai, Daniel Meltz, Gil Amid, Avishai Silvershatz and Roberto Ponticelli for commenting on earlier versions of this post.

Stay tuned for my next Stuttgart report.

[Added 28-June-2018: A more technical note regarding that ROS repeatability issue: I was asked if I am proposing to change ROS scheduling in general. So let me clarify.

I am not suggesting changing ROS in general: I don’t think it can be made repeatable when running e.g. on multiple CPUs. All I am suggesting is that there will be a repeatable version of it, for running in fully-simulated mode. ROS2 has a more flexible threading model, and I think you could (optionally) implement it on top of a repeatable thread package such as SystemC, which fakes multiple parallel threads in a completely deterministic way (per-seed). This would work the same regardless of any real delays: Thread scheduling order is simply randomized using the seed, and two threads never really run in parallel even if you have multiple cores.]



The Foretellix CTO Blog – AI safety

Now focusing on AI safety (autonomy-related posts go to the company blog)