Verifying interactions between AVs and people

Summary: The interactions between Autonomous Vehicles and people can be complex, which complicates AV deployment. This post summarizes some recent related publications, and then tries to predict the verification implications of all that. For instance, it suggests that verification teams will try to track total-accidents, AV-specific-accidents and AV-specific-annoyances.

Several interesting publications have appeared lately about “bumps on the way to the AV future”, with an emphasis on issues related to interactions between AVs and humans (pedestrians, drivers of non-autonomous cars, AV occupants and so on).

In a sense, this is good news: It means that we are advancing well on the basic sense-and-act parts of AV technology, and have now arrived at these “second order” (but very important) issues. Many people (myself included) think we should start using AVs when they are, say, twice as safe as human-driven cars (using some reasonable metric of safety). It is OK if some issues still remain at that point, as long as we have a path to continued improvement. Thus, now is an excellent time to think about how to solve those human-interaction issues, and how to verify those solutions.

Some recent publications

Here are some recent publications, which I ruthlessly summarized to emphasize this AV-to-people aspect (they contain lots more):

Toyota’s Gill Pratt on Self-Driving Cars and the Reality of Full Autonomy is a thoughtful, interesting interview which says “AVs are not coming as fast as you may think” (HN thread is here). For instance, he mentions the tough issue of understanding why AV accidents occur when they do (see my post about explainable AI).

One of his main themes is that interactions between AVs and people are complex and hard to model, and yet should be modeled and verified. He says:

For example, an autonomous car has to be prepared for … a human-driven car, that’s driven by a person with road rage, and your car has to do the right thing, even though the other car was bad. And so, what’s the answer there? We’re not going to test that very often in the physical world, because half the time, they’re going to crash, so we’ll test it once in a while, but we need to boost it with simulation and that’s quite hard to do.

Asked about verification, he says:

We’re looking at hybrid approaches of simulation and formal methods, blended together to try to get at the dimensionality of the problem. But fundamentally a thing your readers should know is that this really comes down to testing. Deep learning is wonderful, but deep learning doesn’t guarantee that over the entire space of possible inputs, the behavior will be correct. Ensuring that that’s true is very, very hard to do.

Asked “how do you even try to simulate all the crazy things that humans do?”, he says:

We try to model human beings in the same way we model weather or we model traffic. … Every person’s different, and there’s a whole wide range of behaviors, but we know that to some extent it’s possible. As a human driver, we have theory of mind, of how other human drivers act. We run a little simulation in our head, like when we’re at a four-way stop, saying if I were that person, I’d act like this, so I’m going to do this. Theory of mind is this amazing thing that means that making a simulation is possible, because we can build statistical models to predict what other human beings are going to do.

Note that the paragraph above uses “simulation” in two senses: how people simulate in their heads what another driver would do, and how a verification system simulates what people will do (including how they’ll simulate stuff in their heads). More on this below.

Andrew Ng (of ML fame) talks about cases like:

If a construction worker uses hand gestures to tell a car to either go or to stop, no autonomous car today can reliably make the right decision. … If we see children distracted by the ice cream truck across the street, we know to slow down, as they may dash toward it. Today’s computers aren’t nearly as skilled at interpreting complex situations like these.

Rodney Brooks of MIT blogs about Unexpected Consequences of Self Driving Cars. He wonders how AVs will adapt to the unwritten traffic customs of areas near his Cambridge, MA home:

People expect the right of way … But people look to the driver for acknowledgement that they have been seen before they step in front of the car. … in winter people often walk along the roads, trying to give room for the cars to go by, but nevertheless expecting the cars to be respectful of them and give them room to walk along the road.

Cars will clearly have to be able to perceive people walking along the street, even and especially on a snowy day, and not hit them. That is just not debatable. What is debatable is whether the cars will need to still pass them, or whether they will slowly follow people not risking passing them as a human driver would. That slows down the traffic for both the owner of the driverless car, and for any human drivers. The human drivers may get very annoyed with being stuck behind driverless cars.

Somewhat less seriously, this article says that AVs are doomed in New York City:

Because our pedestrians and bikers can be really obnoxious, this logic will end the moment of the autonomous car about 30 minutes after it starts. That’s about how long it’ll take for the 8 million people on the city sidewalks to figure out all they need to do to cross the street whenever they want is simply cross the street. Self-driving cars will be forced to stop.

Brad Templeton (whom I mentioned before) answers both Andrew Ng and Rodney Brooks, saying:

So the short answer is, solutions will be found to these problems if the roads they occur on are commercially necessary. If they are not necessary, the solutions will be delayed until they can be found, though that’s probably not too long.

… many people do expect systems to be developed to allow dialogue between robocars and pedestrians or other humans. One useful tool is gaze detection … machines shining infrared light can easily tell if you are looking at them. … There have been various experiments in sending information in the reverse direction. Some cars have lasers that can paint lines on the road. … You can also flash a light back directly at people to return their eye contact — I see you and I see that you saw me.

Over time, we’ll develop styles of communication, and they will get standardized. It’s not essential to do that on day one…. Services like Uber will send you a human driver in the early days if the car is going somewhere the systems can’t drive, or they might even let you drive part of it. Such incrementalism is the only way it can ever work.

Consistent with Brad’s common-sense approach, he also expects society to solve the obnoxious-New-Yorker issue:

Every time you jump in front of such a car, it will of course have saved the video and other sensor data. It’s always doing that. But the passenger might tell the car, “Please save that recent encounter. E-mail it to the police.” … The worst offender will get identified and get an E-mail from police.

Finally, if you thought Cambridge might be tough for AVs, consider this video from Addis Ababa, Ethiopia.

Verification implications – the big picture

OK, so we have a bunch of complex, human-related AV challenges, and hopefully a bunch of possible common-sense solutions, arriving incrementally. What will that do to AV verification?

Let’s start at the top:

AVs are never going to be perfect
We want to define some threshold which, when crossed, will allow AV deployment (per domain/area – more on that below)
That threshold should then guide both AV design and AV verification

For instance, we could say that the threshold will be crossed when total-accidents, AV-specific-accidents and AV-specific-annoyances are all “low enough”:

Total-accidents: Threshold crossed when AVs are, say, 2x safer than non-AV cars. This will probably mean reductions in accidents-per-mile of all kinds: Half the fatalities, half the injuries, half the fender benders.
AV-specific-accidents: Accidents which “would never happen to a human driver” are (obviously) a subset of total-accidents, but they are going to be much more noticeable, which is why I put them in a separate category: That future video (of, say, an AV misunderstanding hand gestures and thus failing to avoid a runaway car) is going to stick in memory for a long time.
AV-specific-annoyances: These annoying-things-which-are-not-accidents can range from minor (AV passenger’s coffee spills when AV stops suddenly due to Obnoxious-New-Yorker) to fairly major (AV passenger has multi-hour delay for similar human-interaction reasons).
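To make the three-measure threshold concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Rates` type, the per-million-miles unit, and especially the default limits standing in for “low enough”, which somebody would still have to define for real.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Events per million miles in one domain (hypothetical unit)."""
    total_accidents: float
    av_specific_accidents: float
    av_specific_annoyances: float


def threshold_crossed(av: Rates,
                      human_total_accidents: float,
                      safety_factor: float = 2.0,
                      max_av_specific_accidents: float = 0.01,
                      max_av_specific_annoyances: float = 5.0) -> bool:
    """Return True when all three measures are 'low enough'.

    The default limits are placeholders, not proposed values.
    """
    # Total-accidents: AVs must be at least `safety_factor` times safer
    # than human-driven cars in the same domain.
    safe_enough = av.total_accidents * safety_factor <= human_total_accidents
    # The two AV-specific measures get their own absolute limits.
    few_av_accidents = av.av_specific_accidents <= max_av_specific_accidents
    few_annoyances = av.av_specific_annoyances <= max_av_specific_annoyances
    return safe_enough and few_av_accidents and few_annoyances
```

Note that the check is a conjunction: doing very well on total-accidents does not compensate for too many AV-specific-accidents or annoyances.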

Verification teams are probably going to devote most of their efforts to finding bugs influencing total-accidents. But they are also going to devote significant time and imagination to dreaming up scenarios which can uncover AV-specific-accidents and AV-specific-annoyances (and somebody will have to define what “low enough” means there). AV-people interactions are obviously a major contributor to these last two categories.

Note that we are not expecting any of these three measures to ever reach zero. We are simply defining a threshold moment, a moment when they are all low enough. The threshold moment will be reached separately for separate domains (areas) such as:

California highways in good weather
Places like Cambridge (with some restrictions and no snow)
Places like Cambridge (no restrictions)
Places like Addis Ababa

For any given domain, the job of verification is to find bugs in each category (measure), and to eventually show/prove (to ourselves, to the regulators etc.) that the AV is good enough in each category. Obviously, we want to reuse the same verification infrastructure for all domains.

Note that, especially when humans are involved, verification is tough because it is probabilistic. An annoyance is not necessarily a bug, but “too many annoyances with a common, avoidable cause” probably is. I have discussed this in Checking probabilistic components.
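One possible way to turn “too many annoyances with a common, avoidable cause” into a check is a one-sided binomial test per cause. This is only a sketch under assumptions: the function names, the cause labels, the acceptable rate and the significance level `alpha` are all hypothetical, and a real flow might use a different statistical model.

```python
from math import exp, lgamma, log, log1p


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    total = 0.0
    for i in range(max(k, 0), n + 1):
        # Work in log space so large n does not overflow.
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log1p(-p))
        total += exp(log_pmf)
    return total


def suspicious_causes(runs: int,
                      annoyances_by_cause: dict,
                      acceptable_rate: float,
                      alpha: float = 0.01) -> list:
    """Causes whose annoyance count is unlikely under the acceptable rate."""
    return sorted(
        cause for cause, count in annoyances_by_cause.items()
        if binomial_tail(runs, count, acceptable_rate) < alpha
    )
```

For example, with 1000 runs and an acceptable rate of 1%, a cause seen 8 times is unremarkable, while a cause seen 30 times gets flagged as a probable bug.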

Also, verification will proceed using multiple execution platforms (virtual, test tracks etc.). The “Synthetic Sensor Input” problem in AV verification is probably quite relevant.

Verification implications – some details

Here are some thoughts that come to mind:

Discovering new issues and adding them to the verification mix: We are not going to think of all those AV-people interaction issues ahead of time: In fact, I never thought of the “Obnoxious New Yorker” scenario before. However, once we discover such an issue, we should try to implement the solution as generally as possible, and then implement the verification as generally as possible.

This means e.g. creating a highly-parametrized “obnoxious_pedestrian” scenario, running it in many variants (and mixed with many other scenarios) and checking for reasonable (probabilistic) outcomes. I described a similar process in the post The Tesla crash, Tsunamis and spec errors.
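A toy sketch of what such a parametrized campaign could look like follows. It is not a real scenario language or simulator: the parameter names, their ranges, the two deceleration limits and the crude braking model in `toy_outcome` are all assumptions made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ObnoxiousPedestrian:
    """One variant of the scenario (all parameters hypothetical)."""
    av_speed_mps: float   # AV speed when the pedestrian steps out
    gap_m: float          # distance between AV and pedestrian at that moment
    reaction_s: float     # AV detection-plus-reaction latency


MAX_DECEL = 8.0      # m/s^2, assumed physical braking limit
COMFORT_DECEL = 3.0  # m/s^2, assumed limit before the coffee spills


def sample_variant(rng: random.Random) -> ObnoxiousPedestrian:
    """Draw one random variant from the (assumed) parameter ranges."""
    return ObnoxiousPedestrian(
        av_speed_mps=rng.uniform(3.0, 14.0),
        gap_m=rng.uniform(5.0, 60.0),
        reaction_s=rng.uniform(0.1, 0.5),
    )


def toy_outcome(s: ObnoxiousPedestrian) -> str:
    """Stand-in for a real simulation run: constant-deceleration braking."""
    braking_room = s.gap_m - s.av_speed_mps * s.reaction_s
    if braking_room <= 0:
        return "collision"
    needed_decel = s.av_speed_mps ** 2 / (2 * braking_room)
    if needed_decel > MAX_DECEL:
        return "collision"
    if needed_decel > COMFORT_DECEL:
        return "annoyance"
    return "ok"


def run_campaign(n: int, seed: int = 0) -> dict:
    """Run n random variants and count outcomes (seeded for repeatability)."""
    rng = random.Random(seed)
    counts = {"ok": 0, "annoyance": 0, "collision": 0}
    for _ in range(n):
        counts[toy_outcome(sample_variant(rng))] += 1
    return counts
```

The resulting counts are what the probabilistic checks would then look at: not “did any run produce an annoyance”, but “is the mix of outcomes reasonable for this domain”.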

Unification via the scenario catalog: In that post I also suggested the creation of an (ideally industry-wide) ever-growing catalog (library) of AV scenarios (and their related coverage definitions / parameter values). That common catalog will be used by AV manufacturers, regulatory bodies and so on. It would (among other things) help unify the language used by all stakeholders for describing scenarios, parameters and intended results.

And such uniformity is especially important for those potentially-fuzzy AV-people interactions. Consider an initial phase, where just a small (but growing) percentage of all cars are AVs: We really don’t want those AVs to have five different manufacturer-specific ways to react to a crossing person. I expect regulation to eventually extend to those issues, and to also demand standard ways for AVs to signal their intent to people (and probably standard ways to indicate that those are, in fact, AVs).

Modeling people: Clearly (as Gill Pratt says) we need to be able to simulate human decision-making, including variability, mistakes, theory-of-mind and all that.

I have been playing with one specific paradigm for modeling people (BDI), which may be able to do this. Here is a description of a BDI-based prototype I created, as part of a system for verifying autonomous robots.
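For readers unfamiliar with BDI (Belief-Desire-Intention), here is a deliberately tiny Python sketch of the general idea. It is not the prototype mentioned above: the `Pedestrian` class, the `patience` trait and the safe-gap formula are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Pedestrian:
    """A toy BDI-style agent (hypothetical, for illustration only)."""
    patience: float                      # 0.0 = obnoxious, 1.0 = very careful
    beliefs: dict = field(default_factory=dict)
    intention: str = "wait"

    def perceive(self, car_distance_m: float, car_yielding: bool) -> None:
        """Update beliefs from what the pedestrian currently observes."""
        self.beliefs["car_distance_m"] = car_distance_m
        self.beliefs["car_yielding"] = car_yielding

    def deliberate(self) -> str:
        """Pick an intention that serves the desires, given the beliefs.

        Desires are fixed here: get across the street, and stay safe.
        """
        # Assumed rule: more patient people want a larger gap.
        safe_gap_m = 10.0 + 30.0 * self.patience
        # Crude theory of mind: "the car is yielding, so it has seen me".
        if self.beliefs.get("car_yielding", False):
            self.intention = "cross"
        elif self.beliefs.get("car_distance_m", 0.0) >= safe_gap_m:
            self.intention = "cross"
        else:
            self.intention = "wait"
        return self.intention
```

Variability then comes from sampling traits like `patience` per person, and mistakes can be modeled by letting beliefs differ from the true state of the world.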

BTW, autonomous-robot verification (while clearly lagging behind AV verification) has always emphasized the human-interaction part. See for instance this paper (pdf, by my friends at the University of Bristol): When verifying an assistive robot and how it hands an object over to a human, they need to carefully consider all permutations of did-the-robot-notice-that-the-human-noticed.

Country differences vs. neighborhood differences: I assume AVs will have different SW configurations etc. for different countries (and perhaps even states): The differences (in rules, driving side, behaviors, gestures etc.) are just too big. And those different configurations will have to be verified.

I obviously do not expect different SW configurations for different neighborhoods (even though those could have different customs, as Rodney Brooks describes). I do expect verification teams to create many scenarios, representing many such neighborhoods (complete with local, statistical behavior), and to verify that on average the algorithms handle them all more-or-less correctly, and that there are no pathological cases.
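The “good on average, and no pathological cases” criterion could be expressed roughly as follows. This is a sketch only: the per-neighborhood pass rates, the function name and both limits are hypothetical.

```python
from statistics import mean


def check_neighborhoods(pass_rates: dict,
                        min_mean: float = 0.95,
                        min_worst: float = 0.85) -> dict:
    """Check one SW configuration across many simulated neighborhoods.

    pass_rates maps a neighborhood name to the fraction of its scenario
    runs that ended acceptably. Both limits are placeholder values.
    """
    return {
        # Average behavior across all neighborhoods.
        "mean_ok": mean(pass_rates.values()) >= min_mean,
        # Neighborhoods that are handled badly enough to need a closer look.
        "pathological": sorted(
            name for name, rate in pass_rates.items() if rate < min_worst
        ),
    }
```

The two checks are intentionally separate: a good average can hide one neighborhood whose local customs the algorithms consistently misread.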

People vary greatly, and along many dimensions. Culture (and gestures) can be very different even within a country. Those verification teams will have a tough job (and will need anthropologists and sociologists).

From car verification to municipal verification: I agree with Brad Templeton (in his above-mentioned post) that “You improve your car to match the world you are given, you don’t ask the world to change to help your cars”. Thus, initially I expect AVs to drive around in unmodified municipal areas.

However, as AVs multiply, I expect municipalities to start planning with those AVs in mind. There will be many decisions to make: Where and how to cut parking, what to do about public transportation, should there be AV-only lanes, and so on. Some of those decisions are not easily reversed, and thus should be planned, simulated and verified ahead of time.

That’s not going to be easy. Predicting what people will do is hard even in the short term (you may have noticed some instances of that around the world lately), and AVs will still be new-and-changing. I have written a post about the hardships of this municipal-verification issue, and another post about why I expect this to become easier in the future.

In any case, I would hope such “municipal-adaptation” verification projects will be able to reuse many of the verification artifacts created for AV verification. For instance, scenario definitions (especially those related to AV-people interactions) should be reusable.

Notes

I’d like to thank Gil Amid, Bob Bentley, Sandeep Desai, Thomas (Blake) French, Benny Maytal and Amiram Yehudai for commenting on an earlier draft of this post.


The Foretellix CTO Blog