Summary: This post will dig into the verification challenges of autonomous systems, and especially into the challenge of verifying those which use machine learning.
This post will discuss the verification of Intelligent Autonomous Systems (IAS) which have to interact with people and the outside world. Examples are AVs, UAVs, and autonomous robots of various kinds.
IAS are just one category of the complex systems this blog is interested in (e.g. see this picture), but it is an increasingly-important category, with everybody and their brother racing to build, certify and deploy various kinds of IAS.
Like many other complex systems, IAS are hard to design and hard to verify. However, it seems that in IAS the relative size of the verification (and certification) problem is somehow bigger.
For instance, take a look at this Air Force 2010-30 technology vision, It keeps repeating how
Achieving these gains from use of autonomous systems will require developing new methods to establish “certifiable trust in autonomy” through verification and validation (V&V) of the near-infinite state systems that result from high levels of adaptibility; the lack of suitable V&V methods today prevents all but relatively low levels of autonomy from being certified for use.
Obviously, this is not just in military stuff. In many areas, there is a push for more and more autonomy. Take UAVs (drones): They normally have remote pilots (and need some autonomous capabilities just in case communications get cut). But even more autonomy will bring a lot of benefits. For example, there are a bunch of companies selling drone inspection services for offshore oil installations. They would be even more profitable if those drones were autonomous:
“Once aircraft know where other vehicles and obstacles are and they have the ability to safely avoid collisions and areas of known traffic density, it will open up many new applications.”
So it looks like there are very strong incentives to overcome these IAS verification challenges. But it is hard, for various reasons, including:
- The world around the IAS is sort of open-ended
- IAS verification is new, so it is hard to even specify the requirements
- IAS verification is new, so it is too splintered, with too many notations
- Machine learning (often used in IAS) makes verification harder
I mentioned all these issues in previous posts. My best ideas for solving them, such that they are, are summarized at the end of this post.
Now let me expand a bit on that last point about machine learning.
Systems based on machine learning are hard to verify
Machine learning (and especially NNs) present two big problems for verification:
- They work well, but it is hard to understand how they work, in the sense of “what are the rules they operate by”. As somebody said, “they don’t have a spec, they just have a résumé”. Thus it is hard to reason about them and verify them, especially in the context of a bigger system which is not all-NNs.
- A second issue concerns on-the-fly machine learning: This can be quite useful (e.g. let your AV improve as it operates), but some regulatory bodies (and hence manufacturers) don’t like it because it makes verification harder: Essentially, the machine on the road is not the same machine that was verified in the factory.
Here, I’ll expand a bit on the first issue, and talk about what verification means for ML, why it is hard, and how people tried to work around that problem. As always, have your grains of salt handy: I am not a ML expert.
The big issue is that an ML system is opaque (black-box). Yes, this is still all just SW, and you can certainly look at it, but it is very hard to interpret: For instance, the knowledge learned by NNs is encoded as numerical weights of various node-to-node connections. You can print them out and stare at them all day long, but it will not help you much. And yes, people have tried to convert those numbers into readable rules / decision trees (this is called “rule extraction”), but this yields approximate results and does not scale well.
Because ML systems are opaque, you cannot really reason about what they do. Also, you can’t do modular (as in module-by-module) verification. And you can never be sure what they’ll do about a situation never encountered before. Finally, when fixing a bug (e.g. by adding the buggy situation + correct output to the learning set), you can never be sure (without a lot of testing) that the system has fixed “the full bug” and not just some manifestations of it.
Using ML for just the lower level
The issue is less severe when the ML system is just called upon to do some classification / recognition work, with the top-level logic staying in procedural / rule-based logic. This is, in fact, how most ML systems are used today.
For instance, that chess-playing program is not really all NN: The top-level logic (and much else) is plain C++. NNs are used to implement three specific (but crucial) “functions” within the program: Statically evaluating positions, deciding which branches are most ”interesting” in any given position, and determining which moves to search before others.
The following Michael Fisher interview seems to say similar things. His figure 1 shows the top level to indeed be a non-ML “high-level reasoning” (he specifically favors BDI for that), and sounds like he expects this state of things to continue.
Still, even this kind of “lower-level ML systems” need to be verified (to make sure they don’t freak out in previously-un-encountered situations).
Let’s use a (simplified) example of an AV, where the NN takes radar and camera streams as input, and produces “recognition events” such as “a truck is approaching fast on the lane to our left”, “a person is starting to cross the road on the right” and “there is a steel divider on the middle”. Say we have a lot of examples showing that it did this well. However, we as verification people will immediately start asking questions like:
- How well does it do these recognition jobs at night or during fog?
- How well does it distinguish between “a person is starting to cross the road on the right” and “a bus passes on the right with a person painted on it”?
- Did somebody ever train it to recognize “we are approaching a washed-out segment of the road”?
- And so on
Of course, if/when we go to “full-control ML systems” (e.g. when the AV uses NNs not just to perceive events but also to act on them), verification is even harder.
So what do people do about ML systems verification (you ask)? Thank you. An excellent question.
So how do people verify ML systems?
From what I have read so far, there does not seem to be any really-good way to do it. Here are what seem to be the main ways (a good summary is the 2003 paper Verification and validation of neural networks: a sampling of research in progress):
- Cross validation
This is the simplest validation technique, and just about everybody uses it: You train your NN via, say, just ¾ of your samples, and then use the final ¼ to do checking. This simply tells you that training taught it more than just “recognize the training set”.
- Safety monitors
This idea simply says that the ML system is not completely independent: It has some user-written logic (in some programming language) which checks its decisions on the fly and can disable / override it if it does not like the result. This is mainly a runtime mechanism, but can also be used during testing as a (superficial) checker.
- Formal methods
People have tried to use formal methods for verifying ML systems. This is mainly done (I think) in conjunction with “rule extraction” as mentioned above, and did not strike me as very successful (but I am biased so I could be wrong).
- CDV-style testing
Admit it – you knew this was coming, since this whole blog is about using CDV-style testing for various complex systems. This technique has its own challenges, but nevertheless the next two chapters are devoted to it.
CDV-style testing of ML systems
The idea here is to use the usual CDV tool suite for ML systems testing: Automatic constraint-based test generation (enhanced with use cases / scenarios), automatic checking and coverage collection.
CDV-style testing of ML systems has its own issues (see below), but given the unexciting alternatives, I think it will eventually emerge as at least a strong contender.
In the summary referenced above this is called “Automated testing and test data generation methods”. The main issue mentioned there is that creating synthetic data streams is hard.
The issue is this: It is (relatively) easy to define various high-level scenarios (“Truck coming from left while pedestrian starting to cross”). The problem is translating them into continuous streams (in our example – visual and radar streams).
This is not an insurmountable problem (e.g. video games render imaginary visual streams all the time), but it is still pretty hard to do faithfully, while coordinating the various streams correctly.
This is why Google and others seems to test their AVs using a combination of recording and synthetic information: I.e. they probably use recorded trips through existing roads, but may be able to e.g. superimpose synthetic “pedestrian crossing events” on top of that.
Dealing with black-box verification, and the idea of the-VE-as-the-spec
ML systems are opaque. Thus, CDV-style verification of ML systems is harder than “normal” CDV-style verification. However, here are some possible mitigating factors:
- System testing is often black-box anyway. Certainly acceptance testing tends to be black-box: You don’t care how it works, you just want to see that it works well.
- It may be possible to design ML systems to gives visibility into “intermediate information”. For instance, in that “full AV-control ML system” (where a NN does both the recognition of events and the reacting to the events) we can modify the NN so it produces both “what to do next” outputs and “what I just observed” outputs. These “what I just observed” outputs can then be used in checking and coverage. This is called “design for verification”, and is often a good idea.
- While the ML system does not have “sub-modules” that you can test in a modular way, the VE (Verification Environment) itself can be modular.
Let me expand on that last point. Suppose you build the VE for our NN-based full-control AV system in a modular fashion: There is an “identify-risk” module (split into sub-modules for the various risk kinds), an “avoid-risk” module, and so on.
This modular VE lets us verify our (black-box, non-modular) DUT in a modular way. We can say that the (modular) spec of the VE applies to the DUT, and thus we can reason about the behavior in a modular way.
There is actually a further point here: We can use this VE not just for checking our NN, but also for programming it: Remember that e.g. NN supervised learning is nothing more than feeding the NN a big bunch of situations, as well as the right “answers” for them.
Our VE is potentially a good source of both situations (via constraint-based generation) and the right answers (via at-least-partial checking). At least, this can be used to supplement the “normal” supervised learning, adding rare and dangerous situations.
So to summarize:
- IAS verification is hard
- ML systems verification is hard, and can hold back IAS verification (and thus usage)
- CDV-style verification may emerge as a leading candidate for ML systems verification
I’d like to thank Sandeep Desai and Amiram Yehudai for reviewing a previous draft of this post.