DeepXplore and new ideas for verifying ML systems

Summary: This post talks about the DeepXplore paper, and uses it to revisit the topic of verification of ML-based systems

The paper DeepXplore: Automated Whitebox Testing of Deep Learning Systems (by folks from Columbia U and Lehigh U) describes a new and (in my view) pretty important way to verify ML-based systems. And it somehow starts breaking the wall between the ML world and the CDV-like, dynamic-verification world.

As I read it, I kept saying to myself “Hey, this is great stuff. On the other hand, they seem to ignore X Y and Z”. So I corresponded a bit with the authors, and it turns out they are mostly aware of X Y and Z (and in fact plan to attack some of them in future research). I’ll quote their answers in the post below.

Ok, so what is it?

What’s in the DeepXplore paper

The paper describes a new way to verify DNN-based system (note that I am using their acronyms here: DNN for Deep Neural Network and DL for Deep Learning).

From the abstract:

We design, implement, and evaluate DeepXplore, the first white-box framework for systematically testing real-world DL systems. We address two main problems: (1) generating inputs that trigger different parts of a DL system’s logic and (2) identifying incorrect behaviors of DL systems without manual effort. First, we introduce neuron coverage for systematically estimating the parts of DL system exercised by a set of test inputs. Next, we leverage multiple DL systems with similar functionality as cross-referencing oracles and thus avoid manual checking for erroneous behaviors. We demonstrate how finding inputs triggering differential behaviors while achieving high neuron coverage for DL algorithms can be represented as a joint optimization problem and solved efficiently using gradient-based optimization techniques.

DeepXplore efficiently finds thousands of unique incorrect corner-case behaviors (e.g., self-driving cars crashing into guard rails, malware masquerading as benign software) in state-of-the-art DL models trained on five popular datasets including driving data collected by Udacity from cars around Mountain View (CA) and ImageNet data.

There are really four main ideas there that I liked:

Using neuron coverage
Checking DNN outputs by comparing outputs of multiple, similar DNNs
Automatically “nudging” runs towards places where the checking defined in (2) finds an inconsistency, while maximizing the coverage defined in (1), and while taking constraints into account
Doing (3) using efficient, gradient-based joint optimization

I also have issues with just about every one of these. This is not a bad thing: The fact that this paper raises so many interesting questions is a big plus – I am looking forward to follow-up research. Also – my issues are mainly related to the “whitebox-only” nature of DeepXplore (which, of course, is also an advantage).

Let me go through the four ideas in order, discussing why I like them and what issues I see:

Using Neuron coverage

Their coverage metric for a set of DNN runs is “what percentage of all neurons (DNN nodes) are activated at least once during these runs”. The idea is that for a given DNN input, each neuron is either activated (i.e. crosses the threshold) or remains zero. And as the paper says:

It has recently been shown that each neuron in a DNN tends to be responsible for extracting a specific feature of the input. … This finding intuitively explains why neuron coverage is a good metric for DNN testing comprehensiveness.

Note that the nodes may or may not correspond to features which can be described in English (e.g. “the object has two eyes”). But the optimization process of DNN training usually does makes them correspond to some “reusable feature”, i.e. something that is useful for the DNN across many different inputs.

Trying to automatically make neurons active has been done before, e.g. for visualizing what they do (see here), but using neuron coverage for verification seems like a new (and good) idea.

Potential issues:

Note, however, that neuron coverage (like code coverage in SW verification) is really implementation coverage. And implementation coverage (as we all know) is not enough, mainly because it does not help with bugs of omission: If your receive/transmit SW module forgot to implement transmit, or forgot about issues related to receive-while-transmitting, then you can achieve 100% code coverage and be perfectly happy until somebody actually writes a transmit test (or the user tries to transmit). See discussion of the various kinds of coverage (implementation, functional etc.) here.

This is also true for DNNs: If you forgot to train your driving DNN on driving-on-the-left, or on unprotected left turns, or on two cars arriving at the stop sign at exactly the same time, or on people-painted-on-busses, or on Tsunamis, then you may have 100% coverage on a very buggy system.

The authors replied:

We completely agree with your argument. Full neuron coverage (just like code coverage) will not provide any guarantees about all possible bugs. That being said, we are also thinking about extending the definition of neuron coverage to include different types of coverage (e.g., neuron path coverage just like path coverage in traditional software)

I think extending neuron coverage could help somewhat. Suppose (simplifying wildly here) that our driving DNN has a neuron for “dog on the road” and another one for “car on the road”. Then, in addition to covering them individually, we also want to cover their negatives (“no dog”), combinations (“car and dog”, “car and no dog”) and sequences (“car and then dog” – what they called neuron path coverage above).

However, this is still just implementation coverage – see more comments below.

Checking DNNs by comparing to other implementations

The way they check behavior is by comparing multiple, somewhat-different DNN-based implementations and seeing if they all agree. This is an extension of SW “differential testing”, which (for you guys coming from HW verification) is mostly just good old “checking via a reference model”.

Potential issues:

Checking via a reference model usually assumes that the reference model is (mostly) golden, i.e. can really be trusted. This, of course, is not the case here, hence the potential issues.

For instance, will this catch bugs of omission? Only for cases which were considered during training of some of the DNNs but not for others. But I doubt very much this is good enough. The paper says:

If all the tested DNNs make the same mistake, DeepXplore cannot generate the corresponding test case. However, we found this to be not a significant issue in practice as most DNNs are independently constructed and trained, the odds of all of them making the same mistake is low.

But this assumes (a) that you have access to many different implementations (not always possible in realistic, competitive markets), and (b) that at least one of the guys thought of e.g. a Tsunami – also not necessarily realistic.

The authors replied:

This is indeed a limitation of differential testing. One way of testing one model in a standalone manner is to resort to adversarial DNN testing techniques, which currently only allow negligible perturbation invisible to human eyes. One can potentially extend adversarial testing to use a wide range of realistic constraints as you mentioned later. For example, by changing the light effect. However, the major problem with this approach is lack of data labels. It is hard to decide whether the DNN model is making the correct classification or not without access to a human oracle as arbitrary perturbations can, in theory, change the true label of the images arbitrarily.

Indeed, without an “independent” way to check stuff, this is probably a pretty good way. But the CDV option (which indeed means a lot more work) says that a human being will write a separate, manual set of checks / assertions.

For instance, if we are talking about the Udacity driving example, one could write checks which say “don’t run into other cars” etc.. Those are all probabilistic checks, so it is a non-trivial undertaking, but people who do serious autonomous vehicle verification probably do it. BTW, the safety stuff is usually not handled just by ML – more on this below.

These people will also augment implementation coverage with functional coverage, like “did we pass a left turn? an unprotected left turn? a left turn with a car coming our way? a left turn on a rainy day?”, as I mentioned above.

Here is another potential issue with differential checking: Suppose there is a large overlap, but each of the guys considered a few cases others did not. Then running DeepXplore will mark those cases as “inconsistent”, and you will have to go through them all and decide who is correct (i.e. is this a bug in my implementation, or just something one of the other guys forgot?).

Also, there are often cases where there is no “one right answer”: The car could turn either left or right, or (more subtly) the decision boundaries (for things like “can I start turning now?”) can be somewhat different between those guys. And when inconsistencies only mean “there could be an error here” we can expect a long, manual process.

Nudging runs towards check errors, while maximizing coverage and obeying constraints

This is perhaps the most interesting part. As they describe it:

Finally, we demonstrate how the problem of generating test inputs that maximize neuron coverage of a DL system while also exposing as many differential behaviors (i.e., differences between multiple similar DL systems) as possible can be formulated as a joint optimization problem and it can be efficiently solved for large-scale real-world DL classifiers. Unlike traditional programs, the functions approximated by most popular Deep Neural Networks (DNNs) used by DL systems are differentiable. Therefore, their gradients with respect to inputs can be calculated accurately given whitebox access to the corresponding model.

I really like this: This includes both what we call coverage maximization and bug-hunting, in the same tool. And they try to do it while obeying some constraints: For instance, in visual tasks they only make the image darker or cover small rectangles of it, and in malware-detection-in-files they stick to some constraints on the structure of the files.

Potential issues:

A big issue here is what I’ll call the “constraint problem”: The allowed constraints are not nearly flexible enough, so the resulting inputs can only be “the inputs used for training, plus small modifications”.

One would really like to be able to specify flexible constraints, which (together with some underlying “input model”) would allow not just small modifications but drastic changes (like driving on the left or people painted on a bus), while at the same time avoiding completely-unrealistic changes (like cars floating in the air).

Is this possible? It certainly sounds tough. The malware examples shows that one can customize constraints to some extent.

My favorite solution for most of verification is some variant of CDV. Thus, I am hoping to create constrained-random, model-based inputs for verifying all systems, including DNN-based systems. The DeepXplore folks talk only about taking existing inputs and mutating them somewhat. It would be interesting to look for some combined way.

The authors replied:

You are right, the constraints we came up in image setting is still not flexible. We consider them because they can be efficiently guided by the gradients. In fact, there are many other data augmentation technique: for images, we can rotate, flip the images and even do some semantic transformations (e.g., change BMW X6 to BMW X1). However, these transformations cannot be efficiently computed using gradient — we can only transform the images with these constraints randomly, hoping some transformations will trigger the different behaviors between models. It will be very interesting to identify the classes/properties of the constraints that can be efficiently guided by gradients.

… coming up with realistic models/constraints for data like images is a hard problem in itself. In DeepXplore, we start with the realistic input hoping that the mutated sample will be valid and therefore can appear in the wild. We also tried starting with random samples and found that DeepXplore can identify difference-inducing inputs, but they do not look like an image that do not look real.

In the ML world, a popular way of performing similar tasks is to use generative adversarial net (GAN): given a random vector, it can learn to generate real inputs that are indistinguishable to the real inputs (e.g., https://github.com/jayleicn/animeGAN). We are exploring generating inputs using this technique too.

I blogged about GANs-for-verification, and I agree that GANs can help in the generation of somewhat-directed, realistic verification inputs. But I still suspect that the “newness” in them is going to be limited, so this is not enough.

What I would really like to see is a pipeline where the input generation starts from some constrainable, high-level description. For instance, consider the following (purely speculative) flow:

Suppose we have a text-to-image neural network: You tell it “there is a junction with a car coming on the left”, and it presents an image of that. I described (in the post mentioned above) how people did such things using a GAN and an RNN, but there are probably other ways to do it.

Now you feed the resulting image (or sequence of images) into the Udacity driving DNN, and get a required-steering-angle output. Now, rather than perturbing (say) the darkness of the picture, you perturb the text input, and those differential tricks hopefully work all the way back from output to text. And now you can also apply logical constraints to the text.

If that worked (a big if), you could perhaps achieve even more complexity and variability in the input by using yet another DNN for combining the outputs of several such text-to-image DNNs into a composite scene (“a junction with a car coming on the left and a dog on the right, and the stoplight just turned green”).

This idea will probably not work as described – I am just trying to outline a direction which is closer to “real CDV”.

Doing efficient, gradient-based joint optimization on DNNs

In a previous post I said I expect ML to be used more and more in verification of Intelligent Autonomous Systems. One of the reasons I gave was:

Deep Neural Networks are based on “differentiable” computations, so it is much easier to “nudge” them in any direction you want. More on this in a subsequent post.

The current post is that “subsequent post”. Here’s how the authors explain how they compute error-causing inputs:

Note that, at a high level, our gradient computation is similar to the backpropagation performed during the training of a DNN, but the key difference is that, unlike us, backpropagation treats the input value as a constant and the weight parameter as a variable

In other words, they are using essentially the same efficient linear-algebra machine that is the bread-and-butter of DNN training and deployment to do something else – in this case to find the inputs which will reach a coverage point (or an error, or both).

Think how hard it is to do that in normal HW or SW verification: To reach e.g. a particular error line, you do tricks like smart random with lots of manual work, or Concolic generation, or (for small-enough systems) model checking – see some references at the end of this post. But for DNNs, it is relatively easy. BTW, DeepXplore only talks about static, state-less DNNs, but the techniques can be extended to handle e.g. Reinforcement Learning (which involves a sequence of inputs).

I feel this can be used for a lot of things (see also the paper’s section titled “Other applications of DNN gradients”). For instance, here is an initial, vague idea: I wrote before about explainable AI and why it would help verification (and lots of other things). Perhaps it is possible to use tricks like DeepXplore’s “Find decision boundaries and supply examples around them” to somehow characterize local areas around an input (“if the input was changed to this, then the answer would have changed to that”), thus helping users understand the DNN.

Other comments

Here is a bunch of unorganized comments / questions I got by reading this paper:

ML systems probably can’t handle all safety issues alone:

As I described in the section “RL and safety” here, most people assume DNNs alone can’t take into account all the safety-related edge cases (for various reasons, e.g. because too much training on error cases will skew the normal behavior, and because you want to see the code to convince yourself that it matches the safety specifications). Thus one needs some kind of non-DNN “safety wrapper” (I mention three kinds of such wrappers there).

If that’s the case, then (assuming DeepXplore is limited to DNN systems) it can only do part of the job, and for each “bug” it finds, we should then present it to the bigger system to see if the wrapper does the job. This may or may not be easy.

The authors replied:

You are right. Any real world self-driving cars will have some sort of wrapper to handle the underlying DNN decisions instead of directly relying on DNNs to handle all error conditions. The wrapper should also be a good testing target.

About Adversarial examples:

I talked about them here, and said:

This article says that the same fooling examples work for many kinds of ML systems (DNNs, SVNs, decision trees, whatever). Finally, this article tries to explain why this is: fooling examples should often transfer well to all linear models. It also claims that the correctly-classified inputs form only “thin manifolds” in the input space. These are surrounded by fooling examples, but most of the input space consists of “rubbish examples” – inputs which are far from anything considered during the training.

If indeed adversarial examples which fool one DNN implementation also tend to fool other DNN implementations (performing the same job), maybe buggy examples also do. In that case, checking-by-comparison-between-K-implementations won’t work, unless K is very large.

To which the authors replied:

The transferability is indeed very interesting observation. In our experiments we found that while some input behaviors are transferrable a lot of them are not. In fact, the differential testing process’s goal is to find such non-transferable behaviors.

Also note that if the space of “rubbish examples” is in fact huge, then we need to somehow split it into “not considered in training but should have been” (like “driving on the left”) and “not considered in training for good reason” (like “cars flying above each other in space”). This gets back to the tough “constraint problem” mentioned above.

Notes

I’d like to thank the authors of the DeepXplore paper (Kexin Pei, Yinzhi Cao, Junfeng Yang and Suman Jana) for the discussion. I’d also like to thank Matthias Woehrle, Sandeep Desai and Gil Amid for reviewing previous drafts of this post.

	Coverage-driven alig… on It’s the spec bugs that kill y…
	https://otomotif71.w… on Stuttgart impressions: Scenari…
	Daan van der Keur on About “The coming AI hackers”…
	Mariah Jackson on M-SDL, the autonomous vehicles…
	sakhokhar on Machine Learning for Coverage…
	hongseoklee on How to write AV scenarios (and…
	Erik Panu on GPT-3 and verification
	Yoav Hollander on Autonomy markets and their pot…
	Nakkeeran Kumaraswam… on Autonomy markets and their pot…

The Foretellix CTO Blog – AI safety

Now focusing on AI safety (autonomy-related posts go to the company blog)