Using program induction for verification

Summary: I discussed before (e.g. here) how connecting rule-based verification to the rule-less, amorphous Machine Learning world is really hard, and yet necessary. The current post talks about a somewhat-exotic technique called Program Induction (PI), and how it might (eventually) help bridge that gap.

What’s program induction?

Background: I always liked the idea of synthesizing “readable formulas” from data. Some examples of that are Lipson’s 2009 paper, Shapiro’s Inductive Logic Programming, and Gamalon (intriguing, but lacks details). I also liked the idea of “Neural networks as programs” (e.g. see this excellent post on Colah’s blog).

Last week I met Subramanian Ramamoorthy (Ram). Ram splits his time between the University of Edinburgh and FiveAI (which I mentioned here – they really seem to understand verification). I was pleasantly surprised to find that Ram has already thought about much of this, and even published some related papers (see below). Here are some of the ideas that came up during the conversation (and some I thought about later).

Their PI paper: The paper Explaining transition systems through program induction (pdf), by Ram and Svetlin Penkov (also of the University of Edinburgh and FiveAI) talks about the idea of searching through a space of programs, looking for “the simplest program which matches some observations” (note that here, and below, I am simplifying terribly).

Here is the main idea: You can view back-propagation (the process used to train Neural Networks) as a process for constructing a computational graph (essentially a program) which translates correctly from inputs to outputs. It is just that “normal” back-propagation constructs a very simple (and large) program made up of the few constructs used by NNs (“multiply by weight”, “add” etc.). But one could, in principle, use the same gradient descent machinery to construct programs made of other, user-supplied terms (primitives and connectors).

This is what they (more-or-less) do. And the result is a program which is usually short, and yet is a good description of all observations.
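To make this a bit more concrete, here is a minimal sketch (mine, not the paper’s actual machinery) of the core move: fix a tiny “program template” over user-supplied terms, let plain gradient descent find the coefficients, and keep only the terms that survive. Everything here (the terms, the data, the pruning threshold) is made up for illustration, using a formula shaped like the pendulum example below.

```python
import numpy as np

# Hypothetical "program template" over user-supplied terms:
#   acceleration ~ w0*x + w1*(x*x) + w2*1
# Gradient descent over the coefficients plays the role of back-propagation;
# a real PI system would also search over which terms appear at all.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
observed_accel = x * x - 4.8 * x                   # pretend these are observations
terms = np.stack([x, x * x, np.ones_like(x)], axis=1)
w = np.zeros(3)

for _ in range(5000):
    pred = terms @ w
    w -= 0.01 * (2 * terms.T @ (pred - observed_accel) / len(x))

# Prune near-zero coefficients to get a short, readable formula
names = ["x", "x*x", "1"]
formula = " + ".join(f"{c:.2f}*{n}" for c, n in zip(w, names) if abs(c) > 0.05)
print("acceleration =", formula)    # something like "-4.80*x + 1.00*x*x"
```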

Examples: The paper describes using this technique to create small “programs” (in a functional LISP-like language) for three example domains:

  • Describing the behavior of some physical systems like a pendulum
    • Terms: Acceleration, velocity, position (x), +, -, *, constants
    • Example result: acceleration = x*x - 4.8*x
  • Describing the behavior of an agent (specifically, a Deep Q Network) trained to play ATARI Pong
    • Terms: next_move, ball_y, agent_y, +, -, *, constants
    • Example result: next_move = 86*ball_y - 0.85*agent_y
  • Describing the behavior of a human building a cube tower
    • Terms: pick, place, cube<n>, unspecified-parameter
    • Example result: pick cube_5; place above cube_4; pick cube_1; place above cube_5; …
    • Note that the machine did not really say “above”; it had a value for the unspecified parameter which humans interpreted as “above”

Some comments:

  • I found the “Related work” section a good survey of, well, related work
  • It should be possible to create separate ML-based systems to recognize user-defined terms (like “above”), such that these terms could then be used in the program construction
  • They started playing with adding general “operators” (program constructs) to the language, but this is at an early stage
  • The discussion below is about using some “future PI”, not necessarily the thing described in the paper

Implications for verification

PI should be helpful for several things, e.g. Explainable AI (a topic I discussed here). Below I’ll just talk about how it might be useful for dynamic verification.

I’ll divide this into three areas (though there is a lot of overlap):

  • Monitoring and checking
  • Coverage maximization
  • Anomaly detection and creating new scenario / coverage definitions

Monitoring and checking: I devoted a whole blog post to discussing Where Machine Learning meets rule-based verification. In it, I asked “what can be done with ML and what must still be done by a human”, and said:

For instance, consider verifying the parametrized scenario “Our AV overtakes another car”: Recognition of “an overtake” is fuzzy, so it should perhaps be ML based. Checking for errors (“Did it take undue risks? Was it too timid?”) is also fuzzy and thus hard to express in rules, so perhaps we just need to train some “overtake-grading-and-checking network” (via labeled examples). And so on – let’s forget about rule-based requirements, and just do this all-ML.

But this feels wrong. There should be some “logical”, rule-based list of requirements for at least the black-and-white cases (“Don’t collide with the car you are overtaking, except in very weird cases”).

With PI, we could perhaps do the following:

  • Use PI to create a program / model of “what an overtake is”, and then have a user manually modify that to create a parameterized “overtake recognizer”
  • Similarly, take lots of videos of overtakes, and have users categorize each overtake as good, somewhat bad (e.g. too timid), or bad. Have PI create programs / models for the somewhat-bad and bad categories (ideally based on the initial “what is an overtake” model). Have an expert user inspect those and create a parameterized “overtake checker” (a toy sketch of such a checker appears right after this list)
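To illustrate what I have in mind, here is a toy sketch of such a parameterized “overtake checker”. Nothing here comes from a real system: the trace format, the parameters and their thresholds are all invented, and in practice they would come out of PI output plus human editing.

```python
from dataclasses import dataclass

@dataclass
class OvertakeParams:
    # Hypothetical thresholds a human would tune after inspecting the PI output
    min_gap_m: float = 2.0         # never get closer than this to the overtaken car
    max_duration_s: float = 12.0   # "too timid" if the overtake drags on longer
    max_lat_accel: float = 3.0     # crude "undue risk" proxy, in m/s^2

def check_overtake(trace, p=OvertakeParams()):
    """trace: list of per-timestep dicts with keys 'gap_m', 't_s', 'lat_accel'
    (a stand-in for whatever the real fused-sensor representation would be)."""
    errors = []
    if any(step["gap_m"] < p.min_gap_m for step in trace):
        errors.append("undue risk: gap below minimum")
    if trace[-1]["t_s"] - trace[0]["t_s"] > p.max_duration_s:
        errors.append("too timid: overtake took too long")
    if any(abs(step["lat_accel"]) > p.max_lat_accel for step in trace):
        errors.append("undue risk: harsh lateral maneuver")
    return errors

# Tiny usage example with a made-up trace
trace = [{"gap_m": 3.0, "t_s": 0.0, "lat_accel": 1.0},
         {"gap_m": 1.5, "t_s": 4.0, "lat_accel": 2.0}]
print(check_overtake(trace))    # -> ['undue risk: gap below minimum']
```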

Note that the overtake checker could be created incrementally, using both PI outputs and just plain common sense. For instance, in the above post I wrote:

Consider again our “overtake” scenario: Let’s say that initially we just have the black-and-white rule “Don’t collide with the car you are overtaking, except in very weird cases” and everything else is considered “gray”. Then we take the results of many random overtake runs, and use ML to extract sub-scenarios, areas of concern and parameters. Then, with some human help, we write new rules for these sub-scenarios, thus reducing the no-rule “gray area”, and so on.

With PI, we should hopefully be able to cluster the behavior in the gray area into multiple sub-domains, each characterized by a small “program”. We could then take those programs and turn them into sub-checkers (and also into scenarios which exercise those sub-checkers in various interesting ways).

Which brings us to:

Coverage maximization: Coverage maximization is a general name for “any technique which helps you automatically reach more test coverage quicker”. I discussed coverage maximization in several posts, including this post about using ML for coverage maximization.

ML-based coverage maximization may have a similar problem of combining user-defined scenarios with unreadable, ML-based optimizations. A possible solution is to have PI create readable “recipes” for getting to coverage sub-goals, and then combine those (perhaps using non-ML machinery) with user-defined scenarios.
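As a trivial example of what I mean by a readable “recipe” (all names and numbers below are made up): a user-defined scenario is just a set of parameters with legal ranges, and a PI-produced recipe narrows some of those ranges because, in past runs, that narrowing tended to reach a particular coverage sub-goal. Plain, non-ML code can then combine the two.

```python
import random

# A (hypothetical) user-defined scenario: parameter names and their legal ranges
overtake_scenario = {"rel_speed_kph": (5, 60), "gap_m": (1, 30), "rain_level": (0, 1)}

# A (hypothetical) PI-produced recipe for reaching the sub-goal
# "overtake aborted mid-maneuver": readable, and narrower than the scenario
abort_recipe = {"rel_speed_kph": (5, 15), "gap_m": (1, 5)}

def draw_params(scenario, recipe=None):
    """Sample scenario parameters, narrowing to the recipe's ranges when given."""
    params = {}
    for name, (lo, hi) in scenario.items():
        if recipe and name in recipe:
            lo, hi = max(lo, recipe[name][0]), min(hi, recipe[name][1])
        params[name] = random.uniform(lo, hi)
    return params

print(draw_params(overtake_scenario, abort_recipe))
```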

Note that in all this I implied (without going into details) that the resulting models/programs are “mostly-declarative”, and can be used as both checkers and generators.

Anomaly detection: Suppose you have already performed many runs according to your current scenarios / coverage definitions, and collected a coverage database C. Suppose further that you have a set S of run traces (which you got either from actual recordings of cars driving around or from the runs mentioned above). You would now like to answer two related questions:

  • Which of the runs in S represent anomalies, i.e. points in the execution space which are absent / rare in C?
  • Having found / clustered those runs into N “anomaly groups”, how do you (semi-)automatically create scenario / coverage definitions for each such group, so as to run many (meaningful) variations on it?

Plain old ML-based clustering may be enough for the anomaly detection itself, but perhaps we can use PI to make it more “symbolic”.
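Here is a minimal sketch of that “plain old” distance-based flavor, just to pin down terminology: every run is summarized as a vector over the coverage items, and a run in S counts as an anomaly if it is far from everything in C. The data, features and threshold below are all synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
# C: coverage vectors of the runs we have already seen (one row per run);
# S: new run traces, reduced to the same coverage features.
C = rng.normal(0.0, 1.0, size=(500, 4))
S = np.vstack([rng.normal(0.0, 1.0, size=(20, 4)),    # ordinary runs
               rng.normal(6.0, 0.5, size=(3, 4))])    # a few genuinely odd ones

# Novelty score: distance from each run in S to its nearest neighbour in C
dists = np.linalg.norm(S[:, None, :] - C[None, :, :], axis=2)
novelty = dists.min(axis=1)

threshold = np.percentile(novelty, 90)    # crude cut-off, purely illustrative
print("anomalous runs in S:", np.where(novelty > threshold)[0])   # the odd ones
```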

The second problem is especially hard: for instance, in the Waymo article I mentioned here, they make it sound as though, any time they discover a new aspect of driving they did not consider (like a two-lane roundabout), human beings have to generalize it manually.

But with PI, perhaps we can create “programs which describe the anomalous sub-domains” as suggested above. Let me clarify this via an example (again, simplifying everything):

Suppose we have many recordings of cars negotiating a traffic circle (after sensor fusion and all that). We could use Inverse Reinforcement Learning to train an ML system to negotiate such traffic circles, or we could use PI to create a rule-based system.

Say we told the PI to use terms like “There is a car already in the circle, in lane x”, “wait”, “go” and so on. It might come back with a model M1, consisting of a bunch of related, interacting rules like “If there is a car in the circle and close to me then wait, else go”. We can then convert M1 into a parameterized scenario, and test it with many parameter values (e.g. what if the car is close, but not very close).

Say we continue feeding more recordings into the PI system, and some of them are unusual / don’t fit M1. Ideally, a good PI system will now produce M2, which says e.g. “like M1, but whenever a police car with a wailing siren approaches, all cars give it the right of way in the circle”.

That’s pretty good – it means the PI system detected an anomaly (behavior when a police car is approaching) and wrote a small “difference rule” for it (the difference between M1 and M2). Now we can (hopefully) create generalized scenarios for the anomaly and combine them with other things.
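To make the M1 / M2 story concrete, here is a toy, hand-written rendering of what such models might look like if spelled out as code. The terms (“dist_m”, “is_police”, “siren”) and the 10-meter threshold are pure inventions for the example:

```python
def m1_decide(cars_in_circle, close_m=10.0):
    """M1: wait if any car already in the circle is close to our entry, else go.
    Each car is a dict like {'dist_m': 7.0, 'is_police': False, 'siren': False}."""
    if any(c["dist_m"] < close_m for c in cars_in_circle):
        return "wait"
    return "go"

def m2_decide(cars_in_circle, close_m=10.0):
    """M2 = M1 plus the "difference rule": always yield to an approaching
    police car with a wailing siren, regardless of distance."""
    if any(c["is_police"] and c["siren"] for c in cars_in_circle):
        return "wait"
    return m1_decide(cars_in_circle, close_m)

cars = [{"dist_m": 25.0, "is_police": True, "siren": True}]
print(m1_decide(cars), m2_decide(cars))    # -> go wait
```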

How realistic is this?

PI seems promising, but does not seem ready for prime time yet.

What’s plan B? What are the current best non-PI solutions? Well, there is no single best solution. For monitoring and coverage, we may need to continue using “manual” rules. Similarly, once an anomaly is found, somebody will have to generalize it manually (as per that Waymo article).

For coverage maximization, two promising directions I mentioned before are “novelty coverage” and “criticality coverage”. Novelty coverage (based on the distance of a new run from the runs so far, as measured over the user-specified coverage definitions) simply helps you “go to places you did not visit before”.

This is a popular technique in security-related fuzzing – e.g. AFL (described here) does it without using ML. Here is a paper (pdf) about ML-based novelty search in the context of simulation, and here is a paper about using the similar concept of “Curiosity-driven exploration” to help a Reinforcement Learning agent learn faster.
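In its simplest form (and ignoring everything AFL and the papers above actually do), novelty-driven selection can be as crude as: summarize each run over the coverage definitions, and prefer the candidate run that lands farthest from everything you have already executed. A sketch, with made-up coverage vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
done = rng.uniform(size=(50, 3))     # coverage vectors of runs already executed

def novelty(candidate, done_runs):
    """Distance from a candidate's (predicted) coverage vector to the closest past run."""
    return np.linalg.norm(done_runs - candidate, axis=1).min()

# Propose a batch of candidate runs and execute only the most novel one
candidates = rng.uniform(size=(20, 3))
best = max(candidates, key=lambda c: novelty(c, done))
print("most novel candidate:", np.round(best, 2))
```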

Criticality coverage is another good way to direct test generation. It measures how close we are to some (suspected) “danger state” (see here for one example). A new paper from people at MIT talks about the related idea of using the difference in suggested steering angle (between two different ML-based driving systems) as an indication of interesting edge cases.
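Again just to pin down the idea (the state representation and the thresholds are invented, and this is not the MIT paper’s actual method): a criticality signal can be as simple as time-to-collision with respect to a suspected danger state, or the disagreement between two driving stacks on the same input.

```python
import math

def time_to_collision(gap_m, closing_speed_mps):
    """Criticality signal #1: how close we are to a suspected danger state."""
    return math.inf if closing_speed_mps <= 0 else gap_m / closing_speed_mps

def steering_disagreement(angle_a_deg, angle_b_deg):
    """Criticality signal #2: disagreement between two driving systems on the
    same input, used as a hint that this might be an interesting edge case."""
    return abs(angle_a_deg - angle_b_deg)

def is_interesting(gap_m, closing_speed_mps, angle_a_deg, angle_b_deg):
    # Made-up thresholds: flag a step if either signal crosses them
    return (time_to_collision(gap_m, closing_speed_mps) < 2.0
            or steering_disagreement(angle_a_deg, angle_b_deg) > 10.0)

print(is_interesting(gap_m=8.0, closing_speed_mps=5.0, angle_a_deg=3.0, angle_b_deg=20.0))
```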

That is not a bad plan B. But PI, if it worked well, could really make a big difference. Will it eventually “work well”?

Issues with PI: I am really no expert in this, but here are some potential issues with PI:

  • It is somewhat hard for me to imagine a PI system being able to write the full, readable rule-set for, say, a sensor fusion NN. Will it really be able to create the kind of rules that the machine vision guys tried forever to perfect (and eventually admitted defeat to NNs)? Note, however, that people who understand this area told me “eventually, possibly yes”.
  • Much depends on the choice of terms. I referred here to the paper “Generating Visual Explanations”, which describes how to generate pretty-good textual explanations (for a wild-bird classifier). Except that the explanations may have nothing to do with how the classifier actually worked – they depend entirely on the user-defined vocabulary. A similar problem may await PI, but hopefully it will be easy to spot – simply check how stable the resulting program is as more, and different, input-output pairs are processed: if the program continues to work for all those new samples, then we may as well call it “the right model”.
  • Even good models will sometimes need to be updated as more data comes in. Thus, even in the best case, we need to think of PI not as “supplying rules” but rather as “helping in an ongoing flow of looking for rules”.
  • PI systems are perhaps stuck with not being able to distinguish between “X should never happen” and “X has very, very low probability”. This may be OK if you want to use the resulting models to generate scenarios with the expected distribution (i.e. the same distribution as the observations from which the models were learned), but it is a real problem for us verification people: many bad bugs are only revealed by a sequence of infrequent events, so we like to use corner-case-revealing, bug-finding distributions. But to do that, we need to know whether a corner case is impossible or just fairly improbable.

So where will this go? OK, so PI is not going to solve all our verification problems overnight. On the other hand, there is quite a bit of work going on in PI and similar areas (e.g. see the proceedings of a related workshop). So my (very vague) intuition is that it will eventually be very helpful, but will continue to require manual work. Perhaps we’ll end up with some kind of interactive “PI studio”, where the user does things like:

  • Choose among a bunch of candidate programs
  • Modify the “allowed terms” and see what new program comes out
  • Let the system learn “which vocabulary works for which cases”
  • Experiment to see which programs / program styles are more stable to data changes
  • And so on

Overall, PI looks like one possible solution to a very real problem.

Notes

I’d like to thank Subramanian Ramamoorthy, Yaron Kashai, Sandeep Desai, Amiram Yehudai and Gil Amid for commenting on earlier drafts of this post.

In other news, The Register came out with a fairly comprehensive article about verifying AVs, quoting me and others.
