Dynamic verification in one picture

Summary: This post tries to summarize what dynamic verification is, using a single picture. It then puts various verification tools, and diverse verification projects, in the context of that picture. It also explains Coverage Driven Verification (CDV).

The Foretellix blog is about verifying complex systems. However, as I discussed here, there is no agreed-upon verification terminology across the various verification “tribes” (see also the terminology page). Even the term “verification” is disputed (e.g. formal verification people assume it only refers to what they do).

I use the term “verification” to mean roughly “the process of finding bugs (or unintended consequences) in engineered systems before they annoy a customer / destroy a project / kill somebody”. Slightly more formally, it is the process used to gain confidence in the correctness of a system vis-à-vis its (potentially-informal) specification.

This includes both formal verification and dynamic verification (i.e. actually running tests). I also consider validation as a kind of verification (yes, I know this is blasphemy – but see this post).

My last post was about the 2017 Stuttgart symposium on Autonomous Vehicle test and design. Next I was going to describe what I saw there regarding verification frameworks, simulators and scenario libraries, but I realized I first need to explain what those are, and how they fit into a bigger picture.

So I set out to create a picture of a generic verification framework, which would be good for verifying:

AVs via simulation
AVs via test tracks
HW chips via simulation
HW blocks via bounded model checking
Machine Learning (ML) systems via special ML techniques
and much more

The result is this post (which I expect to reference from subsequent posts). I’ll use “AVs verified via simulation” as my running example, but I’ll demonstrate other uses in the “examples” chapter.

Note that the picture below applies mainly to dynamic verification: Generating test instances, running them on some execution platform (simulation, actual system etc.), and monitoring the results to produce coverage data and error messages. It also applies to bounded model checking (but not to the rest of formal verification). Thus, I claim that it describes how most of “practical verification” gets done (see also this post about formal verification). I’ll usually skip the word “dynamic” from now on.

Finally, many of my readers are verification experts, and there is nothing really new in this post. I expect you guys to just scan through it and murmur under your mustaches “Yeah, this looks about right. And if he chooses to use that terminology, I guess it’s a free country”.

OK, here we go:

The picture

Here is that generic (dynamic) verification framework:

dynamic_one_picture

Fig. 1 is intentionally abstract, and thus fits many examples, as I’ll describe in the next chapter. The generic verification framework consists of the following components (note that elliptical balloons correspond to data flowing between components):

The running configuration is the thing that actually runs the test. There may be multiple running configurations, corresponding to multiple execution platforms (simulation, actual test track and so on). Also, there may be sub-configurations within an execution platform – e.g. see this post. Also, some configurations are repeatable and some are not.

The running configuration consists of the Device Under Test (DUT), and everything around it, which we call the Verification Environment (VE).

The VE contains a generator, whose job is to send inputs into the DUT, and a monitor, whose job is to observe the DUT, look for errors and collect coverage. The generator and the monitor usually share some model of the environment. For instance, when the DUT is an AV, the environment would consist of roads, other vehicles etc..

Sometimes there is also a scenario library which describes a bunch of interesting scenarios and their parameters. The scenarios restrict / direct the generator, and may also contain scenario-specific checking and coverage information.

Note that I had to shave a few corners to fit this abstraction: I did not tell you that the DUT is often thought to be embedded in the environment model. And that in CDV, the word “monitor” has a more restricted meaning. And that the checking done by the monitor often uses assertions residing in the environment model, in the monitor itself, in the DUT, or in the scenario library. And so on.

I should really try this line of defense in court one day: “Your honor, I was not lying: I was abstracting.”.

The inputs to the generator are some knob values (an intentionally-vague term referring to various things which influence the generator – at the minimum the name of the current test and the value of the initial random seed). The monitor’s output consists of reached coverage points (see here for a discussion of the various kinds of coverage, like code coverage and functional coverage), as well as (suspected) errors.

This whole thing is usually (but not always) constructed according to some spec / requirement documents, which describe what the DUT should and should not do.

The outputs from many runs go into a multi-run aggregator, a management tool which maintains various databases (of runs, configurations, coverage data, error messages, versions, sessions and so on). It lets you track progress (e.g. against test / verification plans), and do coverage analysis (e.g. see where the coverage “holes” are, see the coverage data projected onto a verification plan). It also helps you cluster failing runs into “failure clusters”, or more generally areas of concern. You can then debug a representative run from each such cluster using various debug tools (e.g. a reverse debugger).

There is usually some (manual or automatic) “maximization” going on, i.e. a process which tries to accelerate the reaching of coverage points and errors by “tweaking the knobs”. Maximization, in a sense, reverses the normal execution flow: It tries to guess the right input sequence leading to uncovered / lightly covered areas.

Other tools: Fig. 1 is necessarily simplified, and does not contain everything (though one could say everything does appear, inside the “other tools” box).

While I think I included the main, “hard-core” bug-finding pieces, others may disagree (“You are calling me miscellaneous?”). And these other tools can be pretty important. Also, some verification activities (e.g. code inspection) don’t need any special tools. And I did not even mention all the related methodologies.

Anyway, here is an incomplete list of those other tools:

Static analysis tools, linters, fault tree analysis tools, …
Requirement tracking tools, safety-case tracking tools, …
Bug tracking tools, change management tools, ….
Tools for creating models, analyzing them, simulating them (not in the time domain), …
Tools for creating scenarios: GUI composition tools, tools for converting recorded data / accident data into scenarios, …
Formal verification tools

I talked a bit about requirement and modeling tools here. A full-fledged framework should probably contain many of these tools, or at least good interfaces to them.

Some examples

Here are some examples of how this flow works in various contexts:

HW verification via simulation: Suppose the DUT is a router chip which accepts Ethernet packets and routes them to various destinations (according to some mapping information, received in “mapping” packets). The environment consists of a description of “the other players” around the DUT (computers, other routers etc.), and of what packets and mappings look like.

This is typically verified (in simulation) using Coverage Driven Verification (CDV): CDV tools can create random instances from class-definitions-with-constraints, so the generator randomizes the configuration of “the other players”, the mapping information, the packet streams and so on, and subjects the DUT to all that. The monitor collects the packets coming out of the DUT, monitors internal states, does some checking (e.g. “All incoming packets should go out”), and collects coverage (including functional coverage, e.g. did we try all packet types, all size ranges, all collision kinds and so on).

Maximization is usually done “manually”: A lot of coverage is likely to be reached already via constrained randomization. Then a verification engineer observes that coverage and the corresponding tests, thinks hard about the kinds of scenarios / inputs which could fill holes in that coverage, writes new tests accordingly, and runs them several times (with different random seeds).

This is a slow process, done by very experienced and expensive people. Hence, there were many attempts to automate it (but no huge successes so far). I described here the idea of using ML for that: Essentially, training an ML-based system on the knobs-to-coverage path, so it will be able to reverse the flow, i.e. come up with knob values (in this case – random field values) which would (probabilistically) reach unfilled coverage points.

See this long post on CDV, functional coverage, and using coverage to optimize the verification process. It also talks about use-case-based notations, which let you use CDV within the space defined by use cases.

HW verification via bounded model checking: This is the one example which uses formal (not dynamic) verification. The DUT is usually a HW block (e.g. a piece of that router chip), written in some Hardware Description Language (HDL). The environment consists of a (smallish) wrapper often written in an HDL. The generator relies on assumptions on the input (written in some assertion language). The monitor uses the implementation (and perhaps some event definitions) for coverage, and some assertions (in that same assertion language) to flag errors.

Maximization here is built-in: This is what model checkers do: as long as everything is “model-checkable” and not too big, the model-checking solver computes the exact sequence of inputs needed to reach the coverage point (or declares it to be unreachable).

ML-system verification via special ML techniques, e.g. DeepXplore: I described the DeepXplore system here. The DUT is a system implemented as a Deep Neural Network. The coverage is neuron (implementation) coverage. The checking is “are the outputs consistent for multiple, similar DNN implementations”?

Here, too, maximization is built-in: Because everything here is made out of “differentiable” neural networks, there is an efficient linear-algebra process to compute the inputs needed to produce a specific output.

See more about verifying ML-based systems (and about using ML for verification).

Security verification via fuzzing: As I described in my maximization post and elsewhere, fuzzing (perhaps the main tool for finding security vulnerabilities) is fairly similar to CDV. Generation emphasizes not-quite-legal inputs, checking is more simplistic (mainly looking for exceptions and illegal memory accesses), and coverage is usually some kind of implementation coverage.

Fuzzers employ various maximization tricks: For instance (see the above post) AFL uses its own random fast-discovery algorithm, while Sage uses symbolic execution (also called concolic execution). Fuzzers can use some model of the environment (but AFL and Sage do not).

AV verification via simulation: Let’s assume the simulated DUT is the full AV (made of HW, SW, ML, physics etc.), and that we use something like CDV. The environment consists of roads, other cars, people and so on.

Assume further that we organize the “verification plan” by scenarios: There will be a long list of parameterized scenarios (like “overtake”, “handle a traffic circle” etc.), each with its own coverage and checks.

AV verification often uses multiple execution platforms. Even assuming just Software-in-the-Loop (SIL) simulations, there may be multiple configurations of models (and related simulators), representing various points on the fidelity-vs-performance curve. Also, some running configurations may include detailed simulation of the various sensors, while others may choose to skip that altogether (see this post on the options for AV sensor simulations, and how they relate to various execution platforms).

AV verification is further complicated because it has to look for both “expected” and “unexpected” bugs, because interactions between AVs and people are hard to verify, and because much of the checking is probabilistic.

Other kinds of testing: The chapter “Even more about coverage” of this post talks about how other kinds of testing (like all-pairs testing, test selection techniques, use-case-based notations and so on) fit into this framework. “Manual” tests also fit this framework, in a rudimentary way.

Elsewhere in my blog I talked about using this framework to verify autonomous robots, Intelligent Autonomous Systems in general, and other complex systems.

Notes

I hope I have been able to show what a dynamic verification framework looks like. One of my next posts will use that as background to discuss what I saw at the Stuttgart symposium. [Added 18-July-2017: Here it is]

I’d like to thank Kerstin Eder, Sandeep Desai, Gil Amid and Thomas (Blake) French for commenting on previous versions of this post.

	Coverage-driven alig… on It’s the spec bugs that kill y…
	https://otomotif71.w… on Stuttgart impressions: Scenari…
	Daan van der Keur on About “The coming AI hackers”…
	Mariah Jackson on M-SDL, the autonomous vehicles…
	sakhokhar on Machine Learning for Coverage…
	hongseoklee on How to write AV scenarios (and…
	Erik Panu on GPT-3 and verification
	Yoav Hollander on Autonomy markets and their pot…
	Nakkeeran Kumaraswam… on Autonomy markets and their pot…

The Foretellix CTO Blog – AI safety

Now focusing on AI safety (autonomy-related posts go to the company blog)