Summary: This post tries to explain (once and for all?) how the concept of coverage is used to optimize the verification process, what it means to auto-maximize coverage, and how people have tried to do it.
I have been spending some time lately on coverage maximization via ML (which I described here). As is often the case when one tries to automate something previously done “manually” by experts, I found myself going back to basics – questioning basic assumptions and goals, and discussing them with others.
So I thought I’d write up my thoughts on all this:
- What is verification and what can we expect from it
- Rational usage of the verification resources, and how that relates to coverage
- Hard-to-reach coverage, and other coverage notes
- And finally: What is coverage maximization and how to do it
It is a long post. If you are already familiar with CDV, much of this may be obvious (but I hope to be slightly provocative to keep you awake).
Here we go.
What is verification
I have defined verification before as follows:
By “verification” I mean here roughly “the process of finding bugs (or unintended consequences) in engineered systems before they annoy a customer / destroy a project / kill somebody”. This includes both formal verification and dynamic verification (i.e. actually running a test).
but for this post, I’ll use a different definition, which emphasizes efficiency:
Verification is the process of finding and understanding bugs (or demonstrating relative cleanliness), as prioritized by (user-specified) current areas of interest. The job of a verification framework is to make that process efficient.
Obviously, we verify our DUT in the context of the environment(s) in which it should operate, using whatever specs / requirements we have.
This post is about verifying DUTs of all sizes, from small subsystems to big, complex systems. Here are some examples of really-hard-to-verify complex systems, embedded in complex environments:
- Verifying an autonomous vehicle in a complex topology of roads / junctions, with other cars, people etc.
- Verifying an autonomous assistive robot in a complex world made of users, other people, objects, corridors, roads, cars etc.
- Verifying a multi-server database system in a complex network of clients, servers, storage devices etc.
- Verifying a router in a complex topology of communication devices and hosts
- Verifying (HW) cache logic in a complex multi-core fabric with multiple kinds of masters and slaves, each with their own caches
I’ll consider here all aspects of verification: Verifying functionality, safety, performance, security and so on. Also, for this post, I’ll treat validation as part of verification.
Some assumptions about verification
Before getting into more details about how to verify, here are my assumptions:
Verification is never complete: No realistic system is ever bug-free. This should not come as a surprise to you, unless you believe that formal verification can already / soon find all bugs in a big system, in which case read this.
Verification is a resource-bound engineering process: Since it is never done, we can always do more of it. Given limited resources, much of the art of verification is about efficiency – how to get more bug-finding out of the same resources.
Note that this is not a post about how much we should verify: clearly, we will allocate more resources for verifying the braking control system of our AV than for verifying the windshield wipers. This post is only about how to make best use of those resources.
Verification is influenced by current priorities: Our current verification goals depend on the project, its current state and our role in it. Depending on all that, our “current areas of interest” could be:
- Bugs in the subsystem / sub-functionality under our responsibility
- Only security bugs / only safety bugs / only performance bugs
- Bugs in areas close to changes from the previous release
- Bugs in areas far from already-discovered-but-not-yet-fixed bugs
- Bugs which prevent some urgent next phase in development (e.g. boot bugs)
- Bugs in the main intended use-cases of the system
- Bugs in areas where, if found, this could force a re-design
- Some combination of the above
Verification needs randomized testing: In some (very limited) cases, formal verification can do the job. And in some other cases, our verification needs are low enough that we can live with just some “directed” tests. But most complex-systems verification projects need to rely on an automated dynamic-verification methodology (i.e. performing many runs on various execution platforms). The cheapest way to add variability to such runs is to use randomness.
There are various ways to do that. One of the most successful is Coverage Driven Verification (CDV) – see figure 1.
Most of this post is about the coverage part of CDV, but CDV really has lots of other pieces (which will not be further discussed for lack of space):
Perhaps the most important is constraint-driven generation: generating random instances of data structures – including sequence objects (parametrized threads) – while obeying (structural and value) constraints. This is a step up from just calling random() and inserting the result in the test. There is an important variant of this called use-case generation (mentioned later).
CDV also involves various techniques for automatic checking, sophisticated reuse methodologies, and so on.
Rational usage of verification resources
Let’s start with a simple example (kindly skip if this sounds too trivial):
Suppose we are asked to verify an arithmetic unit, which takes two 32-bit operands and a 2-bit operation code (for add, subtract, multiply and divide), and produces a 32-bit result. Assume we’ll use dynamic verification (formal will probably work here too, but we’ll ignore that – this is just a small example to get us started).
The unit has 32+32+2 = 66 bits of input. Exhaustively simulating and checking all 2^66 combinations (about 7.4 * 10^19) would take many lifetimes, so we need to be smarter about it. Say we are given a budget of just 1M simulated runs (of one operation each). How should we spend it?
Well, we could randomize the operands using a “flat” random distribution, or we could use the “expected distribution” of those operands in actual usage. But both are really bad ideas for finding bugs.
What we should use is a “bug-finding” distribution. Let me explain:
One metaphor is that the design is a big box full of sand, and bugs are (big or small) stones, distributed unevenly in it. Verification is about efficiently finding and removing “enough” of those stones. We find the stones by shooting arrows through the box (each arrow corresponds to a test run), hoping to hear a “ding” when it hits a stone (bug).
Just shooting arrows randomly through the box is inefficient – it is a big box and we have a limited number of arrows. Ideally, we would like to direct the arrows to where the bugs are, but we don’t know where the bugs are (or we would have found them already). But we have hunches about them (based on our intuition, the spec, the implementation and so on).
For instance, say we suspect that two small corners of the box are particularly buggy. We should then perhaps spend 30% of our arrows on each of them, and the remaining 40% on everything else (because there may be bugs there too).
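To make that 30/30/40 split concrete, here is a minimal Python sketch of drawing runs from such a “bug-finding” distribution (the region names and weights are, of course, made up):

```python
import random
from collections import Counter

# Hypothetical "bug-finding" distribution: two suspect corners get 30% of
# the arrows (runs) each, and everything else gets the remaining 40%.
REGIONS = {
    "suspect_corner_1": 0.30,
    "suspect_corner_2": 0.30,
    "everything_else":  0.40,
}

def pick_region():
    names, weights = zip(*REGIONS.items())
    return random.choices(names, weights=weights, k=1)[0]

# Allocating a 1M-run budget this way:
print(Counter(pick_region() for _ in range(1_000_000)))
```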
So, back to our arithmetic unit, we start by enumerating some “coverage buckets” we should allocate some runs to:
- For each of the operands, try zero, MAX_INT, MIN_INT, small negative, big negative, small positive, big positive, a random 32-bit number, etc.
- Test all operation codes on all operand groups
- Try to achieve the same for the output. For instance, for subtraction, make both inputs equal, to achieve zero
We would do that without knowing anything about the implementation, because we have an intuition that these buckets could behave differently, and we want to check all potentially-different buckets. And obviously, if we used just random 32-bit numbers, we would probably never get zero or MAX_INT in 1M samples, and perhaps only very few samples in the “small negative” bucket.
Note that the bucket “small negative” should not be represented by just one number: If we run 1K runs with a “small negative” operand, each run should randomize a fresh small negative number.
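As a sketch (in Python, with arbitrary bucket boundaries that are my assumption, not part of any spec), each bucket can be a small generator, so every run that lands in a bucket draws a fresh value from it:

```python
import random

MAX_INT = 2**31 - 1   # 32-bit two's-complement bounds
MIN_INT = -2**31

# Each bucket returns a *fresh* random value on every call, so 1K
# "small negative" runs give 1K different small negative operands.
OPERAND_BUCKETS = {
    "zero":           lambda: 0,
    "max_int":        lambda: MAX_INT,
    "min_int":        lambda: MIN_INT,
    "small_negative": lambda: random.randint(-100, -1),
    "big_negative":   lambda: random.randint(MIN_INT, MIN_INT + 100),
    "small_positive": lambda: random.randint(1, 100),
    "big_positive":   lambda: random.randint(MAX_INT - 100, MAX_INT),
    "random_32bit":   lambda: random.randint(MIN_INT, MAX_INT),
}

def gen_operand():
    bucket = random.choice(list(OPERAND_BUCKETS))
    return bucket, OPERAND_BUCKETS[bucket]()
```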
Suppose we do know something about the implementation. For instance, say we know that when both operands are smaller than 64, multiplication is done via (fast) lookup, else it is done via (slow) serial multiplication. In that case, we would add the following buckets (for multiplication only):
- Both operands smaller than 64; one or both at least 64
- One operand is exactly 63 or 64 or 65 (to find off-by-one errors)
- And so on
People who are expert at verifying arithmetic units will probably come up with more interesting buckets.
Using a verification language (like e) we can easily define this kind of coverage (called functional coverage). Technically, we define an event to be covered, and (optionally) a bunch of attributes to be covered (sampled) when the event happens. In our case these could be the two operands and the operation, sampled upon a suitably-defined start_arithmetic_operation event. To cover a specific “temporal sequence of events”, use a suitably-defined “end-of-sequence” event as the coverage event.
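In e this takes a few lines of declarative syntax; here is a language-neutral Python sketch of the same idea – sample a set of attributes whenever the coverage event fires, and count hits per bucket and per cross (the event and attribute names are from the example above):

```python
from collections import defaultdict

class CoverGroup:
    """Toy functional coverage: count hits per attribute bucket and per
    cross, sampled whenever the coverage event fires."""
    def __init__(self, crosses=()):
        self.crosses = list(crosses)      # e.g. [("op", "a", "b")]
        self.hits = defaultdict(int)

    def sample(self, **attrs):
        # called upon the (suitably defined) coverage event
        for name, bucket in attrs.items():
            self.hits[(name, bucket)] += 1
        for cross in self.crosses:
            self.hits[(cross, tuple(attrs[n] for n in cross))] += 1

# Upon each (hypothetical) start_arithmetic_operation event:
cg = CoverGroup(crosses=[("op", "a", "b")])
cg.sample(op="mul", a="small_negative", b="zero")
```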
CDV tools usually have facilities to read and merge the coverage created by many runs (as well as “formal coverage”), and project the result on a “verification plan” document, so we see how much each verification plan section (aggregation of coverage) was exercised. Thus, coverage is used not just to direct test generation but also to track verification progress.
Verification languages also have features which help fill the buckets (in e we would write some “soft constraints”), but only for input coverage (i.e. values which are directly generated by the VE).
We can define all of this in 20 to 30 lines of code, then execute 1M runs and check that we are happy with the resulting coverage.
Unfortunately, in most cases filling the coverage is a lot harder:
The arithmetic unit coverage was simple enough that we could easily force not only its inputs but also its outputs. Here are much harder cases:
- Forcing the output of a complex computation unit, for which we do not have a simple reverse function
- Forcing some state machine deep inside the unit to reach each of its states, when we do not have direct access to the state machine’s inputs (e.g. they are the outputs of some other units)
- Forcing some internal buffer to be full, where the depth of the buffer depends on the rate of incoming packets and the rate of outgoing packets, both of which depend on lots of other internal variables.
- Combinations of the above, such as forcing the buffer to be full/not full while the state machine is in each of its (say) 5 states
Note that coverage should be “a reflection of our fears”. So, we certainly want to see that we have reached each state of the state machine, and that we have reached both full and non-full buffers. However, if we suspect there is some interaction between the state and the fullness of the buffer, we may want to request a “cross” of the two (5 states * full / not full = 10 buckets).
We may have heuristics to fill the buffer (“speed up incoming traffic while holding up outgoing traffic”), but they are inexact and do not always work; similar heuristics may exist for reaching a given state. Achieving all 10 buckets is a matter of luck: with enough random simulation, we may get to all of them. But perhaps a few of the 10 are really hard to achieve (e.g. because whatever causes buffer-full also makes some states unlikely). In that case, we may need to do a lot of work to get there.
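To see why such a cross is “a matter of luck”, here is a toy Python model (all numbers invented) in which the full-buffer condition pins the state machine to one state, so some of the 10 buckets are almost never hit by plain random runs:

```python
import random

STATES = ["S0", "S1", "S2", "S3", "S4"]                   # hypothetical FSM
needed = {(s, f) for s in STATES for f in (True, False)}  # the 10 buckets

hit = set()
for run in range(100_000):
    # Stand-ins for a real simulation: we can only observe these conditions,
    # not force them. The correlation below mirrors the "whatever causes
    # buffer-full also makes some states unlikely" problem.
    buffer_full = random.random() < 0.01
    if buffer_full and random.random() < 0.999:
        state = "S0"
    else:
        state = random.choice(STATES)
    hit.add((state, buffer_full))

print(f"covered {len(hit)}/10 buckets; missing: {sorted(needed - hit)}")
```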
When verifying complex chips, verification engineers routinely define crosses of several attributes. Defining these crosses is fairly easy, since verification languages contain special syntax to define coverage (sketched in code right after this list), including:
- When to sample (i.e. upon which event)
- What to sample (i.e. which attributes and their crosses)
- How to bucket those attributes
- Which values / crosses will never appear
- How to compute the corresponding grade (relative weight, how much to repeat)
- And so on
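There is no single standard for that syntax; as a rough Python-flavored sketch, such a declaration might carry the following information (the field names and the grading convention are my assumptions, loosely modeled on functional-coverage constructs in e and SystemVerilog):

```python
# Hypothetical declarative cover spec mirroring the list above.
COVER_SPEC = {
    "event": "start_arithmetic_operation",        # when to sample
    "items": {                                    # what to sample + buckets
        "op": ["add", "sub", "mul", "div"],
        "a":  ["zero", "max_int", "small_negative", "other"],
    },
    "crosses": [("op", "a")],
    "illegal": [("div", "zero")],                 # should never appear
    "at_least": 10,                               # hits needed for full grade
    "weights": {"op": 1, "a": 2, ("op", "a"): 4}, # relative grading weights
}

def bucket_grade(hits, at_least):
    # a common grading convention: credit saturates at `at_least` hits
    return min(hits, at_least) / at_least
```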
However, filling these big crosses can be very time-consuming. Obviously, we can’t directly force the conditions we want: We can only try to reach them through the thick gloves of the allowed inputs.
So, automatic coverage maximization (see below) would really help.
Other kinds of coverage
Functional coverage is just one kind of coverage. Here is the bigger picture (taken from the above-mentioned post):
- Normal, bucket-based coverage is the kind of coverage for which it is meaningful to say “I have reached 100%”. This is what most people call coverage, and it comes in two main variants, as explained here:
  - Implementation coverage (e.g. code coverage), i.e. which source constructs (e.g. source lines) have been exercised so far
  - Functional coverage, i.e. which user-defined DVE-related attributes (and crosses of attributes) have been reached so far
- Pseudo-coverage is the kind of coverage for which “more is better” but does not sum up to 100%. Some examples:
  - Novelty coverage, which measures how far (using some metric) each new “sample” is from all previous ones (see the sketch after this list)
  - Criticality coverage, which measures how close we are to some “danger state”. I mentioned some companies maximizing criticality coverage in the last chapter of this post.
  - Known-bug-proximity coverage, which measures how close we are to triggering an already-found bug
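Here is a minimal sketch of novelty coverage over numeric attribute vectors, using nearest-neighbor Euclidean distance (just one of many possible metrics):

```python
import math

def novelty(sample, previous):
    """Novelty coverage sketch: distance from `sample` to the nearest
    previously-seen sample (Euclidean over numeric attribute vectors)."""
    if not previous:
        return float("inf")            # the first sample is maximally novel
    return min(math.dist(sample, p) for p in previous)

seen = []
for s in [(0, 0), (0, 1), (10, 10), (0, 0.5)]:
    print(s, "novelty:", novelty(s, seen))
    seen.append(s)
```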
We probably need all these kinds for “good” verification. Implementation / code / condition coverage is easy to extract, and is pretty useful: It is an excellent indication of the “buckets of implementation”. However, even reaching 100% is a fairly weak indication of quality: We can get 100% even if someone forgot to implement half the functionality.
This post is about all these kinds of coverage (even though the preceding chapters mostly explained functional coverage).
Even more about coverage
Once is not enough: Filling a coverage bucket just once will usually not uncover all bugs “sitting there”. Each coverage bucket marks a general area to “dig around” for bugs. Thus, we should re-visit each cover bucket multiple times (in different, random ways) to have a reasonable chance of catching the bugs hiding “in that neighborhood”.
BTW, even repeating each coverage bucket many times is not guaranteed to catch all bugs – we may have forgotten to specify some coverage. Also, in general dynamic verification does not come with any absolute guarantees – a sad but unavoidable fact.
Scenario-activation coverage: There are often more bugs in “stress” modes, i.e. when the DUT is held in some “extreme” state for a while. For cache fabrics, some examples are: Lots of end-to-end dirty writes, lots of snoops, and (even better) both at the same time. For AVs, one example is dense traffic during a snowstorm.
Often, we define a (parametrized) scenario corresponding to a stress condition, and run it. It may be easier to just cover the activation of this scenario (and its various parameter values) than to cover the actual stress phenomena it is supposed to create (assuming we are sure that the phenomena indeed get created).
Super-crosses: Suppose that we have already defined N cover buckets, for some “reasonable” N (which we assume we’ll be able to cover). Suppose also that we have S stress scenarios, and G global modes for the whole DUT.
It is tempting to request the “super-cross” of N * S * G combinations (minus the illegal combinations). But this number is clearly too big to fully cover, and in any case some random subset of it will presumably be covered (assuming we try to cover the N buckets multiple times, while holding everything else random).
However, sometimes we do want to cover all N buckets with some subset of these S * G combinations. For instance, we may want to make sure that our AV ran through all (possible) buckets also in dense traffic during a snowstorm.
Sometimes the DUT or the verification environment comes in several variants, in which case we may again want to super-cross those variants with any super-cross defined so far. Also, DUTs are often verified using multiple execution platforms (e.g. simulation, actual HW etc.), and we normally want to super-cross the platforms with all the (applicable) coverage.
Verification frameworks often do not have a clean way to specify super-crosses, but they do have a way to merge coverage (e.g. from multiple configurations and execution platforms) and display it together.
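A sketch of that merge-and-cross step: given per-platform coverage counts, we can both merge them and keep a platform-times-bucket “super-cross” view (all naming here is invented):

```python
from collections import Counter

def merge_with_platform(per_platform_coverage):
    """Merge per-platform coverage databases, keeping both a plain merged
    view and a (platform, bucket) cross view."""
    merged = Counter()
    for platform, hits in per_platform_coverage.items():
        for bucket, count in hits.items():
            merged[(platform, bucket)] += count   # platform x bucket cross
            merged[bucket] += count               # plain merged view
    return merged

merged = merge_with_platform({
    "simulation": {("op", "mul"): 500},
    "actual_hw":  {("op", "mul"): 12},
})
```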
Use-case-based notations: In addition to “straight” CDV, there are tools which let us specify (parametrized) use cases and then use constrained-random generation within those use cases. The assumption is that the DUT will be used only (or mainly) according to those use cases, so verification should concentrate on them.
In our box-with-sand metaphor, this corresponds to specifying cylinders of various sizes within the box, and shooting arrows only inside them.
Use-case tools are also goal-directed, and quite different from straight CDV (full disclosure – I helped start one of those tools). But for this post I’ll lump them together with CDV.
All-pairs testing: All-pairs testing is a simplification of CDV, where we essentially say “we don’t have time to think about which crosses are important, so let’s just cross every two input attributes”. This is usually done without sophisticated constraint technology.
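A minimal sketch of the idea (a naive greedy variant; real all-pairs tools are more sophisticated):

```python
import random
from itertools import combinations

def all_pairs(params):
    """Greedy all-pairs sketch: every value pair of every two attributes
    appears in at least one test, without enumerating the full cross."""
    names = list(params)
    uncovered = {frozenset({(n1, v1), (n2, v2)})
                 for n1, n2 in combinations(names, 2)
                 for v1 in params[n1] for v2 in params[n2]}
    tests = []
    while uncovered:
        # Seed the test with one uncovered pair, fill the rest randomly;
        # this guarantees each new test covers at least one new pair.
        (n1, v1), (n2, v2) = next(iter(uncovered))
        test = {n: random.choice(params[n]) for n in names}
        test[n1], test[n2] = v1, v2
        tests.append(test)
        for a, b in combinations(names, 2):
            uncovered.discard(frozenset({(a, test[a]), (b, test[b])}))
    return tests

tests = all_pairs({"op": ["add", "sub", "mul", "div"],
                   "a": ["zero", "neg", "pos"],
                   "b": ["zero", "neg", "pos"]})
print(len(tests), "tests instead of the full", 4 * 3 * 3, "cross")
```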
BTW, test selection techniques are also a simplification of CDV: we assume we cannot generate new tests, and just try to select an optimal subset from an existing test corpus, according to various coverage criteria.
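Test selection is essentially greedy set cover over the coverage each existing test is known to hit; a sketch with an invented three-test corpus:

```python
def select_tests(corpus):
    """Greedy test selection sketch: repeatedly pick the test adding the
    most not-yet-covered buckets, until no test adds anything new."""
    selected, covered = [], set()
    while True:
        best = max(corpus, key=lambda t: len(corpus[t] - covered))
        gain = corpus[best] - covered
        if not gain:
            break
        selected.append(best)
        covered |= gain
    return selected

corpus = {"t1": {"b1", "b2"}, "t2": {"b2", "b3"}, "t3": {"b3"}}
print(select_tests(corpus))   # ['t1', 't2'] covers all three buckets
```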
Fuzzing: Fuzzing is a technique for finding security bugs (vulnerabilities) which is pretty similar to CDV. Much of what I describe here also relates to fuzzing.
Coverage maximization
Now that we know all about coverage, we can talk about coverage maximization.
Plotting the coverage grade over time (without maximization) typically looks like the gray graph in figure 2:
Coverage grows as we run more tests, then it plateaus, then we (manually) write new scenarios targeted at reaching places which were not reached before, and so on. Sometimes, after finding a bug, we realize we need to define more coverage buckets, so the coverage grade suddenly drops (until we fill those too).
The gray line represents what we can do with state-of-the-art constraint-based (and use-case-based) generation. But can we do even better?
The idea of coverage maximization is to somehow automatically boost the coverage grade (e.g. as shown by the orange line). This is clearly an excellent idea: it will let us save a lot of resources (or reach much higher confidence with the same resources).
So how can we do that? Many techniques have been tried (usually with very limited success). Here are some:
Machine Learning: ML is the hot new technology, and hopefully we can “learn” how the DVE behaves and how to “push” it to reach more coverage faster. This is covered in this post (which also has some references).
People have tried Genetic Algorithms for similar things, but I think there is vague-but-growing consensus that usually ML can do what GAs do, only better.
Model Based Testing: MBT means guiding generation using some (usually formal) model. For instance, if (in addition to the DUT itself) we get a formal “reference model” of its state machines, we can sometimes use model-checking-like techniques to produce a “witness” which takes us to a desired state, and use it to guide simulation.
BTW, while I think formal verification is more limited than many people assume, I think using formal techniques to guide dynamic verification is an under-appreciated field.
Note that tools supporting use-case-based notations often use a variant of MBT: The notation sometimes supports specifying not just the “actions” but also their “effect”, in which case we can request a specific effect and have the tool “plan back” the required sequence of actions.
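As a toy illustration of the witness idea (the reference model here is hypothetical, and real MBT tools work on far richer models):

```python
from collections import deque

# state -> {input: next_state} : a hypothetical reference model of the DUT
MODEL = {
    "IDLE":  {"req": "BUSY"},
    "BUSY":  {"done": "IDLE", "err": "ERROR"},
    "ERROR": {"reset": "IDLE"},
}

def witness(start, target):
    """Breadth-first search over the reference model; returns the shortest
    input sequence driving it from `start` to `target`, or None."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, inputs = queue.popleft()
        if state == target:
            return inputs
        for inp, nxt in MODEL[state].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, inputs + [inp]))
    return None

print(witness("IDLE", "ERROR"))   # ['req', 'err'] -> guide the simulation
```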
Concolic generation: This technique (from “concrete” + “symbolic”) is used (e.g. in the Sage fuzzer, mentioned here) to maximize code coverage. As the name implies, it first performs some random runs, CDV-style. Then, when it wants to execute the same run, but go to the “else” part of some “if”, it traces back the assignments along the way and produces a few more constraints, which (when added to the original ones) should change the inputs just enough to toggle the condition of that “if”.
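A toy sketch of the trace-back-and-flip step (it assumes the z3-solver Python package; the “program” and its symbolic twin are invented for illustration, and real concolic engines derive the symbolic trace automatically):

```python
from z3 import Int, Solver, Not, sat

X = Int("x")                          # symbolic counterpart of input x

def program(x, path):
    cond = X > 100                    # symbolic twin of the concrete branch
    if x > 100:
        path.append(cond)
        return "then-branch"
    path.append(Not(cond))
    return "else-branch"

path = []
print(program(5, path))               # concrete CDV-style run: else-branch

# To toggle that "if": keep the path prefix, negate the last condition,
# and ask the solver for an input satisfying the new path.
s = Solver()
for c in path[:-1]:
    s.add(c)
s.add(Not(path[-1]))
assert s.check() == sat
new_x = s.model()[X].as_long()
print(program(new_x, []))             # the new input takes the then-branch
```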
Tree-of-runs: This technique (which can be used in conjunction with most of the other ones) works as follows: Suppose we have a run which, at time T, gets “close” to some rare, interesting coverage. What we can do now is somehow save the simulation at time T, and start a whole tree of continuations from that point.
There are various ways to do that: We can fork the original process and then reseed each child to randomize differently. Or we can restart the original run, execute until time T and then reseed.
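The restart variant is easy to sketch in Python (the “simulation” here is just a stream of random numbers):

```python
import random

def run(prefix_seed, cont_seed, t_split, t_end):
    """Tree-of-runs sketch, restart variant: replay the same random prefix
    up to time T (t_split), then reseed and continue differently."""
    rng = random.Random(prefix_seed)
    trace = [rng.randint(0, 9) for _ in range(t_split)]     # shared prefix
    rng = random.Random(cont_seed)                          # reseed at T
    trace += [rng.randint(0, 9) for _ in range(t_end - t_split)]
    return trace

# A small tree of continuations from the same interesting point at T=5:
for child_seed in range(3):
    print(run(prefix_seed=42, cont_seed=child_seed, t_split=5, t_end=10))
```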
The Qtronic tool (mentioned in the last chapter of this post) maximizes “criticality coverage” using a similar trick. And the AFL fuzzer (mentioned here) does an interesting variant of this, maximizing a kind of branch coverage (though it produces the full input stream before the actual run).
Consider yourself warned
Maximization is a really-worthwhile goal. However, before you rush out to prototype your own maximizer, please beware:
- Many have tried (e.g. see the references at the end of this post), but few tools made it into widespread, practical usage
- It is really hard to create a general-purpose maximization technique, because the DUT could be anything. Suppose (somewhat absurdly) that the DUT is a chess-playing system, and you want to auto-cover the state where the verification environment beats it. Then your (e.g. ML-based) maximizer has to learn how to beat the DUT in chess.
- Some people don’t bother to write much functional coverage. This may sound like an unforgivable sin, but sometimes we just don’t have time, or we restrict ourselves mainly to “scenario coverage” as described above (and keep our fingers crossed). However, a maximizer needs some number to maximize, so in this case it will probably maximize other kinds of coverage (e.g. implementation coverage, which can be extracted automatically, like AFL does).
- When a maximized run fails, we need to be able to rerun it (for debug purposes), obviously without rerunning the whole multi-run session. Thus, the maximizer needs to write (for each run) some compact information allowing it to be re-run.
- Suppose some run reached an important coverage event. If the relevant coverage definition specified that multiple hits improve the grade, we could improve the grade by rerunning it many times with the same seed. But this is clearly useless. Maximization is prone to a similar (but less obvious) pitfall: we need to make sure it does not get “stuck in one groove”, reaching related coverage in the same way every time.
To summarize:
- Coverage (of various kinds) is indispensable for dynamic verification (which is most of verification)
- Coverage maximization is not an easy job, but is well worth doing
I’d like to thank Amiram Yehudai, Gil Amid, Ziv Binyami, Sandeep Desai, Thomas (Blake) French and Kerstin Eder for commenting on earlier versions of this post.