Verification, coverage and maximization: The big picture

Posted on December 23, 2016May 29, 2017 by Yoav Hollander

Summary: This post tries to explain (once and for all?) how the concept of coverage is used to optimize the verification process, what it means to auto-maximize coverage, and how people have tried to do it.

I have been spending some time lately on coverage maximization via ML (which I described here). As is often the case when one tries to automate something previously done “manually” by experts, I found myself going back to basics – questioning basic assumptions and goals, and discussing them with others.

So I thought I’d write up my thoughts on all this:

What is verification and what can we expect from it
Rational usage of the verification resources, and how that relates to coverage
Hard-to-reach coverage, and other coverage notes
And finally: What is coverage maximization and how to do it

It is a long post. If you are already familiar with CDV, much of this may be obvious (but I hope to be slightly provocative to keep you awake).

Here we go.

What is verification

I have defined verification before as follows:

By “verification” I mean here roughly “the process of finding bugs (or unintended consequences) in engineered systems before they annoy a customer / destroy a project / kill somebody”. This includes both formal verification and dynamic verification (i.e. actually running a test).

but for this post, I’ll use a different definition, which emphasizes efficiency:

Verification is the process of finding and understanding bugs (or demonstrating relative cleanliness), as prioritized by (user-specified) current areas of interest. The job of a verification framework is to make that process efficient.

Obviously, we verify our DUT in the context of the environment(s) in which it should operate, using whatever specs / requirements we have.

This post is about verifying DUTs of all sizes, from small subsystems to big, complex systems. Here are some examples of really-hard-to-verify complex systems, embedded in complex environments:

Verifying an autonomous vehicle in a complex topology of roads / junctions, with other cars, people etc.
Verifying an autonomous assistive robot in a complex world made of users, other people, objects, corridors, roads, cars etc.
Verifying a multi-server database system in a complex network of clients, servers, storage devices etc.
Verifying a router in a complex topology of communication devices and hosts
Verifying (HW) cache logic in a complex multi-core fabric with multiple kinds of masters and slaves, each with their own caches

I’ll consider here all aspects of verification: Verifying functionality, safety, performance, security and so on. Also, for this post, I’ll treat validation as part of verification.

Some assumptions about verification

Before getting into more details about how to verify, here are my assumptions:

Verification is never complete: No realistic system is ever bug-free. This should not come as a surprise to you, unless you believe that formal verification can already / soon find all bugs in a big system, in which case read this.

Verification is a resource-bound engineering process: Since it is never done, we can always do more of it. Given limited resources, much of the art of verification is about efficiency – how to get more bug-finding out of the same resources.

Note that this is not a post about how much we should verify: clearly, we will allocate more resources for verifying the braking control system of our AV than for verifying the windshield wipers. This post is only about how to make best use of those resources.

Verification is influenced by current priorities: Our current verification goals will depend on the project, its current state and our role in it. Depending on all that, your “current areas on interest” could be:

Bugs in the subsystem / sub-functionality under our responsibility
Only security bugs / only safety bugs / only performance bugs
Bugs in areas close to changes from the previous release
Bugs in areas far from already-discovered-but-not-yet-fixed bugs
Bugs which prevent some urgent next phase in development (e.g. boot bugs)
Bugs in the main intended use-cases of the system
Bugs in areas where, if found, this could force a re-design
Some combination of the above

Verification needs randomized testing: In some (very limited) cases, formal verification can do the job. And in some other cases, our verification needs are low enough that we can live with just some “directed” tests. But most complex-systems verification projects need to rely on an automated dynamic-verification methodology (i.e. performing many runs on various execution platforms). The cheapest way to add variability to such runs is to use randomness.

There are various ways to do that. One of the most successful is Coverage Driven Verification (CDV) – see figure 1:

cdv_picture

Most of this post is about the coverage part of CDV, but CDV really has lots of other pieces (which will not be further discussed for lack of space):

Perhaps the most important is constraint-driven generation: generating random instances of data structures – including sequence objects (parametrized threads) – while obeying (structural and value) constraints. This is a step up from just calling random() and inserting the result in the test. There is an important variant of this called use-case generation (mentioned later).

CDV also involves various techniques for automatic checking, sophisticated reuse methodologies, and so on.

Rational usage of verification resources

Let’s start with a simple example (kindly skip if this sounds too trivial):

Suppose we are asked to verify an arithmetic unit, which takes two 32-bit operands and a 2-bit operation code (for add, subtract, multiply and divide), and produces a 32-bit result. Assume we’ll use dynamic verification (formal will probably work here too, but we’ll ignore that – this is just a small example to get us started).

The unit has 32+32+2=66 bits of input. Exhaustively simulating and checking 2⁶⁶ops will take many lifetimes, so we need to do it smarter. Say we are given a budget of just 1M simulated runs (of one operation each). How should we do it?

Well, we could randomize the operands using a “flat” random distribution, or we could use the “expected distribution” of those operands in actual usage. But both are really bad ideas for finding bugs.

What we should use is a “bug-finding” distribution. Let me explain:

One metaphor is that the design is a big box full of sand, and bugs are (big or small) stones, distributed unevenly in it. Verification is about efficiently finding and removing “enough” of those stones. We find the stones by shooting arrows through the box (each arrow corresponds to a test run), hoping to hear a “ding” when it hits a stone (bug).

Just shooting arrows randomly through the box is inefficient – it is a big box and we have a limited number of arrows. Ideally, we would like to direct the arrows to where the bugs are, but we don’t know where the bugs are (or we would have found them already). But we have hunches about them (based on our intuition, the spec, the implementation and so on).

For instance, say we suspect that two small corners of the box are particularly buggy. We should then perhaps spend 30% of our arrows on each of them, and the remaining 40% on everything else (because there may be bugs there too).

So, back to our arithmetic unit, we start by enumerating some “coverage buckets” we should allocate some runs to:

For each of the operands, try zero, MAX_INT, MIN_INT, small negative, big negative, small positive, big positive, a random 32-bit number, etc.
Test all operation codes on all operand groups
Try to achieve the same for the output. For instance, for subtraction, make both inputs equal, to achieve zero

We would do that without knowing anything about the implementation, because we have an intuition that these buckets could behave differently, and we want to check all potentially-different buckets. And obviously, if we used just random 32-bit numbers, we would probably never get zero or MAX_INT in 1M samples, and perhaps only very few samples in the “small negative” bucket.

Note that the bucket “small negative” should not be represented by just one number: If we run 1K runs with a “small negative” operand, each run should randomize a fresh small negative number.

Suppose we do know something about the implementation. For instance, say we know that when both operands are smaller than 64, multiplication is done via (fast) lookup, else it is done via (slow) serial multiplication. In that case, we would add the following buckets (for multiplication only):

Both operands smaller than 64, one or both bigger than 64
One operand is exactly 63 or 64 or 65 (to find off-by-one errors)
And so on

People who are expert at verifying arithmetic units will probably come up with more interesting buckets.

Using a verification language (like e) we can easily define this kind of coverage (called functional coverage). Technically, we define an event to be covered, and (optionally) a bunch of attributes to be covered (sampled) when the event happens. In our case these could be the two operands and the operation, sampled upon a suitably-defined start_arithmetic_operation event. To cover a specific “temporal sequence of events”, use a suitably-defined “end-of-sequence” event as the coverage event.

CDV tools usually have facilities to read and merge the coverage created by many runs (as well as “formal coverage”), and project the result on a “verification plan” document, so we see how much each verification plan section (aggregation of coverage) was exercised. Thus, coverage is used not just to direct test generation but also to track verification progress.

Verification languages also have features which help fill the buckets (in e we would write some “soft constraints”), but only for input coverage (i.e. values which are directly generated by the VE).

We can do that in 20 to 30 lines of code, and then execute 1M runs and see that we are happy with the resulting coverage.

Unfortunately, in most cases filling the coverage is a lot harder:

Hard-to-reach coverage

The arithmetic unit coverage was simple enough that we could easily force not only its inputs but also its outputs. Here are much harder cases:

Forcing the output of a complex computation unit, for which we do not have a simple reverse function
Forcing some state machine deep inside the unit to reach each of its states, when we do not have direct access to the state machine’s inputs (e.g. they are the outputs of some other units)
Forcing some internal buffer to be full, where the depth of the buffer depends on the rate of incoming packets and the rate of outgoing packets, both of which depend on lots of other internal variables.
Combinations of the above, such as forcing the buffer to be full/not full while the state machine is in each of its (say) 5 states

Note that Coverage should be “a reflection of our fears”. So, we certainly want to see that we have reached each state of the state machine, and that we have reached both full and non-full buffers. However, if we suspect there is some interaction between the state and the fullness of the buffer, we may want to request a “cross” of the two (5 states * full / not full = 10 buckets).

We may have heuristics to fill the buffer (“speed up incoming traffic while holding up outgoing traffic”), but it is inexact and does not always work. Similarly, for getting to some state. Achieving all 10 buckets is a matter of luck: With enough random simulation, we may get to all of them. But perhaps a few of the 10 are really hard to achieve (e.g. because whatever causes buffer-full also makes some states unlikely). In that case, we may need to do a lot of work to get there.

When verifying complex chips, verification engineers routinely define crosses of several attributes. Defining these crosses is fairly easy, since verification languages contain special syntax to define coverage, including:

When to sample (i.e. upon which event)
What to sample (i.e. which attributes and their crosses)
How to bucket those attributes
Which values / crosses will never appear
How to compute the corresponding grade (relative weight, how much to repeat)
And so on

However, filling these big crosses can be very time-consuming. Obviously, we can’t directly force the conditions we want: We can only try to reach them through the thick gloves of the allowed inputs.

So, automatic coverage maximization (see below) would really help.

Other kinds of coverage

Functional coverage is just one kind of coverage. Here is the bigger picture (taken from the above-mentioned post):

Normal, bucket-based coverage is the kind of coverage for which it is meaningful to say “I have reached 100%”. This is what most people call coverage, and it comes in two main variants, as explained here:
- Implementation coverage (e.g. code coverage), i.e. which source constructs (e.g. source lines) have been exercised so far
- Functional coverage, i.e. which user-defined DVE-related attributes (and crosses of attributes) have been reached so far
Pseudo-coverage is the kind of coverage for which “more is better” but does not sum up to 100%. Some example:
- Novelty coverage, which measures how far (using some metric) each new “sample” is from all previous ones
- Criticality coverage, which measures how close we are to some “danger state”. I mentioned some companies maximizing criticality coverage in the last chapter of this post.
- Known bug proximity coverage, which measures how close we are to triggering an already-found bug

We probably need all these kinds for “good” verification. Implementation / code / condition coverage is easy to extract, and is pretty useful: It is an excellent indication of the “buckets of implementation”. However, even reaching 100% is a fairly weak indication of quality: We can get 100% even if someone forgot to implement half the functionality.

This post is about all these kinds of coverage (even though the preceding chapters mostly explained functional coverage).

Even more about coverage

Once is not enough: Filling a coverage bucket just once will usually not uncover all bugs “sitting there”. Each coverage bucket marks a general area to “dig around” for bugs. Thus, we should re-visit each cover bucket multiple times (in different, random ways) to have a reasonable chance of catching the bugs hiding “in that neighborhood”.

BTW, even repeating each coverage bucket many times is not guaranteed to catch all bugs – we may have forgotten to specify some coverage. Also, in general dynamic verification does not come with any absolute guarantees – a sad but unavoidable fact.

Scenario-activation coverage: There are often more bugs in “stress” modes, i.e. when the DUT is held in some “extreme” state for a while. For cache fabrics, some examples are: Lots of end-to-end dirty writes, lots of snoops, and (even better) both at the same time. For AVs, one example is dense traffic during a snowstorm.

Often, we shall define a (parametrized) scenario corresponding to a stress condition, and run it. It may be easier to just cover the activation of this scenario (and its various parameter values) than to cover the actual stress phenomena it is supposed to create (if we are sure that the phenomena indeed get created).

Super-crosses: Suppose that we have already defined N cover buckets, for some “reasonable” N (which we assume we’ll be able to cover). Suppose also that we have S stress scenarios, and G global modes for the whole DUT.

It is tempting to request the “super-cross” of N * S * G combinations (minus the illegal combinations). But this number is clearly too big to fully cover, and in any case some random subset of it will presumably be covered (assuming we try to cover the N buckets multiple times, while holding everything else random).

However, sometimes we do want to cover all N buckets with some subset of these S * G combinations. For instance, we may want to make sure that our AV ran though all (possible) buckets also in dense traffic during a snowstorm.

Sometimes the DUT or the verification environment come in several variants, in which case we may again want to super-cross those variants with any super-cross defined so far. Also, DUTs are often verified using multiple execution platforms (e.g. simulation, actual-HW etc.), and we normally want to super-cross the platforms with the all the (applicable) coverage.

Verification frameworks often do not have a clean way to specify super-crosses, but they do have a way to merge coverage (e.g. from multiple configurations and execution platforms) and display it together.

Use-case-based notations: In addition to “straight” CDV, there are tools which let we specify (parametrized) use cases and then use constrained-random generation within those use cases. The assumption is that the DUT will be used only (or mainly) according to those use cases, so verification should concentrate on them.

In our box-with-sand metaphor, this corresponds to specifying cylinders of various sizes within the box, and shooting arrows only inside them.

Use-case tools are also goal-directed, and quite different from straight CDV (full disclosure – I helped start one of those tools). But for this post I’ll lump them together with CDV.

All-pairs testing: All-pair testing is a simplification of CDV, where we essentially say “we don’t have time to think of which crosses are important, so let’s just cross every two input attributes”. This is usually done without sophisticated constraint technology.

BTW, Test selection techniques are also a simplification of CDV, where we assume we cannot generate new tests – we just try to select an optimal subset from an existing test corpus, according to various coverage criteria.

Fuzzing: Fuzzing is a technique for finding security bugs (vulnerabilities) which is pretty similar to CDV. Much of what I describe here also relates to fuzzing.

Coverage maximization

Now that we know all about coverage, we can talk about coverage maximization.

Plotting the coverage grade over time (without maximization) typically looks like the gray graph in figure 2:

cov_over_time

Coverage grows as we run more tests, then it plateaus, then we (manually) write new scenarios targeted at reaching places which were not reached before, and so on. Sometimes, after finding a bug, we realize we need to define more coverage buckets, so the coverage grade suddenly drops (until we fill them too). And so on.

The gray line represents what we can do with state-of-the-art constraint-based (and use-case-based) generation. But can we do even better?

The idea of coverage maximization is to somehow automatically boost the coverage grade (e.g. as shown by the orange line). This is clearly an excellent idea: it will let us save a lot of resources (or reach much higher confidence with the same resources).

So how can we do that? Many techniques have been tried (usually with very limited success). Here are some:

Machine Learning: ML is the hot new technology, and hopefully we can “learn” how the DVE behaves and how to “push” it to reach more coverage faster. This is covered in this post (which also has some references).

People have tried Genetic Algorithms for similar things, but I think there is vague-but-growing consensus that usually ML can do what GAs do, only better.

Model Based Testing: MBT means guiding generation using some (usually formal) model. For instance, if (in addition to the DUT itself) we get a formal “reference model” of its state machines, we can sometimes use model-checking-like techniques to achieve a “witness” which takes us to a desired state, and use it to guide simulation.

BTW, while I think formal verification is more limited than many people assume, I think formal techniques for guiding dynamic verification is an under-appreciated field.

Note that tools supporting use-case-based notations often use a variant of MBT: The notation sometimes supports specifying not just the “actions” but also their “effect”, in which case we can request a specific effect and have the tool “plan back” the required sequence of actions.

Concolic generation: This technique (from “concrete” + “symbolic”) is used (e.g. in the Sage fuzzer, mentioned here) to maximize code coverage. As the name implies, it first performs some random runs, CDV-style. Then, when it wants to execute the same run, but go to the “else” part of some “if”, it traces back the assignments along the way and produces a few more constraints, which (when added to the original ones) should change the inputs just enough to toggle the condition of that “if”.

Tree-of-runs: This technique (which can be used in conjunction with most of the other ones) works as follows: Suppose we have a run which, at time T, gets “close” to some rare, interesting coverage. What we can do now is somehow save the simulation at time T, and start a whole tree of continuations from that point.

There are various ways to do that: We can fork the original process and then reseed each child to randomize differently. Or we can restart the original run, execute until time T and then reseed.

The Qtronic tool (mentioned in the last chapter of this post) maximizes “criticality coverage” using a similar trick. And the AFL fuzzer (mentioned here) does an interesting variant of this, maximizing a kind of branch coverage (though it produces the full input stream before the actual run).

Consider yourself warned

Maximization is a really-worthwhile goal. However, before you rush out to prototype your own maximizer, please beware:

Many have tried (e.g. see the references at the end of this post), but few tools made it into widespread, practical usage
It is really hard to create a general-purpose maximization technique, because the DUT could be anything. Suppose (somewhat absurdly) that the DUT is a chess-playing system, and you want to auto-cover the state where the verification environment beats it. Then your (e.g. ML-based) maximizer has to learn how to beat the DUT in chess.
Some people don’t bother to write much functional coverage. This may sound like an unforgivable sin, but sometimes we just don’t have time, or we restrict ourselves mainly to “scenario coverage” as described above (and keep our fingers crossed). However, a maximizer needs some number to maximize, so in this case it will probably maximize other kinds of coverage (e.g. implementation coverage, which can be extracted automatically, like AFL does).
When a maximized run fails, we need to be able to rerun it (for debug purpose), obviously without rerunning the whole multi-run session. Thus, the maximizer needs to write (for each run) some compact information for re-running it.
Suppose some run reached an important coverage event. If the relevant coverage definition specified that multiple hits improve the grade, we could improve the grade by rerunning it many times with the same seed. But this is clearly useless. Maximization is prone to a similar (but less obvious) pitfall: We need to make sure that it does not get “stuck in one groove” to reach related coverage.

To summarize:

Coverage (of various kinds) is indispensable for dynamic verification (which is most of verification)
Coverage maximization is not an easy job, but is well worth doing

Notes

I’d like to thank Amiram Yehudai, Gil Amid, Ziv Binyami, Sandeep Desai, Thomas (Blake) French and Kerstin Eder for commenting on earlier versions of this post.

11 thoughts on “Verification, coverage and maximization: The big picture”

Tudor Timi says:

December 23, 2016 at 1:37 pm

I have some random thoughts on the subject. I’m not sure if this blog is an active forum, but it would be nice to converse with others.

The problem with coverage isn’t just what to sample, but when to sample. Defining the events when coverage gets sampled is just as important as deciding what values get piled into buckets. (As a side discussion, temporal coverage aspects are where modern HVLs are lacking. For example, it’s not so easy to specify that after doing A or B the design took ‘x’ cycles to respond with C or D. It’s doable, but with a lot of effort.)

Because sequentiality is important, this means that a coverage maximizer needs to know that it has to start transactions in certain sequences.and understand how these sequences affect the overall coverage. My question here is “at what granularity should the maximizer tweak the tests?”. Do we give it an existing test and say “you can fiddle with the constraints on items, but the sequential behavior is already encoded in the test”? Do we give it full control of starting transactions and let it run wild? The former is probably easier to implement, but would probably bring less. The latter is much more difficult to implement and train, but could potentially eliminate writing tests by hand altogether.

I would assume that a mixture of the two approaches from above is the best solution. Instead of making the maximizer start individual transactions one by one (and have it learn about sequantial behavior), we could give it pre-packaged sequences of transactions, representing higher operations the design can perform and have it start those. This is more akin to how verification is currently done today. (Indeed, remove occurrences of “maximizer” and replace them with “verification engineer” and instead of a discussion about automatic coverage maximization we could have a discussion about better test writing.). I would dare to call a solution along these lines “assisted coverage maximization” (trademark pending), because “automatic” has a very strong connotation of “plug and play” associated with it. Such a solution would also follow a better separation of concerns from the point of view of user/EDA vendor split. EDA vendors could provide the framework for measuring, reasoning and injecting, while users would “teach” the tool how to walk, tie its own shoes, do multi-variable calculus.

Loading...

Reply
1. Yoav Hollander says:
  
  December 23, 2016 at 4:20 pm
  
  Excellent comments. I’ll give some answers below, with the hope that others will join the discussion.
  
  You are absolutely correct that the question of “what is a maximizer allowed to do” is pretty important. If we assume the EDA world and UVM / eRM, then by default we are only “allowed” to tweak generation. Note BTW that I am not necessarily (or even mainly) targeting the EDA world in this blog.
  
  As you imply, to have “official permission” to play with timing, we probably need some new constructs. For instance, if there was a construct saying “wait here between X and Y cycles, but no later than event Z”, then our (ML-based, or human) maximizer could have created a separate piece of logic which could end this wait “in the middle”.
  
  I also liked your comments about the tradeoffs between tool capabilities and “fine tuning” by the verification engineer. My sources tell me, though, that the reason some fairly-ambitious maximizer projects failed is that they requested too much from the user, so this is a delicate tradeoff.
  
  Loading...
  
  Reply
  1. Tudor Timi says:
    
    December 26, 2016 at 9:43 am
    
    At the same time, I’ve always seen randomization and coverage as two sides of the same coin. You randomize stuff and because you don’t know what exactly gets stimulated, you collect coverage, by measuring what goes on. How does this fit in with portable stimulus? My impression up to now was that the goal is to define stimulus at an abstract level, by modeling the input legal space (similar to what constraints do for single transactions, but with an added layer on top for temporal behavior) and then the tool takes this and starts creating tests from this. The tool also reports what has been tried (i.e. what was covered). In such a situation, where the user doesn’t explicitly define coverage anymore (at least not separate from the stimulus, as in CRV), do we still need coverage maximization?
    
    Loading...
Yoav Hollander says:

December 26, 2016 at 3:35 pm

Hi Tudor

I can’t talk to the current portable stimulus discussions (I am no longer involved in that). In the Perspec tool (which I participated in) one can indeed specify temporal / data crosses, and have the solver auto-fill whatever is solvable.

However, I would not consider this a major departure from coverage-as-we-know-it. You still decide which fields / paths you want to cross (even though this is done via some GUI), and you should absolutely approve manually the unsolvable combinations (there is a convenient way to do that within the tool).

But mainly, this style of auto-filling (which I mentioned under Model-Based Testing in the post) is just a small part of the coverage you could define: People who use Perspec (to stick to that example) still collect a lot of run-time functional coverage, code coverage and so on, and even cross it with the “gen-time-solvable” coverage mentioned above. And all that coverage is a good candidate for maximization.

Loading...

Reply
1. Tudor Timi says:
  
  December 27, 2016 at 1:38 pm
  
  I get what you mean. If the aspect of interest is “in this particular state I did this”, it’s easier to define coverage in terms of whitebox signals, reference models, etc. instead of just the stimulus (i.e. transaction properties and temporal relationships between transactions). While it would be theoretically possible to do the latter, this isn’t trivial and conceptually requires computing the “inverse” function of the design. This would mean that it’s more difficult to define coverage if for everything we would need to map it to the exact inputs needed to stimulate that specific “coverage space”.
  
  The use case for maximization here (and everywhere, for that matter) is to save time when writing the coverage definitions and to have the tool figure out how to fill them. This is in some respects totally different from portable stimulus. For the former the user tells the tool: “I want to see this, make it so!”; while for the latter the user tells the tool “You can do all of these things, start doing them and report what you’ve already tried”. Like with other things (e.g. formal vs. CRV), one can’t expect to be able to do the job with only one engine (in this particular discussion, for test writing).
  
  Loading...
  
  Reply
Tudor Timi says:

December 27, 2016 at 2:03 pm

Since we’re talking about coverage here, I’ve got some more random thoughts. You mention that “[auto-filling for model-based testing] is only a small of the coverage [one] could define” and mention code coverage or other functional coverage as examples.

My computer science knowledge is limited at best, but I wonder if, from a theoretical standpoint, it isn’t possible to map all of these additional flavors of coverage to stimulus coverage (and by this I mean coverage for all possible degrees of freedom one has when stimulating the DUT).

In the extreme case, one would want to cover all possible sequences of all possible input values. This isn’t only infeasible, it’s also unnecessary, since designs are meant to operate under controlled environments. At a first level, we solve this by abstracting things and thinking in terms of transactions instead of random signal changes. I wonder if it would be somehow possible to go one step further and create higher level abstractions. Some might be generic to all designs, some might be generic to certain design classes (e.g. DSP, CPU, network router, etc) and some might be totally specific to the design under test. Coverage like this would only be defined in terms of input stimulus, though state could also be observed and added into the mix. The advantage of only specifying things like this is that a tool could take these descriptions and generate tests from them.

The task of defining coverage would be to either select the appropriate elements out of coverage libraries (for the generic ones), or to build the needed ones (for the purely specific ones).

Again, this is just a theoretical question. I’m not saying we should do away with other forms of coverage or anything. I’m just curious how far we can go in each direction.

Loading...

Reply
1. Yoav Hollander says:
  
  December 27, 2016 at 4:18 pm
  
  Hi again
  
  I like your idea about a hierarchy of coverage definitions.
  
  I think that for complex stuff (e.g. for the five kinds of complex DUTs I listed at the top, like AVs-in-traffic), there is in general no deterministic mapping from fine-grained behavior coverage back to input coverage. Let me try and convince you of that:
  
  Take our AV, and assume the inputs are road topology and the behavior of the other “players” on the road (cars, people, ducks), all of which are completely under VE control. The AV itself is made of, say, many millions of lines of SW + some ML + mechanics + lots more. There is no way to create a precise-enough model of that (and all its “analog” interaction with the road etc.), and apply model-based-testing to reach arbitrary “points” in the DUT (like code coverage in those SW modules).
  
  Yes, under some simplifying conditions (no other car, perfect visibility, …) we can use our simple model (and a SAT solver) to force some specific macro behavior. That’s useful, but most bugs will remain hidden under those simplifying conditions. For the full, complex situation we must assume non-determinism (even though repeatability is still assured in a “good” simulation environment).
  
  You can probably still reach those internal coverage points, in principle, if you run enough random combinations – but you can’t aim well. And this is where the various kinds of maximization may help.
  
  Loading...
  
  Reply
  1. Tudor Timi says:
    
    December 28, 2016 at 6:53 am
    
    I keep forgetting that the focus in this blog is larger and encompasses more than just chips. I guess a big reason why creating a precise enough model and doing pure model-based testing aren’t possible is also the fact that the specification only paints a picture with broad strokes (like impressionists do), whereas a design is much more detailed (similar to a Renaissance painting).
    
    A lot of the times, though, I struggle with the question: “If the goal is to sketch a portrait, does it matter if we paint the Mona Lisa instead of the Lady in Blue?”. More concretely, if the designer over-designed, do we need to spend effort on this and fill the extra implementation (i.e. code) coverage that gets created? I guess in this case filling it isn’t the thing to do, but maybe reviewing and deciding that it’s superfluous.
    
    Loading...
Yoav Hollander says:

December 28, 2016 at 9:07 am

I am sort of assuming that there is always more fine detail in the implementation (even if it is not over-designed). So in principle you should also cover lower-level implementations (and lower-level specs), since they could hide bugs (e.g. if a single requirement in the spec has a “normal” and an “optimized” implementation, as in my multiply op example). And ideally you should do that even in system-level tests (even if you did good subsystem verification) – there should be significant verification overlap between the two.

So this is back to verification price/performance – you may not have the resources to cover all those millions of lines of implementation code (at least not in full-system test), much like you avoid creating “too big” crosses in “normal” functional coverage. Security people, on the other hand, often use fuzzers to really go through all the code looking for vulnerabilities – they understandably have less tolerance for very rare security bugs (and they also have easier time since they usually don’t write checkers – they just wait for exceptions).

My hope with is that maximization will get us a lot more coverage performance for the same price, thus allowing us to be more ambitious.

Loading...

Reply
Silas Reeves McDermott says:

December 29, 2016 at 11:54 pm

I was recently asked the question, “Portable coverage is more important than portable stimulus, isn’t it?”

I still don’t know what my colleague means precisely by portable coverage, but his question got me thinking, and I look forward to sitting with him in another month or so to discuss it in detail. Regardless, Tudor’s post about the relationship between coverage and stimulus is very applicable to the discussion.

While coverage is the goal, we have always used input stimulus as the primary means to reach that goal. We exploit the fact that there is a direct connection between a specific stimulus and the coverage it produces. So, while we would like to take a collection of coverage from one revision, or platform, etc. and immediately assume that functionality has already been tested, we know better and run the same set of stimulus again.

In effect, portable stimulus is portable coverage. There will be times when a code or environment change causes different results, but this is one of the primary reasons we track coverage in the first place.

Also, regarding Tudor’s idea of higher and higher levels of coverage abstraction, I do think it may be possible within the scope of a single company, division, or technology area. ARM would be a prime example where such a thing could be possible.

PS: To be clear, most of my discussion here revolves around functional coverage.

Loading...

Reply
1. Yoav Hollander says:
  
  January 17, 2017 at 4:50 pm
  
  Hi Silas
  
  This is an important topic. I think portable stimulus is related to portable coverage, but they don’t always match.
  
  One issue is that (as you mention) the same stimulus may reach different coverage on different revisions / platforms, and that has to be tracked and acted upon.
  
  Another (more technical) issue under the “portable coverage” heading is that for portability, one needs a way to extract the same coverage information from different platforms with e.g. different visibility.
  
  The last issue is that ideally you would like to specify coverage goals in a portable way, but for some of them there is no direct mapping back to input coverage (due to non-determinism + platform differences). So you can’t aim at them directly (at least in some platforms). This brings back the issue of maximization.
  
  Loading...
  
  Reply

	https://otomotif71.w… on Stuttgart impressions: Scenari…
	Daan van der Keur on About “The coming AI hackers”…
	Mariah Jackson on M-SDL, the autonomous vehicles…
	sakhokhar on Machine Learning for Coverage…
	hongseoklee on How to write AV scenarios (and…
	Erik Panu on GPT-3 and verification
	Yoav Hollander on Autonomy markets and their pot…
	Nakkeeran Kumaraswam… on Autonomy markets and their pot…
	Aman on DeepXplore and new ideas for v…

The Foretellix CTO Blog – AI safety

Now focusing on AI safety (autonomy-related posts go to the company blog)