The Foretellix CTO Blog – AI safety

Machine Learning for Coverage Maximization

Summary: This post describes in general terms the problem of “using ML for coverage maximization”, explains why it is important for CDV and for fuzzing, and gives some references.

My first post in the “ML and verification” series talked about verifying ML-based systems (lots more to say about that). This post talks about the other side – using ML to help verification (of anything).

There are several ways in which ML can improve the verification process. Some examples are:

There are lots of other ideas (many related to the “big data” aspect of large verification projects), but I think coverage maximization may be the most promising, so the rest of this post is devoted to it.

Note that this post is written as a stand-alone document and does not require any previous verification knowledge: it was originally written to hook ML people into applying their magic to coverage maximization. We’ll see how that goes.

Also, as is often the case in multi-disciplinary areas, terms like CDV and fuzzing do not have a totally-agreed-upon meaning. So whenever possible, I have added links to the terminology page (and to other posts) to serve as a dictionary.

I’ll try to explain below why Coverage Maximization via Machine Learning (let’s call it CMML) would be an excellent thing, if it worked. So this (very informal) post will go through the following:

[23-Dec-2016: Added the post Verification, coverage and maximization: The big picture, which indeed gives the bigger picture]

What are CDV and fuzzing

CDV (Coverage Driven Verification) is the leading technique for finding bugs in HW designs (though there are other techniques, like formal verification).

The goal of CDV is verification, i.e. finding as many important bugs as possible in the shortest time, while using coverage to keep track of where we have been – usually functional coverage (see below), which is mapped to a verification plan.

In CDV, we construct a Verification Environment (VE) whose job is to control and check the Device Under Test (DUT), which is running in some (usually simulated) execution platform. The DUT and VE together are called the DVE.
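To make the terms concrete, here is a minimal sketch of a DVE in Python. Everything in it is made up for illustration: the DUT is a hypothetical 8-bit saturating adder, the VE drives random inputs and checks results against a simple reference model, and the functional coverage consists of four hypothetical buckets. Real VEs are far larger, but the split between driving, checking and coverage collection is the same.

```python
import random
from collections import Counter

# Hypothetical functional coverage buckets for this toy example.
ALL_BUCKETS = {"zero", "normal", "exact_max", "saturated"}

def dut_saturating_add(a: int, b: int) -> int:
    """Hypothetical DUT: an 8-bit saturating adder."""
    return min(a + b, 255)

def coverage_bucket(a: int, b: int) -> str:
    """Map one stimulus to the coverage bucket it exercises."""
    total = a + b
    if total == 0:
        return "zero"
    if total < 255:
        return "normal"
    if total == 255:
        return "exact_max"
    return "saturated"

def run_ve(num_runs: int, seed: int) -> Counter:
    """VE: drive random stimuli, check the DUT, and record coverage hits."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(num_runs):
        a, b = rng.randrange(256), rng.randrange(256)
        result = dut_saturating_add(a, b)
        expected = a + b if a + b <= 255 else 255  # reference model (checker)
        assert result == expected, f"DUT bug for inputs {a}, {b}"
        hits[coverage_bucket(a, b)] += 1
    return hits

hits = run_ve(num_runs=1000, seed=1)
unfilled = ALL_BUCKETS - set(hits)
print("hits per bucket:", dict(hits))
print("unfilled buckets:", sorted(unfilled))
```

With purely random inputs, a rare bucket such as "zero" (both inputs must be 0) is unlikely to be hit in 1000 runs. Filling that kind of hole is exactly the job of coverage maximization.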

CDV is also starting to be used for verifying autonomous vehicles and other complex systems – see here and here.

Fuzzing is a similar technique used by security people to find vulnerabilities. The goal is to find as many vulnerabilities as possible in the shortest time, while using various measures of coverage to keep track of where we have been.

So, despite their many differences, fuzzing and CDV have a lot in common, and figure 1 can also represent fuzzing.

What is coverage maximization

Coverage maximization refers to the automated process of coverage filling, i.e. of reaching higher and higher coverage. This is also called coverage auto-filling, or Coverage Driven Generation (CDG, not to be confused with CDV). There is also the term Model Based Testing (MBT), which many people use to refer to the kind of coverage maximization which is based on formal model-based techniques.

Note that the process of just executing more and more runs will continue to (asymptotically) improve coverage, and may (perhaps, in a million years) fill all coverage. But coverage maximization is about a smart, automated, much faster way. Coverage maximization usually implies some feedback loop (see the back arrow in figure 1), where you maximize some, then collect current coverage and see what’s missing, then maximize some more.
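The difference between "just run more" and a feedback loop can be sketched in a few lines of Python. This is a toy under loud assumptions, not an ML technique: the coverage item (packet length divided into 16 buckets), the single generator knob `mean_length`, and the rule that maps an unfilled bucket to a knob value are all hypothetical. In a real DVE the relation between knobs and coverage is not known in advance, and learning it is the hard part.

```python
import random

NUM_BUCKETS = 16      # hypothetical coverage item: packet length // 64
BUCKET_WIDTH = 64
MAX_LENGTH = NUM_BUCKETS * BUCKET_WIDTH - 1

def run_batch(mean_length, runs, rng):
    """Stand-in for running the DVE: returns the set of buckets hit by one batch."""
    hit = set()
    for _ in range(runs):
        length = int(rng.gauss(mean_length, 60))
        length = max(0, min(MAX_LENGTH, length))  # clamp to the legal range
        hit.add(length // BUCKET_WIDTH)
    return hit

def random_only(total_runs, seed):
    """No feedback: keep running with the default knob setting."""
    rng = random.Random(seed)
    return run_batch(100, total_runs, rng)

def with_feedback(batch_size, max_rounds, seed):
    """Feedback loop: run a batch, see what is missing, aim at a hole, repeat."""
    rng = random.Random(seed)
    covered = set()
    rounds = 0
    while len(covered) < NUM_BUCKETS and rounds < max_rounds:
        target = min(set(range(NUM_BUCKETS)) - covered)    # lowest unfilled bucket
        knob = target * BUCKET_WIDTH + BUCKET_WIDTH // 2   # assumed knob-to-bucket rule
        covered |= run_batch(knob, batch_size, rng)
        rounds += 1
    return covered, rounds

baseline = random_only(total_runs=800, seed=0)
covered, rounds = with_feedback(batch_size=50, max_rounds=16, seed=0)
print(f"random only: {len(baseline)}/{NUM_BUCKETS} buckets after 800 runs")
print(f"with feedback: {len(covered)}/{NUM_BUCKETS} buckets after {rounds * 50} runs")
```

With the same run budget, the random-only version stays stuck in the buckets near its default setting, while the feedback version walks through the holes one by one. The hand-written knob rule is the part one would hope to replace with something learned.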

It is important to understand that “coverage” is used here in the broadest sense, including both bucket-based “normal” coverage, and ever-increasing “pseudo-coverage”. Let me clarify that:

Filling coverage “manually” (by writing new tests) is very time-consuming, so a generic coverage maximization system would be very helpful.

Using ML for coverage maximization

Using ML for coverage maximization has been tried before (see references below), but with limited success (often it took too much work to set up for each DVE). My hope is that the new advances in ML, plus some innovative thinking, will finally make this practical.

Here is one possible way to do it:

There are lots of ways to model this. My favorite is:

There are other possibilities, but I think this should be pretty good, and should generalize to many CDV and fuzzing systems. Some further comments:

Some references

Here are some CDV-related coverage maximization efforts:

Here are some fuzzing-related maximization efforts:

I really hope some practical, general-purpose CMML tools will spring up. If you know of any, or are interested in implementing one, feel free to drop me a line.

 
