Summary: This post tries to fit Fault Tree Analysis and CDV-based “normal” bug-finding into the same conceptual framework.
- FTA (and similar techniques) is a good, established way to reason about failures and reliability, but (1) it depends on humans to know all the things which could go wrong, and (2) it does not work well for complex SW (or even HW) bugs
- CDV (and similar smart-random techniques for finding bugs and collecting coverage) is good for finding bugs-you-don’t-know-about, but (1) it lacks some needed FTA-related concepts (like fault containment regions) and (2) it can’t give “reliability numbers” (like faults-per-hour).
- FTA and CDV intersect in many places, but there is no good conceptual framework (that I know of) which will tell you when to use which, and how to combine them
This post will try to jump into these treacherous waters, and investigate some ways of combining the two. It sort-of continues the theme of my checking probabilistic components post, which also talked about connecting “plain” verification to more probabilistic stuff.
This (longish) post will proceed as follows:
- What is FTA, why it has issues, but why we need it
- Fault Containment Regions and why they are important
- Can we estimate SW failure rates?
- The various meanings of “fault injection”
- The importance of injecting high-level faults
- How to combine FTA and CDV
- Pros and cons of combining the two
FTA is a good idea with lots of problems
Quoting from a previous post:
There is the useful, standard notation of fault trees. They essentially describe an and/or tree of “things which could go wrong”, together with their known failure probabilities (there is also a sequencing node, called “priority and”). Given such a tree, you can compute the failure probability of the root of the tree (signifying e.g. “airplane falls out of the sky”) and check that it is no more than, say, once in 10**9 hours.
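The root-probability computation for such an and/or tree can be sketched as follows. This is a toy Python example; the basic events, their per-hour probabilities, and the independence assumption are all mine, invented purely for illustration:

```python
# A minimal fault-tree sketch, assuming independent basic events.

def and_gate(*probs):
    """All inputs must fail: multiply the probabilities (independence assumed)."""
    p = 1.0
    for x in probs:
        p *= x
    return p

def or_gate(*probs):
    """Any input failing suffices: 1 - product of the survival probabilities."""
    p = 1.0
    for x in probs:
        p *= (1.0 - x)
    return 1.0 - p

# Invented per-hour failure probabilities of basic events:
engine_1 = 1e-6
engine_2 = 1e-6
sensor   = 1e-7

# Root event ("airplane falls out of the sky") happens if both engines
# fail, or if the sensor fails:
root = or_gate(and_gate(engine_1, engine_2), sensor)
print(root)  # compare against the 1-in-10**9-hours budget
```

Note that real fault-tree tools also handle dependent events, priority-and sequencing, and so on; this sketch only shows the basic and/or roll-up.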
This is a good notation, but it is too simplistic:
- Fault trees are a good way to reason about the probabilities of problems we understand very well, but (by definition) we don’t understand bugs until we have captured them.
- Thus, fault trees don’t really work for SW: All attempts to compute “the failure probability for a piece of SW” are somewhat pathetic – e.g. the Ariane 5 bug was just sitting there waiting for the maiden flight (with a probability of 1), but fault-tree analysis was completely irrelevant for it.
So fault trees are very good at what they do, but are not enough. So ideally we should be able to invent a notation which (among other things) includes fault trees, or can be easily translated to and from fault trees.
Of course, you could argue that FTA is always relevant, using the following (somewhat torturous) reasoning: Suppose that, unrelated to the current fault tree, we used CDV and discovered a brand-new SW bug. We could then either:
- Fix it, in which case the fault tree stays unchanged (because the bug disappears)
- Document it in the fault tree (for now?), as another node, and connect it correctly into the bigger fault and/or tree
If you claimed the above, you would be technically correct, but I am not sure this is helpful. Process-wise, you are looking for complex bugs using other techniques, and retrofitting them into the fault tree just to say “oh it’s all in the fault tree” feels somewhat artificial to me.
Simplifying a bit, FTA originated during simpler times, when mechanical / chemical engineers could think of all possible issues, and just needed a tool to compute the resulting failure frequency.
To people (like yours truly) who come from CDV (or fuzzing) and who have spent a lifetime working on automated techniques for finding bugs nobody thought of, the idea that you can just “whiteboard your way to a bug-free world” obviously sounds like a cruel joke.
So it is tempting to just say “FTA is completely irrelevant to SW/logic bugs”, and move on. Tempting but wrong:
- Some system components (e.g. mechanical ones) are best analyzed via something like FTA, and those components interact with the SW/logic components to create the full system
- Some components / algorithms (e.g. those based on machine learning) are probabilistic in nature, i.e. they return the correct answer just most of the time (see this whole post about probabilistic components). So one needs to use a fault-tree-like notation to reason about how their errors (and the error probabilities) combine.
- Also, some FTA concepts – like Failure Containment Regions (FCRs) – are indispensable, as we’ll see shortly. And FCRs are often made of both HW and SW.
FCRs are an important concept
I have mentioned before Philip Koopman et al. of CMU (and their excellent article Challenges in Autonomous Vehicle Testing and Validation). I’ll use that article (and Koopman’s blog) as a starting point for some of the discussion here.
His one-hour video about the Toyota unintended acceleration case (in this blog post) is really a good introduction to a whole bunch of things, like safety litigation, the MISRA standard, failure containment, automotive bugs, memory faults and so on. Many of these have specific blog posts, but the video brings the big picture.
One point he clarifies well is this whole failure containment stuff (slides 28..35). The idea is that hardware failures (e.g. memory bit flips due to cosmic rays), as well as unknown SW errors, could corrupt the whole processor (especially if there is no memory protection), and thus you need to construct a “failure containment region” (FCR) around the whole processor.
In the Toyota example, one FCR was the main CPU for throttle control (with about 250k lines of C code), and another was a monitor CPU whose job was to check the main CPU, acting as a fail-safe mechanism. Doing throttle control and monitoring on the same CPU (or sharing logic components, as Toyota actually did in this case – see slide 34) would be bad, since it would create a single point of failure.
So the concept of FCRs (also called “safety islands”), and how they fit into a fault tree, is indispensable for building high-integrity systems. But, like the rest of FTA, it runs into the same issue: you can’t really compute the failure rate for that throttle-control subsystem. You can compute the HW failure rate, but not the SW failure rate.
Estimating SW failure rates
In the quoted post above I said: ‘All attempts to compute “the failure probability for a piece of SW” are somewhat pathetic’. However:
- Clearly there is a strong need to estimate SW failure rates
- A lot of research has been done on it (e.g. the Wikipedia entry for SW reliability testing)
So I decided to dig a bit more into it. My conclusion after all that digging is “Maybe”: There does not seem to be a practical tool / methodology to really estimate SW failure rates (in a way that you could really plug into fault tree computations), but perhaps with research we could eventually get somewhere.
Here is how I got to that (somewhat fuzzy) conclusion:
Reliable sources tell me that estimating SW failure rates was actually successful in some cases, using Expected Distribution (ED), i.e. running a HW+SW system using the input distributions we expect in real life.
So, if we ran our AV using ED long enough, we could perhaps get the required total failure rate (HW + SW + mechanics + etc.). However, as the folks from RAND (and others) have been saying, this much driving (or even simulated driving) is impractical, certainly if we need to redo it for each release of our AV SW.
So, to do the testing in realistic time, we need to go back to CDV-style stress-the-corner-cases testing. This gets you “accelerated logical aging”, but can we extract ED fault rates from it?
In the chapter “Assessing the frequency of Boolean bugs” of the above-mentioned post I said I know of no way to do this failure-rate-recalibration automatically. But you can perhaps do it with a lot of manual work. For instance, once the bug rate (with all your CDV tricks) goes below a certain threshold, take the last two-weeks-worth of bugs, and for each of them try to analyze how often the conditions leading to it appear in real life. Aggregate this over all those bugs, multiply by some healthy factor, and perhaps this gives you the expected SW failure rate if you did no more testing.
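The manual recalibration just described is essentially back-of-the-envelope arithmetic. Here is a sketch of it in Python; the per-bug real-life rates and the “healthy factor” are completely invented numbers, just to make the computation concrete:

```python
# Hypothetical recalibration: for each bug found in the last two weeks of
# CDV testing, estimate how often its triggering conditions occur in real
# life, then aggregate with a safety factor. All numbers are invented.

recent_bugs_real_life_rates = [  # estimated occurrences per driven hour
    1e-7,   # bug 1: needs a sensor dropout during a lane change
    3e-8,   # bug 2: needs two near-simultaneous subsystem restarts
    5e-9,   # bug 3: needs a rare map/GPS disagreement
]

SAFETY_FACTOR = 10  # the "healthy factor" covering bugs we did not see

estimated_sw_failure_rate = SAFETY_FACTOR * sum(recent_bugs_real_life_rates)
print(estimated_sw_failure_rate)  # per-hour rate to plug into the fault tree
```

The interesting (and hard) part, of course, is producing those per-bug real-life rates – the aggregation itself is trivial.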
OK, so this is why I now say “Maybe”. BTW, it does not seem like AV people are actually doing that. In AV-land, it currently seems like people are demanding higher-quality SW for more critical functions, and are even dictating processes for that, e.g. “You should do formal verification for ASIL-D (most critical) SW”. But I don’t know of people doing anything like the kind of computation I described above to estimate post-testing failure rate and plug it back into the fault tree. Please comment if you know otherwise.
In fact, it seems that when interpreting the related automotive standard (ISO 26262) regarding HW/SW safety, most people just test / simulate the results of cosmic-ray-style faults, and don’t try to estimate any SW or logical-HW bugs.
Fault injection is an often-confusing topic
A related, confusing topic is fault injection. Koopman’s blog post about robustness testing clarifies the difference between the following concepts, which sound similar but are actually quite different:
- Mutation testing: Adding mutations to your source code to check the quality of your testing
- Classical fault injection testing: Flipping bits in your memory to see that your SW can handle cosmic rays etc.
- Robustness testing: Testing your module to see that the API it exports can handle illegal params (e.g. NULL pointers), because caller modules may have bugs or get corrupted
- Fuzzing: Robustness testing where the illegal / crazy input is not params, but rather an input stream (like stdin).
Take a look at that blog post: It is short, well written, and also describes their own robustness testing tool (Ballista).
Just for completeness, I’d like to add two more fault-injection-like techniques to Koopman’s four:
- Simple error injection: Injecting sensor errors, communication errors and so on – more-or-less-random errors which are bound to happen.
- Injecting high-level faults without simulating why they happen: This is similar to robustness testing, but simulates e.g. a whole subsystem failing in interesting ways.
Some comments about simple error injection:
Everybody is doing that: You know that e.g. your optical sensors are imperfect, so you randomize sensor errors according to some distribution.
This is a good idea in general, but it has all kinds of subtle issues. For instance, you may assume that more errors will always make the DUT behave worse, so just to be sure, you always inject more errors than expected in real life. This is mostly OK, but sometimes injecting more errors will mask a bug: For instance, suppose there is a bug in algorithm A, but it will never be detected, because the system monitors the error rate and always switches to algorithm B above a certain threshold.
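The masking effect just described can be sketched in a few lines. Everything here (the threshold, the algorithm names, the DUT behavior) is invented for illustration:

```python
# Toy DUT that monitors its observed input error rate and falls back from
# algorithm A (which has a lurking bug) to the conservative algorithm B
# when errors are high. Threshold and names are invented.

SWITCH_THRESHOLD = 0.05  # above this observed error rate, use algorithm B

def dut_algorithm(observed_error_rate):
    if observed_error_rate > SWITCH_THRESHOLD:
        return "B"   # conservative fallback path -- bug-free here
    return "A"       # buggy path, only exercised at low error rates

# Injecting "more errors than real life, just to be sure" always lands
# above the threshold, so A's bug is never exercised:
assert dut_algorithm(observed_error_rate=0.2) == "B"

# A realistic, lower error rate would have exercised the buggy path:
assert dut_algorithm(observed_error_rate=0.01) == "A"
```

So error-injection distributions should cover the realistic low end, not just the pessimistic high end.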
As for injecting high-level faults:
Injecting high-level faults
This is a really important one. Suppose you have a big system-of-systems, and you want to see what happens when one subsystem S fails (gives the wrong answer, gets disabled temporarily, or even gets completely disabled). In “classical, natural” CDV you should simulate / run until (because of some combination of events) S indeed fails.
But it also makes sense to sometimes just simulate the fact that “S failed for no known reason”. Why?
- Because we have not found a way for it to fail, but we are sure that, like everything else in life, it could fail (due to some yet-uncovered bug, or because a cosmic-ray-style bit-change will somehow manage to bring it down).
- Because simulating the (known) reasons for the failure is too time-consuming, or happens too infrequently. And simulating a combination of such failures, which is what you need to cause the “real” problem (e.g. simulating S failing right after S’ has also failed) is really hard to achieve “naturally”.
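One minimal way to sketch this “fail for no known reason” idea: give each subsystem a hook the test can use to force a failure mode directly, without simulating its cause. All class and failure-mode names below are invented:

```python
# Sketch of forcing high-level failures directly from a test.

class Subsystem:
    def __init__(self, name):
        self.name = name
        self.forced_failure = None  # set by the test to inject a failure mode

    def compute(self, request):
        if self.forced_failure == "disabled":
            raise RuntimeError(f"{self.name} is down")
        if self.forced_failure == "wrong_answer":
            return "garbage"        # plausible-but-wrong output
        return f"ok({request})"     # normal behavior

s = Subsystem("S")
s_prime = Subsystem("S'")

# Directly create the hard-to-reach combination: S fails right after S'.
s_prime.forced_failure = "disabled"
s.forced_failure = "wrong_answer"
assert s.compute("req") == "garbage"
```

In a real CDV environment the `forced_failure` knob would of course be randomized and constrained like any other scenario attribute.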
Koopman et al. do talk about high-level faults in the “challenges” article. For instance, they say:
A promising approach to helping validate autonomy features is to perform fault injection at the level of abstraction of the component, as part of a strategy of attempting to falsify claims of safety. This involves not only simulating objects for primary sensor inputs, but also inserting exceptional conditions to test the robustness of the system (e.g., inserting invalid data into maps). The point of doing such fault injection is not to validate functionality, but rather to probe for weak spots that might be activated via unforeseen circumstances.
To me this is still part of good old verification / validation. In a big, complex system things fail all the time, and an essential part of verification is to check that the system still works reasonably after a partial failure. Note that it is well-known (see the start of this, and the first reference there) that a huge percent of all SW bugs are in error-handling code.
Thus, injecting high-level faults is an important tool. Why don’t people use it more? And why, when they do, this is mainly in directed one-off tests and not in CDV-style massive random testing?
At least one big technical issue has to do with checking. People naturally write their checks assuming things work. Disentangling your rat-nest of checks, and clarifying which checks should be ignored for each artificially-disabled component is a pain. To do it right, you need a lot of pre-planning and a solid methodology (which most people don’t have).
Combining FTA and CDV
So ideally we should combine the best features of FTA and CDV. But how?
I don’t know of any such effort, successful or otherwise. Some people tried to do related things, such as modeling abstract-faults-and-their-outcomes. One such effort is FFIP (pdf, by Kurtoglu and Tumer) – an intriguing simulation-based framework for injecting and propagating high-level faults. For instance, as a pipe transitions between the (abstract) states normal, fail_clogged and fail_leaked, its functionality changes accordingly, thus changing the functionality/state of downstream components / functions.
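Here is my rough reading of the FFIP idea as a code sketch (this is my own toy rendering, not the authors’ framework; the leak fraction and component names are invented):

```python
# FFIP-flavored sketch: a component's abstract fault state changes its
# function, and downstream components see the changed behavior.

class Pipe:
    def __init__(self):
        self.state = "normal"  # or "fail_clogged", "fail_leaked"

    def flow_out(self, flow_in):
        if self.state == "fail_clogged":
            return 0.0              # nothing gets through
        if self.state == "fail_leaked":
            return flow_in * 0.5    # some flow is lost (invented fraction)
        return flow_in              # normal operation

class Tank:
    def __init__(self):
        self.level = 0.0

    def fill(self, flow):
        self.level += flow

pipe, tank = Pipe(), Tank()
pipe.state = "fail_leaked"          # inject the abstract fault
tank.fill(pipe.flow_out(10.0))      # downstream tank sees degraded flow
assert tank.level == 5.0
```

The point is that faults propagate through changed *function*, not through explicit fault-tree edges – which is exactly what makes it interesting to compare with FTA.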
But this does not directly attack the problem of combining FTA and CDV. So here is my (incomplete) suggested direction: Have several fault trees connected “sideways” to the component tree.
Here is a crude depiction of this:
Note that the trees are distinct: The fault tree’s structure does not necessarily parallel that of the CDV component/function structure tree – they are just bound to touch in many places. For instance, the fault tree may have an “and” connecting (the failure of) two components which are widely-separated in the component tree.
However, sometimes the fault tree does follow (at least partially) the component tree. For instance, perhaps A fails if either A1 or A2 fail. In that case, the fault tree should “reach back” and emit a fault in A.
If we start from an existing fault tree, then each leaf fault in it needs to sit in some component. If the fault tree has a leaf node “pipe breaks”, we need to create a “pipe” component whose sole functionality is to emit “pipe breaks” using some predefined distribution.
Let’s take the Toyota injection-control example mentioned above (where the two trees do, more or less, overlap). Converting it into a (textual) fault tree, and using “I-CPU” for “Injection CPU”, we get something like:
(I-CPU failed) and (injection monitor failed) => injection out of control;
(injection out of control) or (gas pedal stuck in rug) or (…) => unintended acceleration;
(unintended acceleration) and (…) => accident;
Initially, we just connect the fault tree to the component tree according to this subtree. But now suppose we change “I-CPU failed” from a leaf node to an “or” node:
(I-CPU cosmic-ray bit-flip) or (I-CPU HW bug) or (I-CPU SW bug) or (I-CPU forced by test directive to fail for no reason) => I-CPU failed;
So the fault tree will now have an “or” gate with four inputs, connected to the following four failure nodes:
- A node which is triggered randomly according to the known bit-flip error rate. We may need to create a component for that node.
- A node which is triggered whenever a checker for the HW block fires
- A node which is triggered whenever a checker for the SW block fires
- A node which is triggered directly (by some test-control scenario) to indicate “fail for no reason”
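The four-input “or” above is driven by simulation events rather than by fixed probabilities, which a sketch makes clearer. The class names, the bit-flip rate, and the overall structure here are all mine, invented for illustration:

```python
import random

# Sketch of a fault-tree "or" gate whose inputs are simulation-driven
# failure nodes rather than fixed-probability leaves.

class FaultNode:
    def __init__(self, name):
        self.name = name
        self.active = False   # set by a checker, a test directive, etc.

class OrGate:
    def __init__(self, *inputs):
        self.inputs = inputs

    def failed(self):
        return any(node.active for node in self.inputs)

bit_flip = FaultNode("I-CPU cosmic-ray bit-flip")
hw_bug   = FaultNode("I-CPU HW bug")    # set when a HW checker fires
sw_bug   = FaultNode("I-CPU SW bug")    # set when a SW checker fires
forced   = FaultNode("forced to fail")  # set by a test-control scenario

i_cpu_failed = OrGate(bit_flip, hw_bug, sw_bug, forced)

# Bit flips are triggered randomly per simulated hour (invented rate):
BIT_FLIP_RATE = 1e-5
bit_flip.active = random.random() < BIT_FLIP_RATE

# A test directive forces the "fail for no reason" input:
forced.active = True
assert i_cpu_failed.failed()
```

Note how only the bit-flip node has a known probability – the checker-driven and test-driven nodes just fire (or don’t) during simulation.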
Note that while cosmic-ray bit flips have a known probability, the other components obviously don’t (unless we crack SW failure rate estimation) – bugs may simply be found (or not) during simulation.
Advantages of superimposing the FTA tree on the CDV tree
I am hoping that connecting the two trees will buy us a bunch of good things:
- It connects the two notations conceptually, letting you use the right notation for each job
- It automatically propagates errors up, using the convenient fault-tree notation to express how they propagate (see the “reach back” discussion above)
- It lets you easily propagate the effect of cosmic-ray bit flips into higher-level (or even same-level) logic blocks
- It integrates high-level fault injection into this (via that last term of the “or”)
- It gives you an automatic hint of “what to stress”
- More generally, it lets you use this tree in a top-down manner to somehow plan and orchestrate scenarios which will cause higher nodes to fail
- It lets you enhance the fault tree with various “natural” extensions. For instance, system dynamics blocks are a very natural extension to fault trees (expressing dependence between nodes over time).
- If we somehow succeed in estimating SW failure rates (which, as I said above, is a distinct “Maybe”), then perhaps this mixed notation will help in rolling up the numbers.
Note that I have outlined here just a direction, not a fully-developed idea. In particular, I left many questions unanswered:
- How do we “mechanically” connect the two trees? For instance, do we instantiate standard “fault objects” in each verified sub-component, and do those fault objects form the nodes of the fault tree?
- How do we integrate checking into this framework in an elegant way?
- How does this interplay with full-system vs. sub-system verification? Clearly, it is impractical to find all bugs in full-system simulation.
- Which fault-tree extensions should we add, and how?
- Should we use a single “failed” event as in classical FTA, or a more sophisticated failure-type enum with propagation rules, as in FFIP?
Comments are very welcome.
I’d like to thank Nir Beth Halachmi, Apurva Kalia, Kerstin Eder, Ron Kenett, Yaron Kashai, Thomas (Blake) French and Coby Hanoch for commenting on previous drafts of this post.