A companion post talks about spec bugs and ways to avoid them. Spec / requirement bugs (as opposed to implementation bugs) occur when the spec for a subsystem does not take into account some higher-level, full-system requirements.
Spec bugs are pretty important, so here I’ll talk about some common verification-related concepts, and (my interpretation of) how they relate to spec bugs. A common theme is that all these concepts claim to be absolute, but are really subsystem-relative.
Verification vs. Validation
Everybody swears by the definition that verification is “Are you building the thing right?” and validation is “Are you building the right thing?”.
This is a valuable distinction for a specific subsystem, but (in any complex system) this is a subsystem-relative concept, not an absolute one (as in “one man’s ceiling is another man’s floor”).
Yes, checking that your shiny new radar display box shows the right, important things to the pilot is indeed validation for that box, but it is also part of the verification of the full airplane-flight-system, i.e. the system which includes all cockpit equipment, the (human) pilot’s procedures, and so on.
Edited 9-Aug-2015 to add: But see also this post.
Performance bugs vs. functional bugs
Performance bugs (or issues, but I call them “bugs” because presumably they were unexpected) are different from functional (“normal”) bugs. But this again is a subsystem-relative, not an absolute, concept:
- In highly-layered (e.g. communications) systems, a subsystem functional bug often manifests itself as a performance bug at some higher level (because some intermediate logic will simply retry until the information gets through).
- A performance bug in a subsystem may cause a malfunction of the bigger system in some scenario (e.g. your AV (autonomous vehicle) cannot accelerate enough just when it needs to get out of trouble).
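To make the retry case concrete, here is a minimal Python sketch. Everything in it is made up for illustration (the `FlakyLink` class, the `reliable_send` retry layer, the "drops every Nth frame on the first try" bug); it is not modeled on any real protocol stack. The lower layer has a functional bug, yet the layer above it sees perfectly correct data and only a higher attempt count.

```python
class FlakyLink:
    """Hypothetical lower layer. With drop_every=N (N > 0) it has a functional
    bug: it loses every frame whose sequence number is a multiple of N the
    first time that frame is sent. drop_every=0 means a healthy link."""

    def __init__(self, drop_every=3):
        self.drop_every = drop_every
        self.attempts = 0          # total send attempts, our "performance" metric
        self._dropped_once = set()

    def send(self, seq, payload):
        self.attempts += 1
        buggy_frame = self.drop_every and seq % self.drop_every == 0
        if buggy_frame and seq not in self._dropped_once:
            self._dropped_once.add(seq)
            return None            # frame lost: the functional bug
        return payload             # frame delivered


def reliable_send(link, seq, payload, max_retries=5):
    """Hypothetical intermediate layer: simply retry until the frame gets through."""
    for _ in range(max_retries):
        result = link.send(seq, payload)
        if result is not None:
            return result
    raise RuntimeError(f"frame {seq} could not be delivered")


def transfer(link, n_frames):
    """Higher level: sends n_frames and reports what arrived and what it cost."""
    received = [reliable_send(link, seq, f"frame-{seq}") for seq in range(n_frames)]
    return received, link.attempts


if __name__ == "__main__":
    for drop_every in (0, 3):
        received, attempts = transfer(FlakyLink(drop_every), 9)
        print(f"drop_every={drop_every}: {len(received)} frames ok, {attempts} attempts")
```

In this sketch, nine frames over the buggy link all arrive intact but take 12 attempts instead of 9. Whoever owns the higher level will file this as "the link is slow", not as "the link loses frames".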
Conceptual bugs
When people talk of “a conceptual bug”, they often mean “a spec bug which was discovered late”.
Example: “OK Sue, this is a bad one: I found the bug but I can’t fix it – it is a conceptual bug. Somebody higher up the spec chain will have to re-spec, and this will cascade to a whole bunch of places. Or perhaps we’ll just never fix it.”
Correct by construction
As I have said before, when somebody says “X is correct by construction”, all they mean is “X was automatically generated from Y, which is hopefully shorter and easier to understand”. The “easier to understand” part does not always hold: the Y notation may be shorter but no simpler (for the intended audience).
However, in general I am optimistic about creating design artifacts from higher-level notations. For instance, there are multiple projects (e.g. this one) for constructing a state-machine controller out of a bunch of temporal assertions (of the style “x should never happen after y” and so on). The language for specifying them is still cryptic, but it will get better, and one day we’ll have a practical translator which takes a bunch of reasonably-readable assertions and produces procedural code for our controller.
Does this mean there is no longer any need for verification, because this is “correct by construction”? Hell no:
- You may not have to verify the translation (like you don’t usually check that GCC did the right thing with your C code), but often the translation is also based on various scripts and hint files, and thus the results still need to be checked. This is one reason why chip designers routinely verify (in simulation) that the netlist produced (automatically) from their higher-level design still does the right thing.
- You still need to verify that the assertions lead to the behavior you meant, and that you did not forget some just because you assumed the machine will implicitly understand them – a common mistake. For instance, one way for your robot to never do anything bad is to never do anything.
- But most of all, you still need to check that this automatically-produced subsystem works well within the bigger system.
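The "never do anything" trap is easy to show in a few lines. The following Python sketch is only an illustration: the event names, the `never_after` and `eventually` helpers and the single safety rule are all assumptions of mine, not the notation of any real synthesis tool. It checks finite event traces against assertions of the “x should never happen after y” style.

```python
def never_after(trace, bad, trigger):
    """Safety assertion: `bad` never occurs once `trigger` has occurred."""
    triggered = False
    for event in trace:
        if triggered and event == bad:
            return False
        if event == trigger:
            triggered = True
    return True


def eventually(trace, wanted):
    """Progress assertion on a finite trace: `wanted` occurs at least once."""
    return wanted in trace


# Hypothetical spec: (bad, trigger) pairs, read as "bad should never happen after trigger".
SAFETY = [("move", "obstacle_detected")]


def satisfies_spec(trace, require_progress):
    ok = all(never_after(trace, bad, trigger) for bad, trigger in SAFETY)
    if require_progress:
        # The assertion people tend to forget: the robot must actually do its job.
        ok = ok and eventually(trace, "deliver")
    return ok


idle_robot = []                                  # never does anything
working_robot = ["pick", "move", "deliver"]      # does its job safely
reckless_robot = ["obstacle_detected", "move"]   # violates the safety rule

if __name__ == "__main__":
    for name, trace in [("idle", idle_robot),
                        ("working", working_robot),
                        ("reckless", reckless_robot)]:
        print(name,
              satisfies_spec(trace, require_progress=False),
              satisfies_spec(trace, require_progress=True))
```

With safety assertions only, the idle robot passes as comfortably as the working one. It is the explicit progress assertion that separates them, and nothing in the translation step will add that assertion for you.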
Root cause
Once you understand a bug, a natural question to ask is “what is the root cause of this bug?” (people also talk about the root cause of an incident; they usually mean “the root cause of the bug causing this incident”).
The “root cause” question can have important legal and managerial implications, but in the context of engineering one can usually give it any number of arbitrary answers.
“The root cause of most aviation accidents is gravity” says an old proverb (which I invented, but Google led me to similar profound thoughts here).
Just to show how arbitrary the “root cause” assignment can be (and why it is usually subsystem-relative), here is the algorithm often used to do this assignment:
- Find what the problem is
- Examine the and-or tree of “things which might cause this problem to go away”
- Fix the problem by implementing the least disruptive sub-tree (usually one fully contained within the sub-spec owned by you)
- Assign the term “root cause” to the fact that this fix was not there originally
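The steps above can be sketched in Python to show how the answer changes with whoever runs the algorithm. This is a toy model under loud assumptions: the candidate fixes, their owners, their disruption costs and the `CROSS_TEAM_PENALTY` constant are all invented for illustration, and "least disruptive" is reduced to a single number.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed extra disruption for a fix that touches somebody else's sub-spec.
CROSS_TEAM_PENALTY = 10


@dataclass
class Node:
    kind: str                       # "fix" (leaf), "and", or "or"
    name: str = ""                  # only meaningful for "fix" nodes
    owner: str = ""                 # subsystem whose sub-spec the fix changes
    cost: int = 0                   # disruption if done by the owner
    children: List["Node"] = field(default_factory=list)


def fix(name: str, owner: str, cost: int) -> Node:
    return Node("fix", name, owner, cost)


def least_disruptive(node: Node, me: str) -> Tuple[int, List[str]]:
    """Return (disruption, fix names) for the cheapest sub-tree, as seen by `me`."""
    if node.kind == "fix":
        penalty = 0 if node.owner == me else CROSS_TEAM_PENALTY
        return node.cost + penalty, [node.name]
    results = [least_disruptive(child, me) for child in node.children]
    if node.kind == "and":          # every child is needed
        return (sum(cost for cost, _ in results),
                [name for _, names in results for name in names])
    if node.kind == "or":           # any one child is enough
        return min(results, key=lambda r: r[0])
    raise ValueError(f"unknown node kind: {node.kind}")


def root_cause(tree: Node, me: str) -> List[str]:
    """The 'root cause' is the fact that the chosen fix was not there originally."""
    _, fixes = least_disruptive(tree, me)
    return [f"missing: {name}" for name in fixes]


# Hypothetical and-or tree of things which might make one problem go away.
TREE = Node("or", children=[
    fix("timeout in radar display", "display", 2),
    fix("rate limit in radar sensor", "sensor", 2),
    Node("and", children=[
        fix("revised pilot procedure", "operations", 1),
        fix("updated training manual", "operations", 1),
    ]),
])

if __name__ == "__main__":
    for team in ("display", "sensor", "operations"):
        print(team, "->", root_cause(TREE, team))
```

Same problem, same tree, three different "root causes": each team ends up blaming the absence of the fix that is cheapest inside its own sub-spec.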