This blog post introduces the V&V method – please read the full 26-page paper here.
Summary: AGI may arrive in the near future, and if it does before the alignment problem is solved, the result could be catastrophic. The V&V method is a concrete, practical framework for deploying AGI more safely. It is based on how complex, AI-heavy, safety-critical systems such as Autonomous Vehicles (AVs) are already verified today.
The basic idea is that for most complex tasks, rather than telling the AGI “Do X”, we’ll tell it “Build and verify a machine-for-X”. The verification process, rooted in proven engineering practices, is designed to combat the growing risk of ‘specification bugs’ and unknowns in AI-designed systems that we cannot fully understand.
The V&V (Verification and Validation) method will not solve alignment on its own (see crucial prerequisites below), but it will complement other proposals (e.g. Constitutional AI), add a safety layer to many alignment solutions, and perhaps buy us some time.
Examples of such a machine-for-X – essentially a bounded, non-self-improving system designed to accomplish X – are AVs, a “machine” to cure cancer, and so on. Crucially, the machine-for-X is distinct from the AGI itself, and cannot radically self-improve – you need the AGI to design and verify an improved version.
Verifying the machine-for-X will involve AGIs and humans, and will include the creation (and critiquing) of a safety case. V&V will be done using scenario-based, coverage-driven methods, and will improve over time (as AI improves). Humans (aided by transparent evidence from the V&V process) retain final decision authority over deployment.
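To make this loop concrete, here is a minimal, self-contained Python sketch of the workflow just described: an AGI-driven design step produces a bounded machine-for-X, scenario-based coverage-driven V&V produces a safety case, an independent agent critiques it, and humans keep the final deployment veto. All class and function names (DesignAgent, VVAgent, CritiqueAgent, vv_loop, etc.) are hypothetical placeholders invented for illustration; the paper does not prescribe this interface.

```python
"""Illustrative sketch of the V&V loop (placeholder names, not from the paper)."""
from dataclasses import dataclass, field


@dataclass
class SafetyCase:
    machine: str
    coverage: float               # fraction of scenarios exercised and passed
    evidence: list = field(default_factory=list)


class DesignAgent:
    """AGI role: builds a bounded, non-self-improving machine-for-X."""
    def build_machine(self, task, feedback=None):
        return f"machine-for-{task}"          # placeholder artifact


class VVAgent:
    """Runs scenario-based, coverage-driven V&V in fully virtual simulation."""
    def verify(self, machine, scenarios):
        evidence = [f"{machine} passed {s}" for s in scenarios]   # stub checks
        coverage = len(evidence) / max(len(scenarios), 1)
        return SafetyCase(machine, coverage, evidence)


class CritiqueAgent:
    """Independently critiques the safety case, hunting for spec bugs."""
    def review(self, case):
        return []                 # list of blocking issues; empty means none found


def human_approves(case, issues):
    """Humans, aided by the transparent V&V evidence, keep the final veto."""
    return case.coverage >= 0.99 and not issues


def vv_loop(task, scenarios, max_iterations=10):
    designer, verifier, critic = DesignAgent(), VVAgent(), CritiqueAgent()
    feedback = None
    for _ in range(max_iterations):
        machine = designer.build_machine(task, feedback)   # 1. design
        case = verifier.verify(machine, scenarios)         # 2. scenario-based V&V
        issues = critic.review(case)                       # 3. critique the safety case
        if human_approves(case, issues):                   # 4. human decision authority
            return machine        # deploy the machine-for-X, not the AGI itself
        feedback = (case, issues) # feed findings into the next design iteration
    raise RuntimeError("no machine-for-X passed V&V and human review")


if __name__ == "__main__":
    print(vv_loop("X", ["scenario-1", "scenario-2"]))
```

The point of the structure is that the deployable artifact is the bounded machine-for-X, while the design, V&V, and critique agents stay inside the loop and never ship themselves.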
Crucial prerequisites and limits:
Aligned obedience is unsolved: We still need a reliable mechanism to make the AGI stick to the V&V loop and accept human vetoes, including (a) an incentive structure for the design, V&V and critique agents, and (b) a cure for multi‑agent collusion. In other words, the V&V method only works if the AGI is constrained to follow the process – and that remains an open problem outside the scope of this proposal.
Some cases may force us to run only partial V&V: These include (a) cases where the machine cannot be separated from the AGI – we may decide not to pursue such tasks at all, and (b) urgent or competitive situations – we may decide to take dangerous, calculated gambles.
The diagram below shows the main steps of the V&V method:
The rest of the proposal is structured as follows:
Chapter 1 describes the V&V method, emphasizes the crucial role of (completely virtual) simulations, and explains how it finds spec bugs and how it handles unknowns.
Chapter 2 explains why I expect the V&V method to be used as we transition to AGI (regardless of alignment considerations): mainly because this is how we already verify complex AI-based systems (e.g. AVs), and I expect ever-improving “AI workers” (and later AGI) to simply slot into this process. It then explains why a human bottleneck is actually desirable, and suggests why alignment researchers should seriously consider the V&V method even if it does not fully “solve alignment”. Finally, it discusses inner alignment.
Chapter 3 positions the V&V method in the context of Scalable Oversight approaches. For instance, it explains how it complements Constitutional AI (mainly by helping it avoid model collapse), and how it complements Iterated Distillation and Amplification (by adding holistic, full-system checking, and an efficient closed-loop mechanism).
Chapter 4 talks about the reward hacking problem. It claims that the V&V method will help solve the simpler “specification gaming” problem, but admits there is currently no solution for the more serious “strategic deception”.
Chapter 5 discusses challenges regarding the method’s implementation. It answers some of them, but admits there is currently no full solution for creating the right AGI incentives. In particular, multi-agent collusion (if not solved by some other means) could be fatal for the whole proposal.
Chapter 6 discusses challenges to the method’s basic assumptions. For instance, it admits that some requests to the AGI do not correspond to building a machine (and thus cannot use this method). It further admits that competitive pressures may sometimes prevent using the method fully (though it claims the method will still help).
Chapter 7 talks about other related ideas in alignment research. For instance, it compares the V&V method to approaches based on formal verification.
Chapter 8 concludes by summarizing what the V&V method offers and what it does not. It then calls for comments and feedback, especially on improving the proposal and on connecting it with other alignment approaches.
Please read the full 26-page paper here.
