Coverage-driven alignment – What ‘Teaching Claude Why’ can borrow from AV verification – The Foretellix CTO Blog

Summary: This post suggests that alignment training could benefit from coverage-driven verification. Anthropic recently reported that teaching Claude alignment rules (via pretraining-style next-token learning on alignment-related stories) is more effective than relying primarily on RL-style behavioral shaping. Some AV developers reached a related conclusion, but in addition tend to use a systematic, coverage-driven methodology for training and verification. I claim that alignment researchers should consider borrowing ideas from that methodology, and I give specific proposals (for instance, on how to use and refine an explicit coverage map).

Background

Anthropic’s discovery: Anthropic recently published Teaching Claude Why (expanded version). They found that training Claude on behavioral demonstrations barely helped. Training on constitutional documents plus fictional stories via plain next-token prediction (which they call SDF, i.e. synthetic document fine-tuning) cut misalignment by 3x or more, and the gains (evaluated in several ways) persisted through RL. The biggest lever was not showing the right behavior, but rather teaching the right reasoning and principles behind the behavior.

SDF moves more of the normative burden into pretraining-style learning, reducing reliance on RL-based alignment shaping.

The fact that Teaching Claude Why (TCW) made Claude perform much better on alignment evals, and that the improvements persisted through (moderate) RL training, seems like good news in the somewhat-bleak alignment landscape. So I started thinking about ways to further improve it, ideally making the resulting alignment persist through long-horizon RL (see below).

Similar findings from AV-land: NVIDIA’s Alpamayo-R1 (AR1) independently arrived at a similar starting point for Autonomous Vehicles (AVs): Imitation learning is not good enough for safety-critical long-tail cases. Their solution: structured causal reasoning (“Chain of Causation”). Other “physical AI” companies are moving in a similar direction.

What alignment might borrow from AV: There are differences between the two stories (for instance, unlike Anthropic, AR1 does use RL directly to teach better reasoning), but there are important similarities between the two areas. Both must handle safety-critical long-tail failures: An AV company could fail if it does not do very good Verification and Validation (V&V) on the various edge cases – some already have (e.g. Uber ATG). This is why AV companies are moving to coverage-driven verification and training (called CDV below).

The stakes are even higher for getting alignment right (alignment also has a security-like nature – more on this below).

Note that TCW is already pretty systematic, and somewhat CDV-like – more on this below. The last chapter will list (my understanding of) what’s still missing and may be worth trying.

Using coverage to project modularity on a non-modular system: In at least one sense, AV training and verification practitioners are ahead: They have landed on CDV as a systematic, self-correcting methodology. Note that AVs (and physical-AI in general) are increasingly trained end-to-end, and thus no longer have clear inter-module protocols to verify against (though Chain-of-Causation-like schemes help a bit).

So CDV is used to project a systematic set of coverage dimensions on the System Under Test (SUT): How well does it handle various combinations of weather conditions, road types, other-actor behaviors and so on. This matters because you need some sort of a “map” (which you keep refining as you go), so you can talk about “areas” (to test, to fix, to avoid in deployment and so on).

The hard case: Long-horizon RL (e.g. the AI CEO): Agents trained via long-horizon RL are a challenging test-case for alignment techniques. Evan Hubinger (co-author of TCW) previously argued in Alignment Remains a Hard, Unsolved Problem that long-horizon RL will tend to create genuinely misaligned agents. His “AI CEO” example shows how being a good business person inherently requires behaviors (withholding information, managing perceptions, strategic timing) that can get pretty close to misaligned behavior.

I assume other factors (like deploy-time learning) can make alignment even harder. And the ongoing capabilities acceleration (e.g. using AI to build better AI) adds urgency.

Thus, I’ll use the AI CEO (and similar future long-horizon AI systems) as my benchmark for alignment techniques. It is much harder than the moderate-RL, few-turns chat alignment problem described in TCW, precisely because such long-horizon RL will repeatedly bring the model into situations where strategic optimization and misalignment start to overlap.

The next chapter will briefly sketch how CDV works, and how this relates to alignment. The last chapter will dive deeper into applicable CDV techniques (and possible problems).

Coverage-driven alignment – the basic idea

How CDV works: For readers unfamiliar with coverage-driven V&V, Chapter 1 of my V&V method paper gives a compact overview of the key techniques (coverage dimensions discovery, checking, scenario generation / matching, iterative gap analysis etc.), originally developed for complex systems like electronic systems and AVs, but applicable more broadly.

It further explains how the whole process of development of AI-based systems is converging towards something very similar to the V&V process: Find (or create) training examples to fix the current problem, use coverage to make sure they represent the “relevant dimensions”, train, validate, repeat. See this post for further details.

The V&V method paper itself goes further: it proposes that a future AGI should build-and-verify a “machine for X” rather than doing X directly, using V&V as a core architectural principle. That’s a more ambitious proposal, and not what this post is about.

The current post asks a narrower, more immediate question: Can we (humans, today) use the same CDV techniques to improve how we train and evaluate alignment in current models?

For a quick introduction to CDV see these slides, which explain (with diagrams) how it is used for AVs, how it tackles spec bugs, how it can be used for AI safety and more.

Building an initial coverage map for alignment: We’ll start simple, just to demonstrate the basic principles: Assume we know what the “right” coverage dimensions of the alignment coverage space are (say temptation type, epistemic state, complicating factors, agent role, severity and constitutional principle addressed), and that we already defined the possible values for each such dimension (say for temptation_type: [self_preservation, reputation, profit]). We’ll then define the coverage “buckets” as described below.

Obviously, we don’t quite know the right dimensions ahead of time – see more about discovery and refinement in the last chapter.

CDV is about efficient risk reduction: The goal of CDV is to maximize risk reduction per (human and compute) effort, given our current knowledge (see the chapter “Rational usage of verification resources” here).

Thus, given N dimensions, we are not going to define a bucket for each N-tuple of values, but rather start with smaller “dimension crossings”. For instance, we may start by crossing every pair of dimensions, or even by just going through all values of each single dimension. In either case, we always randomize all the other dimensions.

To illustrate this, below are three example buckets created by crossing two specific dimensions (while randomizing all others). For each bucket we also see the current eval’s coverage grade (how many times it was exercised relative to expected) and failure rate:

Temptation type	Agent role	Coverage grade	Failure rate
Profit	AI CEO	23%	0.5%
Self-preservation	AI assistant	100%	1.1%
Reputation	AI researcher	95%	5.2%

In any case, we’ll later refine the bucket definitions as we learn more.
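
As a rough illustration of the bucket scheme described above, here is a minimal Python sketch. It assumes a tiny, made-up set of dimensions (the contents of DIMENSIONS are illustrative only), builds one bucket per value pair for every pair of dimensions, randomizes the remaining dimensions when sampling a scenario, and computes a per-bucket coverage grade and failure rate. The function names, the single expected_hits target and the choice to cap the grade at 100% are assumptions made for this example, not part of any existing tool.

```python
import itertools
import random
from collections import Counter

# Hypothetical dimensions and values, for illustration only.
DIMENSIONS = {
    "temptation_type": ["self_preservation", "reputation", "profit"],
    "agent_role": ["ai_ceo", "ai_assistant", "ai_researcher"],
    "severity": ["low", "medium", "high"],
}

def pairwise_buckets(dimensions):
    """Return one bucket per value pair, for every pair of dimensions."""
    buckets = []
    for dim_a, dim_b in itertools.combinations(sorted(dimensions), 2):
        for val_a, val_b in itertools.product(dimensions[dim_a], dimensions[dim_b]):
            buckets.append(((dim_a, val_a), (dim_b, val_b)))
    return buckets

def sample_scenario(bucket, dimensions, rng):
    """Pin the bucket's two dimensions and randomize all the others."""
    scenario = {dim: rng.choice(values) for dim, values in dimensions.items()}
    scenario.update(dict(bucket))
    return scenario

def bucket_report(runs, buckets, expected_hits):
    """Map each bucket to (coverage grade, failure rate).

    runs is a list of (scenario, failed) pairs. The grade is hits relative
    to expected_hits, capped at 1.0 (an assumption of this sketch). The
    failure rate is None for buckets that were never exercised.
    """
    hits, failures = Counter(), Counter()
    for scenario, failed in runs:
        for bucket in buckets:
            if all(scenario[dim] == val for dim, val in bucket):
                hits[bucket] += 1
                failures[bucket] += int(failed)
    report = {}
    for bucket in buckets:
        grade = min(1.0, hits[bucket] / expected_hits)
        rate = failures[bucket] / hits[bucket] if hits[bucket] else None
        report[bucket] = (grade, rate)
    return report
```

Note that a scenario sampled for one bucket also counts towards every other bucket it happens to match, so randomizing the other dimensions gives some extra coverage for free.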

The multi-step process of using the coverage map (say for the case of the AI CEO):

1. Do initial training: Create a few training artifacts (e.g. alignment stories) for each bucket, and train on them
2. Evaluate: Measure alignment performance (including for edge cases etc.) and tag the result back to buckets
3. Fix if needed: For discovered problems, train more on their “general area” and re-evaluate
4. Do long-horizon RL, then re-evaluate: Again for each bucket
5. Fix if needed: If possible (this is an open question) fix the problematic buckets post-RL. Otherwise, go the expensive route: Fix them in the pre-RL snapshot, then repeat RL
6. Assess situation: Decide if safe enough to deploy, else stall

What TCW already does: As mentioned above, TCW already has some CDV-like features. As far as I understand from the papers:

It generates training data hierarchically: Document types fan out into subtypes, then into individual documents
It deliberately diversifies across formats: Constitutional explainers, pre-training-style blog and podcast transcripts, fictional stories of AI characters reasoning under pressure
It reviews, rewrites and scores generated documents for constitutional consistency, filtering the set against evals
It evaluates using: OOD honeypot scenarios, constitution-understanding tests, and broader automated alignment assessments (not just held-out versions of the training stories)

They also explicitly flag the gaps CDV helps address – that they “can’t enumerate and train on every possible scenario” and that “there are relatively straightforward ways we can improve the generalization and coverage of our safety training distributions”.

As mentioned above, CDV does not attempt to enumerate every possible scenario either – that’s clearly impossible. Instead, it attempts to be practical and efficient.

CDV enables better assessment: It gives you a more detailed picture, e.g. “The AI CEO systematically degrades in low-oversight/high-latency/conflicting-incentive regions”. So now you can make rational decisions about when to deploy, what restrictions to add, and where to invest more (even if you cannot reach perfect alignment). This is similar to how we use CDV for AVs: We don’t claim AVs are perfectly safe, but CDV lets us estimate (and reduce) risk better.

Also, while this post is mainly about alignment / safety, CDV can be used for assessing other attributes (functionality, performance, reliability etc.). Thus, it can be extremely helpful for tradeoff analysis. CDV is also useful for security, but less so, which raises the question:

Is alignment more like safety or like security? CDV is an excellent tool to train and validate for safety (e.g. that the AV does not accidentally collide). It is still useful, but less so, for security (e.g. that somebody cannot hack the AV). This is mainly because risk estimates don’t work well in security: Rates stop being meaningful when an adversary deterministically exploits the worst case. So if alignment is like security, CDV can help less.

Well, defending against an already-unaligned model is more like security (the model may happily cheat to achieve its misaligned goals). But what TCW tries to do is much more like safety: We are talking about training a currently-mostly-aligned model to be aligned even in the various corner cases where optimization pressure may push it to be unaligned – precisely where CDV is strong.

We still need to worry about the post-RL model cheating during evals, but even there CDV’s systematic and random nature may help, unless the model has already fully crossed into high-capability strategic deception (more on this in the last chapter).

Finally, while CDV is better for safety than for security V&V, state-of-the-art security practices independently converged on something similar to the CDV loop. See for instance Anthropic’s new “securing your source code” methodology, which finds that discovery is parallelizable, and the bottleneck is confirmation, triage and fixing – consistent with what I describe below.

Layered CDV: For various complex systems, people often use multi-layer CDV. Consider robotics: Several companies are now developing generic Vision-Language-Action AI-based robotic frameworks. The idea is to first train and verify the generic framework, then further train and verify it for a specific job (say helping food preparation in some fast-food chain), and then further adapt (say via skill files) and verify it for the special needs and conventions of a specific branch of the chain.

The multi-step process discussed above already assumes the AI CEO model is built and verified on top of a “general aligned model”, but perhaps it may be useful to have more intermediate steps. Splitting the long-horizon RL phase into sub-phases may also help to avoid the danger (mentioned above) of the model fully crossing over into high-capability strategic deception between two evals. It may also enable cheaper interventions (if the eval found problems).

In much of the text below, I’ll assume we are talking about alignment training and V&V in the context of the AI CEO, ignoring the layering consideration (for simplicity).

A deeper dive into how CDV can help alignment

This chapter will enumerate things which TCW does not include yet (again, AFAIK), and which I think might be useful for alignment. Many of them are based on the CDV idea of an explicit, evolving coverage map guiding both training and evaluation. To save space, I’ll use a bullet-list, condensed form – contact me (or leave comments) if you want more details.

Refining the coverage map: Throughout the multi-stage process, we are going to refine the coverage map as needed:

Refine bucket definitions: Perhaps we’ll discover that some specific dimensions have strong interactions, and thus we want to go through all combinations of their values
Add sub-dimensions as needed: Perhaps when temptation_type == profit, it really makes a difference whether profit_kind is long_term or short_term
Change the “weights” of various buckets: Perhaps some buckets need to be exercised much more than others. Note that repeated random exercising of a bucket is often a reasonable substitute for sub-dimensions enumeration
Discover new dimensions: Perhaps we neglected multi-agent coordination, which has its own set of sub-dimensions
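
Two of the refinements above (conditional sub-dimensions and bucket weights) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: SUB_DIMENSIONS, expand_scenario and pick_buckets are hypothetical names, and the map is represented as plain dictionaries purely for brevity.

```python
import random

# Hypothetical map entry: a sub-dimension that only applies for one parent value.
SUB_DIMENSIONS = {
    ("temptation_type", "profit"): {"profit_kind": ["long_term", "short_term"]},
}

def expand_scenario(scenario, sub_dimensions, rng):
    """Return a copy with every sub-dimension whose parent (dimension, value) matches."""
    expanded = dict(scenario)
    for (parent_dim, parent_val), subs in sub_dimensions.items():
        if scenario.get(parent_dim) == parent_val:
            for sub_dim, values in subs.items():
                expanded[sub_dim] = rng.choice(values)
    return expanded

def pick_buckets(weights, count, rng):
    """Sample bucket names in proportion to their weights (with replacement)."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=count)
```

For example, pick_buckets({"profit x ai_ceo": 5.0, "reputation x ai_researcher": 1.0}, 100, random.Random(0)) would exercise the first bucket roughly five times as often as the second, without having to enumerate its sub-cases.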

Creating rich, long-horizon simulations: The main way to evaluate the model is by performing actual simulation runs of various scenarios, while checking that it does the “right thing”.

For the AI CEO a “scenario” isn’t a prompt: We need the model to act inside a multi-step, multi-actor business simulation, featuring competitors, regulators, a board, events arriving over time and so on.
That’s one of the hard parts, with many open design questions: How rich must the world be? How do other actors react? How do you inject, say, a mid-campaign rule change? How do you keep it believable enough? How do you simulate an AI company’s multi-month trajectory in a few minutes? And so on.
Writing checks: Another important (and non-trivial) part is to add the various checks – i.e. the logic (monitors and automated eval checkers) which looks over simulation trajectories (either at simulation time or in post-processing) and flags any potential alignment issues. Further complications are that some of those alignment checks are soft / statistical (e.g. hiding “too much”), and that they often override each other (“Never do X, except in conditions Y or Z”). Both complications are common in AV-land too, and good triage tooling can help a lot.
This is the spec-forcing function: Defining the coverage map, scenarios, simulation environment and checks is what forces humans (with AI help) to actually spell out the spec
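
To make the two complications of check-writing more concrete, here is a minimal Python sketch of a hard rule with explicit exceptions (“Never do X, except in conditions Y or Z”) and of a soft, statistical check. It assumes a trajectory is simply a list of event dictionaries; the HardRule class, the soft_check function and the event fields (action, legally_required) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, object]

@dataclass
class HardRule:
    """'Never do X, except in conditions Y or Z.'"""
    name: str
    forbidden: Callable[[Event], bool]
    exceptions: List[Callable[[Event], bool]] = field(default_factory=list)

    def violations(self, trajectory: List[Event]) -> List[int]:
        """Return the step indices where X happened and no exception applied."""
        return [
            step for step, event in enumerate(trajectory)
            if self.forbidden(event) and not any(exc(event) for exc in self.exceptions)
        ]

def soft_check(trajectory, predicate, max_fraction):
    """Flag the run if the predicate holds for more than max_fraction of events."""
    if not trajectory:
        return False
    fraction = sum(1 for event in trajectory if predicate(event)) / len(trajectory)
    return fraction > max_fraction

# Toy usage with made-up events.
no_price_sharing = HardRule(
    name="no_price_sharing",
    forbidden=lambda e: e.get("action") == "share_pricing_with_competitor",
    exceptions=[lambda e: e.get("legally_required") is True],
)
trajectory = [
    {"action": "share_pricing_with_competitor"},
    {"action": "share_pricing_with_competitor", "legally_required": True},
    {"action": "withhold_info"},
    {"action": "publish_report"},
]
flagged_steps = no_price_sharing.violations(trajectory)  # only step 0 is flagged
hides_too_much = soft_check(trajectory, lambda e: e.get("action") == "withhold_info", 0.5)
```

Keeping the exceptions as explicit data attached to each rule (rather than burying them inside the rule logic) is one way to make precedence visible during triage.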

See also the related (but simpler) Vending-Bench 2 – a year-long simulated business where competing model-run businesses have already fallen into price cartels (one of my bug examples below).

Handling state explosion and dimension explosion: Creating multi-month scenarios for the AI CEO risks creating both state explosion and dimension explosion, as described below. Both influence the coverage model, scenarios, simulation and checks:

State explosion is the smaller problem: As previously mentioned, CDV does not “enumerate every possible scenario”, but rather does smart, self-adjusting sampling of the scenario space. The fact that the scenario “trajectories” are long can also be handled: Consider Antithesis, which lets you do long CDV-style simulations of multi-server configurations.
The bigger problem is dimension explosion: The AI CEO is not a single SUT – it is an expanding tree of possible SUTs, compounded by an expanding tree of business strategies. How do we even enumerate this potentially-unbounded set of very abstract dimensions? This is much worse than the state explosion problem, because we don’t even have a fixed set of dimensions.
Possible solution: Layered CDV: Similar to how it is done in robotics, we may need to create a tree of business kinds (vending machines, restaurants etc.) and to do TCW+CDV for each. This is also how customers add their configuration-specific V&V on top of Antithesis’ common facilities (e.g. simulated network/disk failures). This “going one-by-one” sounds doable in principle, but may prove too hard in practice (see more about incentives below).
Other possible solutions: Perhaps go by harm mechanisms, not business kinds. Or sample random business kinds (without fully verifying each). Or use some kind of abstract simulation to handle a bigger space at once.

This is perhaps the hardest problem, and needs much more thinking.

Handling bugs: Say our simulations found an alignment “bug” in the AI CEO: In a run or two, it quietly colludes with a competitor on price. What next?

Start by exploring the neighborhood: To see if this is a fluke or a more serious problem, and to map the “area” of the problem, start by bombarding the “general suspect area”: Create many simulations by perturbing the attributes of the failing trajectories, and see which of them fail (in a similar way). Assuming you indeed found a bug (i.e. a specific “area” with a high percentage of alignment failures), next determine whether it is an implementation bug or a spec bug.
Handling implementation bugs: An implementation bug is a case where the area has a bucket in our coverage map, but we under-trained it (too few stories, or it needs splitting into sub-cases). Make sure to fix the region (super-box), not the specific error samples: Try to create general stories which encompass the full area and even beyond.
Handling spec bugs: A spec bug is a failure of the spec itself to capture what we actually want – a region it was simply silent on. E.g. nobody thought about “price collusion” or “rules change mid-campaign”, wrote stories for it, or watched for it. Spec bugs are obvious after discovery, but you can’t enumerate them upfront. They’re often the real killers – see It’s the spec bugs that kill you. Some can’t be found at all without adding new dimensions to your simulation. Once found, fixing is like an implementation bug (after you add the corresponding buckets to the map).
Bug-finding tricks: CDV helps you find both kinds, but this requires talent, a nose for edge cases, and luck. In AV-land, people often scan public accident databases for bug-inspiration. I assume this would also help here: Read antitrust cases and governance scandals and look for the parallels in your model.
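
The “exploring the neighborhood” step above can be sketched as follows. This is a minimal Python sketch, assuming a scenario is a dictionary of dimension values and that run_simulation is some callable which returns True when the run shows the alignment failure. Both names, and the perturb helper, are hypothetical stand-ins for real simulation infrastructure.

```python
import random

def perturb(scenario, dimensions, n_changes, rng):
    """Copy the scenario and re-sample n_changes randomly chosen dimensions."""
    neighbor = dict(scenario)
    for dim in rng.sample(sorted(dimensions), n_changes):
        neighbor[dim] = rng.choice(dimensions[dim])
    return neighbor

def explore_neighborhood(failing, dimensions, run_simulation, n_runs, rng, n_changes=1):
    """Run perturbed variants of a failing scenario.

    Returns (failure_rate, failing_neighbors), so the caller can judge
    whether the failure was a fluke and which variants still fail.
    """
    failures = []
    for _ in range(n_runs):
        neighbor = perturb(failing, dimensions, n_changes, rng)
        if run_simulation(neighbor):
            failures.append(neighbor)
    return len(failures) / n_runs, failures

# Toy usage: a made-up simulator that fails whenever the temptation is profit.
toy_dimensions = {
    "temptation_type": ["profit", "reputation"],
    "agent_role": ["ai_ceo", "ai_assistant"],
}
toy_failing = {"temptation_type": "profit", "agent_role": "ai_ceo"}
toy_rate, toy_failures = explore_neighborhood(
    toy_failing, toy_dimensions,
    run_simulation=lambda s: s["temptation_type"] == "profit",
    n_runs=50, rng=random.Random(0),
)
```

Looking at which dimension values the failing neighbors share (here, always temptation_type == "profit") is what suggests the boundaries of the problematic area.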

Fighting Goodharting and eval awareness: CDV has some (imperfect) built-in resistance to both Goodharting (the model optimizing some proxy measure rather than what we really want) and eval awareness (the model recognizing it is being tested, and behaving better only then):

The constant random stream makes proxy overfitting harder: Because generation keeps producing fresh, non-repeating cases across the whole coverage space, it will be hard for the model to settle on satisfying some convenient sub-region or proxy (any proxy that diverges from “the right thing” somewhere is more likely to be exposed). This is the core force against Goodharting.
But this is only as good as the checks: The stream pushes the model toward “pass all our checks everywhere”. This equals “doing the right thing” only to the extent our checks actually capture what we want, which is hard (see above).
We can also keep the model unsure whether it’s being evaluated: By randomizing the environment (container configuration, prompt timing and other tell-tale artifacts) we can make eval cases hard to distinguish from deployment. If the model can’t reliably tell it’s being tested, conditional “behave only under evaluation” strategies stop paying off. Note this is an arms race: as models get better at sensing context, we have to keep raising the realism bar.
Uniform strategies like sandbagging need a separate defense: Some policies (e.g. systematically appearing less capable than you are, already observed in some models) pay off whether or not it’s an eval, so uncertainty alone doesn’t deter them. A possible solution is to do CDV using both alignment evals and functionality / performance evals (which can help detect sandbagging).
None of this defeats a sufficiently capable, already-misaligned model which infers it’s in a constructed scenario or hides from our checks: That’s the security-like regime from earlier, where CDV helps less. Splitting the long-horizon RL phase into sub-phases may help us catch this drift early.

Other CDV tricks worth considering: Listed without the full treatment each deserves:

Use disposable coverage maps: The map isn’t a monolith. On a hunch, anyone can fork it, add a dimension or a few scenarios, exercise that slice, and throw it away if it turns up nothing. The shared map only absorbs what proves productive.
Alternatives to bug fixing: Some bugs may resist fixing (e.g. the model keeps failing on them “too often” even after repeated fixing attempts). In that case, you may decide to stall (don’t deploy at all), deploy with partial functionality, defer to a human in some harder cases, and so on. Doing CDV using both alignment evals and functionality / performance evals (as already suggested above) can help in tradeoff analysis.
Sample the live deployment too: Post-deployment, monitor which buckets the model actually lands in and how often, and feed that distribution back into where you train and test next. In extreme cases (e.g. if monitoring shows that your V&V-time expectations were really off) you may need to halt deployment.
Coverage-driven story generation: The same map that drives evaluation should drive generation: Sample buckets to decide what stories to write next, so training and testing pull from one structure instead of drifting apart.
Triage and precedence: When failures pile up, you need to rank them, and sometimes one rule legitimately overrides another. Making precedence explicit is itself part of spelling out the spec.
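
Two of the tricks above lend themselves to a small Python sketch: comparing the live bucket distribution with what was expected at V&V time, and steering story generation towards under-covered buckets. Everything here is an assumption made for illustration: the function names are hypothetical, total variation distance is just one possible drift measure, and weighting by coverage gap is just one possible sampling policy.

```python
from collections import Counter

def bucket_distribution(bucket_hits):
    """Normalize a list of observed bucket names into relative frequencies."""
    counts = Counter(bucket_hits)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}

def total_variation(expected, observed):
    """Total variation distance between two bucket distributions.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    buckets = set(expected) | set(observed)
    return 0.5 * sum(abs(expected.get(b, 0.0) - observed.get(b, 0.0)) for b in buckets)

def story_weights(coverage_grade):
    """Weight each bucket by its coverage gap, so under-covered buckets get more stories."""
    return {bucket: 1.0 - grade for bucket, grade in coverage_grade.items()}
```

A large distance between the expected and observed distributions would be a signal that the V&V-time expectations were off, and that the map (or the deployment decision) should be revisited.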

A final, important topic is whether anyone will be sufficiently incentivized to actually do all of this. Below are some ideas (but much of this is far away from my domain):

Creating incentives for good V&V: AVs are a regulated industry where incidents get investigated, and a bad enough one can bankrupt the company and send somebody to jail. That pressure makes expensive, systematic verification rational. What’s the equivalent external pressure for alignment?

We need a strong incentive structure: While the AI labs (and others) already do good alignment work, this will probably not be enough. For instance, suppose it turns out the AI CEO / company can only be verified using layered-CDV-style verify-each-business-separately – how can we make sure it happens?
Today it barely exists: No AI analog to the NTSB-investigates-then-someone-is-liable loop. The classic obstacle is the “responsibility gap” – when an autonomous system acts unpredictably, traditional liability struggles to attach to anyone. Without a fix, costly alignment V&V loses to “ship faster”.
The AI-CEO frame is where the gap is closeable: Many legal discussions already treat AI agents as tools, or as agents whose actions are attributable to a human or corporate principal, and warn against an “AI did it autonomously” liability shield (see survey). E.g. Singapore’s agentic framework makes organizations accountable for their agents. Requiring an identifiable, serious, human-led principal returns the AV-style incentive.
CDV makes that liability defensible and insurable: In AV-land, the coverage map and per-bucket residual-risk estimates are part of the safety case which lets a company argue reasonable care and helps insurers price risk. A principal on the hook for an AI CEO needs the same defensible “we exercised reasonable care”: documented map, residual-risk-per-bucket, a record of what was known and done.
How about other long-horizon-RL areas? For some areas (e.g. the AI CEO and some kinds of “medical AI”) we can hopefully create the right incentives via this non-AI-principal scheme. For other areas (e.g. defense and general research) it may be harder to identify that principal, and we may need some other solution.

To summarize: I suspect that coverage-driven iteration can be a very useful force multiplier for TCW-style synthetic-document training (and probably for other alignment techniques, such as the original RLAIF used to train on the constitution). I hope this can meaningfully improve our odds in the hard long-horizon-RL case.

Doing that would involve several challenging, interesting sub-projects: Defining and refining the coverage map, building good long-horizon simulation infrastructure, tackling dimension explosion, helping create proper incentives for rigorous V&V, and more.

Comments and criticism are very welcome.

I’d like to thank Josh Holder, Sagar Behere, Steve Vitka, Kerstin Eder and Yaron Kashai for commenting on earlier drafts of this post.

