Anthropic Just Showed That One AI Can't Check Its Own Work. Here's What They Missed.

What a multi-agent coding experiment reveals about the limits of AI self-evaluation — and the one architectural choice that could fix it.


You ask Claude to build something — it writes the code, runs it, looks at the result, and tells you it's good. You trust it, because it's Claude, one of the best models available. But what if the model judging the work has the same blind spots as the model that produced it, and is confidently approving output that a different system would immediately flag as broken?

On March 24, 2026, Anthropic's engineering team published a report describing exactly this problem — not as a research paper, but as a practical engineering document summarizing months of work on getting Claude to build and evaluate software autonomously. Their conclusion is stated plainly: when asked to evaluate work they've produced, AI agents tend to confidently praise it, even when the quality is obviously mediocre to a human observer. The problem is particularly severe for subjective tasks, where there is no binary check equivalent to a verifiable software test. In effect, the company that builds Claude has publicly acknowledged that Claude cannot reliably judge its own output.

What they built

Their solution was a three-agent architecture inspired by Generative Adversarial Networks: a planner expanded a short user prompt into a full product specification, a generator built the application one feature at a time, and an evaluator — the key innovation — graded the output against structured criteria after each sprint. The evaluator wasn't just reading code: Anthropic gave it access to Playwright, a browser automation tool, so it could navigate the running application like a real user — clicking buttons, testing workflows, filing bugs with specific line numbers, and screenshotting pages for visual assessment of the frontend.
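The control flow of such a harness can be sketched in a few lines. Everything below is a hypothetical stand-in: plan_spec, build_feature, and evaluate_build are stubbed illustrations of the three roles, not Anthropic's actual harness API.

```python
def plan_spec(prompt: str) -> list[str]:
    # Planner: expand a short user prompt into a feature list (stubbed).
    return [f"{prompt}: feature {i}" for i in range(1, 4)]

def build_feature(feature: str) -> str:
    # Generator: produce an artifact for one feature (stubbed).
    return f"code for {feature}"

def evaluate_build(artifact: str, criteria: list[str]) -> list[str]:
    # Evaluator: grade against structured criteria, return unmet ones as bugs (stubbed).
    return [c for c in criteria if c not in artifact]

def run_harness(prompt: str, max_rounds: int = 3) -> dict:
    # Plan once, then loop generate -> evaluate -> revise per feature,
    # stopping early when the evaluator files no bugs.
    report = {}
    for feature in plan_spec(prompt):
        artifact = build_feature(feature)
        for _ in range(max_rounds):
            bugs = evaluate_build(artifact, [feature])
            if not bugs:
                break
            artifact += " + fix: " + "; ".join(bugs)  # generator revises
        report[feature] = artifact
    return report
```

The structure, not the stubs, is the point: the evaluator sits in the loop as a gate, and the generator only moves on once the gate passes.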

The gap between a solo agent and this system was dramatic. Given the same prompt — build a retro game maker — a solo Claude agent spent 20 minutes and $9 on an application that looked functional at first glance but turned out to be completely broken when you actually tried to play the game, with the core feature non-functional and no indication of where the problem was. The three-agent harness spent 6 hours and $200 but produced a 16-feature application with working editors, a sprite animation system, and a playable game; the evaluator caught issues across 27 test criteria for the level editor alone, identifying bugs down to function names and line numbers. Anthropic later simplified the harness and upgraded to Opus 4.6: the streamlined version built a browser-based music production program — arrangement view, mixer, transport, and an integrated AI composition tool — in about 4 hours for $124.

But the team was candid about the limitations of this approach. The evaluator started out too lenient: it would identify real issues, then talk itself into deciding they weren't a big deal, and approve the work. Calibrating it required multiple rounds of manual tuning — reading the evaluator's logs, finding where its judgment diverged from a human's, updating the prompt — and even after that, small layout issues, unintuitive interactions, and bugs in deeper features still slipped through. The report describes this as remaining headroom rather than a solved problem.

The blind spot

Here is the part that Anthropic's report doesn't address: their three-agent architecture separates the roles — one agent builds, another judges — and this is a genuine improvement. But both agents are Claude: different instances, different prompts, the same underlying architecture.

This matters more than it might seem, because in 2025 researchers led by Gao identified what they called "hallucination neurons" — specific neurons within a model's architecture that are responsible for generating incorrect content — and showed that these neurons are architecture-specific. A model doesn't hallucinate randomly but in patterns determined by its structure, its training data, and its optimization process, and two instances of the same model share the same patterns. Gao's research doesn't directly address evaluation contexts, but the implication is worth taking seriously: if the neurons that produce errors are architecture-specific, the neurons that fail to detect those errors during review likely are too — this is an interpretive step, not a proven conclusion, but it's one that Anthropic's own results seem to support.

It's true that different prompts and tools shift the evaluator's behavior: Anthropic's evaluator has Playwright access, structured criteria, and an explicitly adversarial task, meaning it is not doing the same thing as the generator. But these shifts happen within the same representational space — the evaluator can be more thorough, more skeptical, more structured, but it cannot represent information that its architecture cannot encode. If the generator produces a subtle error that aligns with how Claude processes information — an error that feels correct to Claude's architecture — the evaluator is unlikely to catch it, not because it's poorly prompted, but because it operates within the same representational boundaries.

An analogy: two students prepare for an exam using the same textbook, which has a subtle error on page 214 — a formula with a wrong sign. Both learn the wrong formula, and when one checks the other's work, the reviewer reads the answer, compares it against what they know, and approves it. The error is invisible to both of them, because it is correct within their shared frame of reference.

You need a third student who studied from a different textbook.

Anthropic's own results point in this direction: after multiple rounds of manual calibration, their evaluator still missed bugs in deeply nested features, unintuitive interactions, and issues that a human observer noticed immediately. The report frames this as room for further tuning, but if the limitation is architectural rather than prompt-level, no amount of prompt tuning will close the gap — you can make the evaluator stricter, but you cannot make it see what its architecture cannot represent.

Separating who does the work from who judges it is a necessary step, but separating the architecture of the worker from the architecture of the judge — that's where the real error reduction happens. Other approaches also reduce errors — diverse prompting within a single architecture, formal verification where applicable, human-in-the-loop review — and the argument is not that architectural diversity is the only lever, but that it adds a layer of error reduction that these methods, on their own, cannot replicate.

The logic of decorrelation

The intuition from the textbook analogy translates directly into probabilistic terms: when a generator makes an error and an evaluator reviews the output, the chance of catching the error is determined by how correlated their failure modes are. If both share the same architecture, the probability of missing the error is high, because the same structural features that produced it also make it look plausible during review; if the evaluator runs on a different architecture, that conditional probability drops, and the error that looked plausible to one system looks suspicious to another.
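That relationship can be made concrete with a toy model. The numbers below are illustrative assumptions, not measurements: q is the chance a single evaluator misses a given class of error, and rho is the correlation between two evaluators' miss events, with rho near 1 standing in for "same architecture" and rho near 0 for fully decorrelated ones.

```python
def p_both_miss(q: float, rho: float) -> float:
    # P(A and B) for two correlated Bernoulli(q) miss events:
    # independent term q^2 plus a correlation penalty rho * q * (1 - q).
    return q * q + rho * q * (1 - q)

# Illustrative assumption: each evaluator alone misses 30% of a class of errors.
q = 0.30
print(p_both_miss(q, rho=1.0))  # perfectly correlated: 0.30 (second check adds nothing)
print(p_both_miss(q, rho=0.5))  # partially correlated: 0.195
print(p_both_miss(q, rho=0.0))  # independent: 0.09
```

At rho = 1 the second check adds nothing; at rho = 0 the undetected-error rate drops from 30% to 9%. How far real cross-vendor pairs sit from rho = 1 is an empirical question, but the direction of the effect is fixed by the math.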

This is supported empirically. Research on hallucination neurons (Gao et al., 2025) established that the specific neurons responsible for generating incorrect content vary between architectures — models produce not just different outputs but different kinds of errors, and an error that aligns with one architecture's internal representations will often conflict with another's. Decorrelation is not a design preference but a measurable property of distinct model architectures.

CooperBench, a 2026 study from Stanford and SAP, tested what happens when LLM agents work together on complex tasks and found that cooperation between agents led to a 50% drop in task success compared to solo performance. The root causes — repeated errors, unquestioned assumptions passed between agents, hallucinations amplified through communication — are exactly the failure modes you'd expect when systems reinforce each other's outputs rather than challenge them; adversarial review, where one agent's job is to challenge rather than assist, reversed the effect.

ChainPoll, a 2023 framework for hallucination detection, showed that multi-step verification within a single model achieves an AUROC of 0.781 — significantly better than naive single-pass checks but still leaving meaningful room for improvement. That 0.781 represents the performance boundary of single-model multi-step verification — a limit that may reflect architectural constraints, though ChainPoll's authors frame it as a baseline rather than a proven ceiling. Cross-architecture verification pushes against that boundary by introducing error patterns that the original model wouldn't generate and therefore wouldn't overlook.
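For readers unfamiliar with the metric: AUROC is the probability that a detector scores a randomly chosen true hallucination above a randomly chosen clean output, so 0.5 is coin-flipping and 1.0 is perfect separation. A minimal rank-based implementation of the metric itself (not ChainPoll's code):

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """Probability a random positive outranks a random negative; ties count half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: label 1 = hallucination, score = detector's hallucination probability.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auroc(labels, scores))  # 0.75: better than chance, short of perfect
```

A score of 0.781 means that roughly 22% of the time a clean output still outranks a real hallucination — the headroom the article is pointing at.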

It is important, however, to see the limitations of this argument: it is empirical rather than formally proven, and it has a real weakness — all frontier models are trained on substantially overlapping data (Common Crawl, Wikipedia, GitHub, books), and if a systematic error originates from the training data rather than from the architecture, models from different providers may reproduce the same mistake. Architectural diversity reduces correlated errors arising from model structure but does not eliminate correlated errors from shared training sources. That said, the direction is consistent: every published dataset we've seen shows that architectural diversity in evaluation reduces undetected errors even with overlapping training data — the question is not whether decorrelation helps but how much, and the answer depends on how different the architectures actually are.

The cost of catching errors

Anthropic's full harness run cost $124 and took nearly four hours; isolating just the QA component — the evaluator's three rounds of testing and feedback — still accounts for roughly $10 in token costs. This is the price of having one model generate hundreds of thousands of tokens of code while another navigates that code through a browser, screenshots it, analyzes it, and writes detailed bug reports. The cost structure is a consequence of the task, not the technique: code generation is inherently token-heavy because the generator writes, rewrites, and refactors while the evaluator interacts with a live application and produces verbose assessments at each step.

Verification works fundamentally differently. If your goal is not to build software but to check whether a claim is accurate, whether an analysis holds up, whether a recommendation is sound, the token economics change completely: structured verification produces analysis rather than artifacts, and its output is a set of identified disagreements, severity assessments, and a recommendation — not a codebase. The difference in token volume between generating a full-stack application and producing a structured verification report is orders of magnitude, which means that adversarial cross-model verification applied to analytical tasks consumes a small fraction of the tokens Anthropic's coding harness requires while still using the same frontier models. The cost advantage comes from what the system produces, not from using cheaper models or cutting corners on depth.
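The arithmetic is simple enough to sketch. All numbers below are illustrative assumptions — the token counts and per-token prices are placeholders, not figures from Anthropic's report or any provider's price list.

```python
# Assumed frontier-model rates in $ per token (placeholders, not real pricing).
PRICE_IN = 3.00 / 1_000_000
PRICE_OUT = 15.00 / 1_000_000

def run_cost(tokens_in: int, tokens_out: int) -> float:
    # Total cost of a run given its input and output token volumes.
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

# Coding harness: generator writes and rewrites a codebase, evaluator drives
# a browser and files verbose bug reports over many rounds (assumed volumes).
coding = run_cost(tokens_in=5_000_000, tokens_out=2_000_000)

# Verification report: same frontier models, but the output is a short
# structured list of disagreements and a recommendation (assumed volumes).
verification = run_cost(tokens_in=60_000, tokens_out=8_000)
```

With these placeholder numbers the coding run costs about $45 and the verification report about $0.30 — a roughly 150x gap that comes entirely from output volume, not from using cheaper models.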

To be clear: coding and verification are different tasks, and Anthropic's harness solves a different problem than cross-model verification does. But the evaluator component specifically — the adversarial review that catches errors — is doing comparable work in both cases, and the cost gap between a single-architecture evaluation loop running for hours and a cross-architecture verification pipeline running for minutes is not marginal but structural.

Three things this approach can't catch

Anthropic's generator-evaluator architecture is a meaningful step forward from single-agent coding, but there are categories of errors that same-architecture evaluation structurally cannot address, no matter how well the evaluator is prompted.

Checking the answer vs. checking the question

Anthropic's evaluator tests whether code works — it clicks through the interface, verifies API endpoints, checks database states — but it does not question whether the application should work this way in the first place. This might sound philosophical, but in practice it is critically important: when models from different architectures analyze the same problem, they don't just produce different answers — they challenge different assumptions. In our own testing, we saw a case where a question included the metric "5% risk increase" and one model analyzed the implications of that number while another asked: relative to what baseline? That single question completely reframed the analysis, because the answer to the original question was technically correct but entirely useless without knowing what the 5% was relative to. A same-architecture evaluator is unlikely to challenge a premise that a same-architecture generator accepted, because they operate within the same representational space, and premises that feel natural to one instance tend to feel natural to another.

Evaluator drift

Anthropic describes a pattern that anyone who has built evaluation systems will recognize: the evaluator starts lenient, identifies real issues, then rationalizes them away and approves the work. The team fixed this through manual calibration — reading logs, finding disagreements with their own judgment, updating prompts — and it took several rounds. But the report doesn't address what happens after calibration: does the evaluator stay calibrated over time, or does it gradually drift back toward leniency as the generator's patterns become familiar? Without a way to measure this, you have no early warning — just a system that feels rigorous but may be slowly losing its adversarial edge.
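One way to get that early warning can be sketched directly. The monitor below is hypothetical — nothing like it appears in Anthropic's report — but the idea is simple: log every verdict, track the approval rate over a sliding window, and alert when it creeps above the rate you measured right after calibration.

```python
from collections import deque

class ApprovalDriftMonitor:
    """Hypothetical leniency-drift detector for an evaluator agent."""

    def __init__(self, baseline: float, window: int = 50, tolerance: float = 0.10):
        self.baseline = baseline           # approval rate right after calibration
        self.window = deque(maxlen=window) # most recent verdicts (1 = approved)
        self.tolerance = tolerance         # allowed upward drift before alerting

    def record(self, approved: bool) -> bool:
        """Record one verdict; return True once drift toward leniency is detected."""
        self.window.append(1 if approved else 0)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet to judge drift
        rate = sum(self.window) / len(self.window)
        return rate > self.baseline + self.tolerance
```

The thresholds here are illustrative; the point is that drift is only visible if you measure it, and a calibrated baseline gives you something to measure against.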

Cross-architecture verification is more resistant to this kind of drift, because models from different providers process information differently and that difference persists regardless of how many rounds they run — the adversarial tension is architectural, not prompt-dependent. There is, however, a fair objection: all frontier models are trained through similar reinforcement processes that reward helpfulness and agreement, and this shared incentive could create a form of drift that crosses architectural boundaries — models converging toward agreement not because they share an architecture but because they share a training objective. Adversarial prompting partially counteracts this tendency, but whether architectural diversity fully overcomes RLHF-induced agreeableness is an empirical question, not a settled one.

Vendor lock-in as a failure mode

Anthropic's harness runs entirely on Claude, and if Claude has a systematic blind spot in a particular domain — a class of errors it consistently generates and consistently fails to catch — the evaluator won't find it, because the generator and evaluator will agree that everything looks fine: within their shared architecture, it does. No vendor will build a product that recommends a competitor to check their own model's work — this is not a criticism but a structural reality of the market, in which Anthropic builds tools for Claude, OpenAI for GPT, Google for Gemini, and each optimizes within their own ecosystem.

Vendor-agnostic verification — where models from different providers check each other — eliminates this single point of failure, because if one provider's models share a systematic error, models from a different provider are likely to flag it thanks to their architectures producing and recognizing different error patterns. We wrote about why this matters even more in a post-distillation world in an earlier article: when models may contain capabilities extracted from other systems, checking within a single vendor's ecosystem provides even less assurance than it used to.

This is not about code

Anthropic's report focuses on software engineering, and in that domain verification has a significant advantage: you can run the code, a function either returns the right value or it doesn't, a button either navigates to the right page or it doesn't — there's a ground truth you can test against. Most of the decisions people use AI for don't have that luxury: when you ask a model to analyze a contract, there's no compiler that flags an overlooked liability clause; when you ask for a summary of research, there's no runtime that catches a misrepresented finding; when you ask for strategic advice, there's no test suite that verifies the reasoning. Anthropic themselves noted that self-evaluation is hardest for subjective tasks where there is no binary check — and then built their solution for the one domain where binary checks exist.

Outside of code, the self-evaluation problem doesn't get smaller — it gets larger, because in analytical and factual tasks the only way to catch a reasoning error is to show the reasoning to a different reasoner, and the argument for architectural diversity becomes stronger, not weaker: without automated tests to serve as a backstop, the evaluator's judgment is the last line of defense, and if that judgment shares the same blind spots as the generator's, the error passes through unchallenged.

This connects to a broader shift we've written about before: in a world where models may contain capabilities extracted from other systems and where you can't be fully certain what went into building the model you're using, relying on a single architecture to both produce and verify outputs is not caution but a single point of failure dressed up as quality assurance.

References

Anthropic Engineering (2026). "Harness Design for Long-Running Apps." anthropic.com/engineering.

Gao et al. (2025). "Hallucination Neurons in Large Language Models." arXiv:2512.01797.

CooperBench (2026). Stanford & SAP. "Cooperative Multi-Agent Evaluation." arXiv:2601.13295.

Friel, R. & Sanyal, A. (2023). "ChainPoll: A High Efficacy Method for LLM Hallucination Detection." arXiv:2310.18344.


We're applying this research

CrossCheck AI brings cross-model verification to everyday AI use — automatically, in the background. Currently in closed beta.
