
One LLM Isn't Enough: The Data

Chris Jones · @leek
Apr 24, 2026 · 9 min read · 1,646 words

A few weeks ago I wrote One LLM Isn’t Enough — the argument that running every commit through multiple models catches bugs no single model sees. It was based on gut feel after hundreds of commits. Several people asked the obvious next question: where’s the data?

Fair. I pulled it.

Across 217 commits that triggered an AI review issue, 1,514 findings were produced. Two models (Claude Opus 4.6 and GPT-5.4) reviewed every commit. For 91 of those commits, Gemini 3.1 Pro was in the rotation too. Here’s what the logs actually say.

The original claim held up. Sort of.

“When all three models independently flag the same line, it’s real. Every time.”

In the 3-model era (the 91 commits where Gemini also ran), all three converged on the same finding 15 times out of 716 (2.1%). Every one of those 15 landed in a commit I went on to fix. The original claim is directionally correct, but three-way agreement is rare. You don’t get it on much.

The more interesting stat is what happens when the models don’t agree. In that same 3-model era, 87.6% of findings were flagged by exactly one model. Not two. One.

That’s not noise. That’s complementary coverage doing 9/10 of the work.

Who finds what

| Model | Total findings | Critical findings |
| --- | --- | --- |
| Claude | 904 | 131 |
| OpenAI | 521 | 227 |
| Gemini* | 245 | 115 |

*Gemini only ran on 91 of 217 commits.

Claude is the loudest. It raises roughly 60% of all findings (904 of 1,514), but most of them are warnings or info: stylistic concerns, defense-in-depth nits, “consider this edge case.” Its critical count is low relative to its volume.

Of the two models that reviewed every commit, OpenAI is far more aggressive with the critical label. Roughly 44% of its findings get escalated to critical, vs ~15% for Claude. When OpenAI says something is critical, it’s usually a real runtime or security concern. When it cries wolf, it cries loud.

Gemini, in the short window it ran, was a surprise. It contributed only 34% of findings by volume in the 3-model era, but of its 187 solo findings (flagged by nothing else), 93 were critical. That was more unique-critical contribution than Claude and OpenAI combined in that same era (30 + 48 = 78).

Quick sanity check: that’s not “Gemini is smarter than the others.” It’s “Gemini was wrong about different things.” Which is exactly what you want from a review team.

Different models fail differently. The whole point of running more than one is the orthogonality of their mistakes.
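
The escalation rates quoted above can be recomputed from the table. A minimal sketch, assuming the table's totals are the only input; the dictionary layout and the `critical_rate` helper are illustrative names, not part of the review pipeline.

```python
# Per-model totals copied from the table above: (total findings, critical findings).
totals = {
    "Claude": (904, 131),
    "OpenAI": (521, 227),
    "Gemini": (245, 115),
}


def critical_rate(model: str) -> float:
    """Fraction of a model's findings that it labelled critical."""
    total, critical = totals[model]
    return critical / total


for model in totals:
    print(f"{model}: {critical_rate(model):.1%} of findings labelled critical")
```

Gemini's rate covers only the 91 commits it reviewed, so it is not directly comparable with the other two.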

The pair-wise agreement table is the most useful artifact

Here’s the 3-model era broken down by agreement pattern:

| Who flagged it | Findings | Critical |
| --- | --- | --- |
| All three | 15 | 5 |
| Claude + OpenAI only | 31 | 2 |
| Claude + Gemini only | 25 | 5 |
| OpenAI + Gemini only | 18 | 12 |
| Claude alone | 299 | 30 |
| Gemini alone | 187 | 93 |
| OpenAI alone | 141 | 48 |

Look at the OpenAI+Gemini pair. Those two disagreed with Claude 18 times — but 12 of those 18 were critical. That’s a 67% critical rate on a two-model pair that explicitly excludes Claude. A pattern I wasn’t expecting: on security-adjacent backend logic (queue dedup, duplicate-row races, encrypted settings handling), OpenAI and Gemini frequently aligned against Claude. Those were almost all real bugs.

Claude+OpenAI pair-only? 31 findings, only 2 critical. The agreement pattern that sounds strongest (the two “name-brand” models agreeing) turned out to be the weakest signal in the dataset.

This broke my prior. I assumed pair-agreement scaled with reliability. It doesn’t. Which models agree matters more than how many.
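
To make the comparison concrete, here is a small sketch that derives the solo share, the three-way share, and the critical rate of each agreement pattern from the table above. The data structure and variable names are assumptions made for illustration; only the counts come from the post.

```python
# 3-model-era agreement patterns from the table above: (findings, critical).
patterns = {
    ("claude", "openai", "gemini"): (15, 5),
    ("claude", "openai"): (31, 2),
    ("claude", "gemini"): (25, 5),
    ("openai", "gemini"): (18, 12),
    ("claude",): (299, 30),
    ("gemini",): (187, 93),
    ("openai",): (141, 48),
}

total = sum(n for n, _ in patterns.values())
solo = sum(n for models, (n, _) in patterns.items() if len(models) == 1)
three_way = patterns[("claude", "openai", "gemini")][0]

solo_share = solo / total            # share of findings raised by exactly one model
three_way_share = three_way / total  # share of findings raised by all three

# Critical rate per agreement pattern, highest first.
critical_rates = sorted(
    ((crit / n, models) for models, (n, crit) in patterns.items()),
    reverse=True,
)
for rate, models in critical_rates:
    print(f"{' + '.join(models):<28} {rate:.0%}")
```

Sorting by critical rate puts the OpenAI + Gemini pair at the top and the Claude + OpenAI pair at the bottom, which is the ordering the section describes.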

Dropping Gemini changed the error rate

Gemini was disabled mid-stream when I hit quota ceilings. Comparing FP-dominant issues between eras:

| Era | Total issues | FP-dominant | Rate |
| --- | --- | --- | --- |
| 3-model | 91 | 19 | 20.9% |
| 2-model | 126 | 36 | 28.6% |

(FP-dominant is derived from an automated verification step that re-reads each issue against the current code and classifies the findings — more on that below.)

Gemini’s removal correlates with a roughly 8-percentage-point increase in FP-dominant issues (20.9% to 28.6%). The delta is real but modest. Commit subject matter also shifted between eras (different files, different risk profiles), so this isn’t a clean A/B. Still, the direction is consistent with the earlier point: a third independent reviewer catches things the other two miss and offsets things the other two over-call.
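
The rates and the gap between eras follow directly from the issue counts. A minimal sketch, assuming only the four counts in the table; `fp_rate` and `delta_points` are illustrative names.

```python
# (total issues, FP-dominant issues) per era, copied from the table above.
eras = {
    "3-model": (91, 19),
    "2-model": (126, 36),
}


def fp_rate(era: str) -> float:
    """Share of an era's issues that were classified FP-dominant."""
    total, fp_dominant = eras[era]
    return fp_dominant / total


# Difference in percentage points between the two eras.
delta_points = (fp_rate("2-model") - fp_rate("3-model")) * 100

for era in eras:
    print(f"{era}: {fp_rate(era):.1%} FP-dominant")
print(f"delta: {delta_points:.1f} percentage points")
```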

The verification step is where the real signal lives

Every AI-review issue gets a second pass: a verification agent re-reads each finding against the current code and writes a comment classifying what it found — “fixed in commit X,” “verified false positive, the reviewer misunderstood Y,” or “real but not code-fixable.” That second pass is how I get from “three models flagged something” to “here’s which of those were real.”

Running verification across all 217 issues is what made this retrospective possible. It’s also where each model’s signature failure mode shows up most clearly (see next section). Without an automated verification layer, the best I could tell you is “I closed the issue” — which, as the table above hints, conflates “fixed a real bug” with “dismissed a confident hallucination.”

Consensus is a weaker signal than I thought

The verification comments attribute findings to specific models: “(openai, critical) — verified real” or “(gemini) — flat-out wrong. The reviewer hallucinated.” Aggregating those attributions, each model has a distinct failure mode:

  • Gemini: confidently wrong about framework APIs that changed between versions. Cited methods that don’t exist (getDarkBrandLogo), claimed imports were missing when they were on line 5, referenced v3/v4 Filament semantics for a v5 codebase. When Gemini hallucinates, it hallucinates with authority.
  • OpenAI: speculative framework internals. “Relation manager doesn’t receive owner record” — disproven by running a single tinker check. Reads the diff too literally; infers problems from code it didn’t actually trace.
  • Claude: hypothetical edge cases, especially nullability concerns on values that are structurally non-null. Also notable for self-retraction — Claude will sometimes flag something and then, within the same finding body, walk it back (“Disregard” / “not a bug per se”). The other two models never do this.

The pattern I didn’t predict: each model has a signature kind of wrongness. If you review enough of these, you start recognizing them.
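
One way to aggregate attributions like these is to parse the model tag out of each verdict line and tally the outcome per model. The sketch below is hypothetical throughout: the sample lines are modelled on the two examples quoted above (the third is invented for illustration), and the regular expression and keyword buckets are guesses about a format the post does not fully specify. The sample lines use a plain hyphen as the separator; the pattern accepts either a hyphen or an em dash.

```python
import re
from collections import Counter

# Hypothetical verdict lines; the real verification comments may look different.
comment_lines = [
    "(openai, critical) - verified real",
    "(gemini) - flat-out wrong. The reviewer hallucinated.",
    "(claude, warning) - verified false positive, value is structurally non-null",
]

# "(model)" or "(model, severity)", a separator, then the free-text verdict.
ATTRIBUTION = re.compile(
    r"\((?P<model>claude|openai|gemini)(?:,\s*(?P<severity>\w+))?\)"
    r"\s*[-\u2014]\s*(?P<verdict>.+)"
)


def classify(verdict: str) -> str:
    """Bucket a free-text verdict; the keyword lists are illustrative guesses."""
    text = verdict.lower()
    if "false positive" in text or "wrong" in text or "hallucinat" in text:
        return "false_positive"
    if "real" in text or "fixed" in text:
        return "real"
    return "unclear"


tally = Counter()
for line in comment_lines:
    match = ATTRIBUTION.search(line)
    if match:
        tally[(match["model"], classify(match["verdict"]))] += 1

for (model, outcome), count in sorted(tally.items()):
    print(f"{model:<8} {outcome:<15} {count}")
```

Keyword matching is brittle on free text; if the verification agent can emit a structured verdict field, tallying that directly would be more reliable.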

The real surprise, though, was the consensus data. In the 2-model era, critical findings flagged by both Claude and OpenAI (the “consensus signal” I leaned on in the original post) were dismissed as false positives 16.7% of the time. That’s higher than the rate for solo findings (12.9%).

Both models agreed. Both were wrong. More often than when only one spoke up.

Looking at the specific cases, the pattern is clear: two models trained on overlapping corpora tend to misread the same kinds of things. Greenfield scaffolding code that looks unwired. Driver stubs that intentionally throw. Framework idioms they both over-generalize (“morph columns need explicit length,” “::class returns the interface, not concrete” — both wrong, and both models agreed both times).

Consensus only protects you against errors the models make independently. When they share a blind spot, consensus makes the blind spot look like a confirmed bug.

This is the strongest argument yet for a third model. Not because three is magic, but because the probability of three independent systems sharing the same hallucination is meaningfully lower than two.
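
The arithmetic behind that claim is simple under an idealized assumption: if each reviewer raises a given false positive independently with probability p, the chance that all of them raise it shrinks geometrically with the number of reviewers. The value of p below is illustrative only and was not measured from the logs.

```python
def shared_false_positive(p: float, reviewers: int) -> float:
    """Chance that every reviewer raises the same false positive.

    Assumes each reviewer errs independently with probability p, which is
    the idealized case; shared blind spots violate this assumption.
    """
    return p ** reviewers


p = 0.10  # illustrative per-model false-positive probability, not measured
two = shared_false_positive(p, 2)
three = shared_false_positive(p, 3)

print(f"two reviewers:   {two:.3%}")
print(f"three reviewers: {three:.3%}")
```

The 2-model consensus numbers above suggest Claude and OpenAI are not fully independent, so the real gain depends on how different the third model's blind spots are.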

What actually gets fixed

Claude almost never flags a sole critical. In the 2-model era, it flagged a sole critical on 7 issues out of 126 — every one confirmed real on verification. Low-recall, high-precision.

OpenAI is the opposite: high-recall, tolerable-precision. It flags five criticals on a commit; four are real, one turns out to be a misread of the diff. Most “verified false positive” verdicts trace back to an OpenAI finding.

A particularly sharp example from issue #279, an SMS conversation refactor: Gemini flagged a “data leakage on null IDs” critical. Verified real: mounting SmsConversation without leadProfileId or clientProfileId was returning every SMS message for the entire company. Cross-tenant leak. Fixed with early-return guards and a regression test.

Same commit, same diff: OpenAI flagged a “relation manager doesn’t receive owner record” critical. Verified false positive — a tinker session confirmed EmailRelationManager::getOwnerRecord()->id === $lead->id. Filament’s HasRelationManagers passes ownerRecord exactly as expected. OpenAI was speculating about framework internals it hadn’t actually traced.

Two critical findings, two different models, same commit — one a real cross-tenant security bug I’d introduced while trying to fix a different security bug, one a confident misread of framework internals. Running only one model would have missed half the picture.

What I’ve changed

A few adjustments based on the data:

  1. Gemini is back on. The FP-rate delta alone justifies the API spend — even if it’s 8 points instead of 16.
  2. Pair-agreement is now flagged in the digest, not just full consensus. OpenAI+Gemini agreement especially jumped out as high-signal for runtime bugs.
  3. I read OpenAI’s criticals with more skepticism than Claude’s. Not dismissive — skeptical. If OpenAI is alone on a critical, I verify against the code before acting. If Claude is alone on a critical, I just fix it.
  4. Solo findings are not noise. The original post undersold complementary coverage. Nearly 9 out of 10 findings in the 3-model era (87.6%) came from exactly one model. If I filtered to consensus-only, I’d miss most of what the system actually catches.
  5. Verification is now mandatory, not optional. Every issue gets an automated second pass that classifies each finding against the current code. That’s how you turn “three models flagged something” into a decision you can actually act on — and it’s what made this entire retrospective possible.
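
The adjustments above can be read as a small set of triage rules. This is a sketch of how such rules might be encoded, not the actual digest logic: the `Finding` type, the action labels, and the rule order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    models: frozenset[str]  # which reviewers flagged the finding
    critical: bool          # whether any reviewer labelled it critical


def triage(finding: Finding) -> str:
    """Map a finding to an action label (hypothetical labels and ordering)."""
    models = finding.models
    if len(models) == 3:
        return "fix"           # three-way agreement was always real in this dataset
    if models == {"openai", "gemini"}:
        return "prioritize"    # the pair flagged as high-signal for runtime bugs
    if finding.critical and models == {"claude"}:
        return "fix"           # solo Claude criticals were confirmed on verification
    if finding.critical and models == {"openai"}:
        return "verify-first"  # check against the code before acting
    return "review"            # solo and other findings are still read, not dropped


print(triage(Finding(frozenset({"openai"}), critical=True)))
```

Every label still sits downstream of the mandatory verification pass; the rules only decide how much scrutiny a finding gets before acting on it.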

The original claim, revised

The old framing:

Consensus tells you it’s definitely real. Complementary coverage tells you no single model sees everything. You need both.

The revised framing, with 217 commits of evidence:

Consensus is rare, often wrong when only two models are present, and reliable only at three-way agreement. Complementary coverage is where most real bugs are caught. The model agreement table isn’t a triage filter — it’s a map of which model you should trust on which kind of claim.

Same conclusion, different weights. One LLM still isn’t enough. Two is better than one. Three is better than two by a margin bigger than I expected.

If you’re running a single-model review loop, the data is pretty clear on what you’re leaving on the table.
