#ai #review

One LLM Isn't Enough

Chris Jones · @leek
Apr 3, 2026 · 5 min read · 900 words

I write most of my code with AI now. If you’ve read anything else on this blog, that’s not a surprise. But here’s the thing nobody talks about: if you’re leaning on a single model to write your code and review it, you have a major blind spot.

Every model hallucinates. Every model has context limits. Every model was trained on different data, at different times, with different priorities. Claude is not GPT is not Gemini. They don’t think the same way, they don’t fail the same way, and they don’t catch the same things.

If you’re using one model to write code and that same model to check it, you’re asking the student to grade their own homework.

The problem with one model

LLMs are confident. Dangerously confident. A model will generate code that looks clean, passes a gut check, and introduces a subtle bug that the same model will happily overlook when you ask it to review. Not because it’s bad at reviewing — because it has the same blind spots it had when it wrote the code.

I’ve seen this firsthand. A refactor drops a scoping filter from a database query. The code looks fine. The model that wrote it thinks it’s fine. But now you’ve got a query that can return data belonging to the wrong tenant. That’s not a style nit. That’s a production incident.

Context windows make this worse. A model reviewing a diff doesn’t have the full picture of your codebase. It doesn’t know what the old code did unless you show it. It fills in the gaps with assumptions, and those assumptions are where bugs hide.

Multiple models, different blind spots

The fix isn’t a better prompt. It’s a second opinion. And a third.

I run every commit through three models: Claude, GPT, and Gemini. They review the same diff independently, and their findings get merged into a single report with a model agreement table — which models flagged which issues, and where they agree.
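
A minimal sketch of what a model agreement table could look like. The findings, file paths, and the rule of matching on an exact file and line are all assumptions made for illustration; real model output rarely lines up this neatly and would need fuzzier matching.

```python
from collections import defaultdict

# Hypothetical findings: (model, file, line, short description).
# Illustrative values only, not output from any real review.
findings = [
    ("claude", "app/models.py", 42, "query is missing tenant filter"),
    ("gpt", "app/models.py", 42, "tenant scope dropped in refactor"),
    ("gemini", "app/models.py", 42, "cross-tenant read possible"),
    ("gpt", "app/forms.py", 17, "missing validation on email"),
]

MODELS = ("claude", "gpt", "gemini")


def agreement_table(findings):
    """Group findings by (file, line) and record which models flagged each."""
    flagged = defaultdict(set)
    for model, path, line, _description in findings:
        flagged[(path, line)].add(model)
    return dict(flagged)


def render(table):
    """Render one plain-text row per location, with an x for each model that flagged it."""
    rows = []
    for (path, line), models in sorted(table.items()):
        marks = " ".join("x" if m in models else "-" for m in MODELS)
        rows.append(f"{path}:{line}  {marks}  ({len(models)}/{len(MODELS)})")
    return "\n".join(rows)


table = agreement_table(findings)
print(render(table))
```

Rows marked 3/3 are the consensus findings; rows marked 1/3 are the ones that need a closer look before acting.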

When all three models independently flag the same line, it has been a real bug every time so far.

That’s the consensus signal. If Claude, GPT, and Gemini all say “this line has a problem,” I fix it immediately. No further investigation needed.

But consensus is only half the value. The other half is complementary coverage. In complex commits, each model often finds different real issues. One model catches a null reference. Another catches a missing validation. A third catches a security hole in the same code that the other two ignored completely.

I had a single commit where three distinct critical bugs were found — data loss in one path, an overwrite in another, a constraint violation in a third. Each bug was caught by a different model. No single model found all three. The union of their findings was the only complete picture.
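
The difference between consensus and complementary coverage can be sketched with plain sets. Which model caught which bug is an assumption here, since the commit above is only described in outline.

```python
# Hypothetical assignment of bugs to models, for illustration only.
found = {
    "claude": {"data loss"},
    "gpt": {"overwrite"},
    "gemini": {"constraint violation"},
}

# Complementary coverage: everything any model found.
union = set().union(*found.values())

# Consensus: only what every model found.
consensus = set.intersection(*found.values())

print(sorted(union))
print(consensus)
```

In this case the consensus set is empty while the union holds all three bugs, which is why relying on agreement alone would have missed every one of them.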

Two layers: local and background

I use two tools for this, and they serve different purposes.

For local, in-flight work — planning, writing, checking as I go — I use Aaron Francis’s counselors. It’s a Claude Code skill that lets you get parallel second opinions from multiple models while you’re working. Think of it as a quick sanity check before you commit.

For background review, I have a GitHub Actions workflow that triggers on every push to main. It extracts the diff, sends it to Claude, GPT, and Gemini in parallel, collects their findings, deduplicates them, and creates a GitHub issue if anything critical surfaces. It includes a model agreement table showing exactly which models flagged each finding.
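
The shape of that pipeline can be sketched roughly as follows. This is not the actual workflow: the reviewer functions are stand-ins that return canned findings instead of calling any provider API, and the finding fields (`file`, `line`, `severity`) are assumed for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-in reviewers. A real workflow would call each provider's API here;
# these return canned findings so the sketch runs offline.
def review_claude(diff):
    return [{"file": "app/models.py", "line": 42, "severity": "critical"}]


def review_gpt(diff):
    return [
        {"file": "app/models.py", "line": 42, "severity": "critical"},
        {"file": "app/forms.py", "line": 17, "severity": "minor"},
    ]


def review_gemini(diff):
    return []


REVIEWERS = {"claude": review_claude, "gpt": review_gpt, "gemini": review_gemini}


def review_in_parallel(diff):
    """Send the same diff to every reviewer concurrently."""
    with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
        futures = {name: pool.submit(fn, diff) for name, fn in REVIEWERS.items()}
        return {name: future.result() for name, future in futures.items()}


def deduplicate(results):
    """Merge findings that point at the same file and line, tracking who flagged them."""
    merged = {}
    for model, findings in results.items():
        for finding in findings:
            key = (finding["file"], finding["line"])
            entry = merged.setdefault(key, {**finding, "models": []})
            entry["models"].append(model)
            # Keep the most severe rating any model assigned.
            if finding["severity"] == "critical":
                entry["severity"] = "critical"
    return list(merged.values())


def should_open_issue(merged):
    """Open an issue only when at least one merged finding is critical."""
    return any(f["severity"] == "critical" for f in merged)


results = review_in_parallel("example diff text")
merged = deduplicate(results)
print(merged)
print(should_open_issue(merged))
```

Threads fit here because each review is a network-bound call in practice, so the three requests overlap instead of running back to back.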

I work on personal projects alone. I don’t submit PRs to myself. This workflow is my code review team.

I’ve open-sourced the workflow as a gist if you want to see how it works.

What the data actually shows

After running this on hundreds of commits, a few patterns are clear.

Refactor regressions are the most common real bug. Renaming, restructuring, moving code around — these are the commits where logic gets silently dropped. Models are surprisingly good at catching when “the new code doesn’t do everything the old code did.”

Security issues get the highest model agreement. Cross-tenant bugs, file inclusion, injection vectors — when it’s a security problem, the models tend to converge.

And the model agreement table changes how I work. When I get an issue, I don’t just read the findings. I look at the agreement. If two or three models flagged it, I fix it. If only one model flagged something with low confidence, I usually move on. The table is the triage.
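
Written as code, that triage rule might look like the sketch below. The confidence labels and the "investigate" outcome for a single confident model are assumptions; the rule above only covers the two-or-more case and the single low-confidence case.

```python
def triage(flagged_by, confidence="low"):
    """Decide what to do with a finding based on how many models flagged it."""
    if len(flagged_by) >= 2:
        return "fix"
    if confidence == "low":
        return "skip"
    # Assumption: a lone model with higher confidence deserves a manual look.
    return "investigate"


print(triage({"claude", "gpt", "gemini"}))
print(triage({"gemini"}, confidence="low"))
print(triage({"gemini"}, confidence="high"))
```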

Consensus tells you it’s definitely real. Complementary coverage tells you no single model sees everything. You need both.

Why this matters

AI-assisted development is moving fast. Most people are focused on how to generate code faster. Not enough people are thinking about how to catch the mistakes that come with moving that fast.

If you’re a solo developer shipping with AI, you don’t have a teammate to catch the thing the model missed. But you can build one. Three API calls and a shell script, and you have a review team that never sleeps and never rubber-stamps.

The models are good enough to write the code. They’re also good enough to check each other’s work.
