Making a multi-agent code reviewer that doesn't cry wolf

2026 / Sole-built at IT-Bauschmiede / Python + Claude Agent SDK + FastAPI + asyncio + GitLab + Docker

TLDR

Designed and built an internal multi-agent code review bot on the Claude Agent SDK as sole author. The bot runs in CI on every merge request via GitLab webhooks, fans out into multiple parallel review lanes, then gates each finding through a verification chain before posting. I used it on real merge requests in the production codebase. The interesting work was not getting agents to find issues. It was getting them to stop reporting issues that were not actually there. The architecture borrows many of its principles from Claude's GitHub code review agent and adapts them for GitLab.

What I was solving

The motivation was to explore building something like Claude's own code review bot (a multi-agent reviewer that runs on every merge request), while reducing the false-positive rate so the bot doesn't cry wolf. AI-assisted development has changed the rate at which merge requests appear, and code arrives faster than careful human review can keep up with on a small team. But the off-the-shelf AI review tools I had seen all failed in the same way: they pattern-match. They see a code shape that resembles an anti-pattern and post a comment, regardless of whether the surrounding code makes the pattern harmless. When most of the comments are wrong, the comments stop being read.

I wanted the opposite: every comment defensible, so a developer reading one has reason to take it seriously. That is a verification problem, not a coverage problem. The architecture I built solves for that.

The concepts that shaped the architecture

A few principles drove most of the design decisions:

Deterministic first, LLM second. This is about token efficiency. The obvious issues (no tests on changed code, regex-matchable security patterns, the trivial filters that decide whether a diff is even worth reviewing) should be caught by deterministic scripts before any LLM is invoked. That way the expensive LLM reasoning is reserved for the edge cases where reasoning actually matters, instead of being spent re-discovering trivial things a grep could find.
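A minimal sketch of what such a deterministic pre-check could look like. Everything here is an assumption for illustration: the function name precheck, the two regex patterns, the skipped file suffixes and the test-file convention are hypothetical, not the bot's actual rules.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns; a real deployment would maintain its own list.
SECRET_PATTERNS = {
    "hardcoded AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private key material": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}
# Assumed file types that are not worth an LLM review on their own.
SKIP_SUFFIXES = (".lock", ".md", ".svg")


@dataclass
class PrecheckResult:
    needs_llm_review: bool
    findings: list = field(default_factory=list)


def is_test_file(path: str) -> bool:
    """Assumed convention: tests live under tests/ or are named test_*."""
    name = path.rsplit("/", 1)[-1]
    return "/tests/" in f"/{path}" or name.startswith("test_")


def precheck(changed_files: list, added_lines: list) -> PrecheckResult:
    """Run the cheap checks before any model is invoked."""
    reviewable = [f for f in changed_files if not f.endswith(SKIP_SUFFIXES)]
    if not reviewable:
        # Trivial filter: nothing in this diff justifies spending tokens.
        return PrecheckResult(needs_llm_review=False)

    findings = []
    # Regex-matchable security patterns on the added lines only.
    for label, pattern in SECRET_PATTERNS.items():
        if any(pattern.search(line) for line in added_lines):
            findings.append(label)

    # Source changed but no test file touched.
    source_changed = any(
        f.endswith(".py") and not is_test_file(f) for f in reviewable
    )
    tests_changed = any(is_test_file(f) for f in reviewable)
    if source_changed and not tests_changed:
        findings.append("source changed without accompanying test changes")

    return PrecheckResult(needs_llm_review=True, findings=findings)
```

Findings from a step like this can be posted or handed to the lanes as known facts, so no lane spends tokens rediscovering them.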

Multiple lenses, not one super-reviewer. The bot fans the diff out across several review lanes in parallel. Each lane looks at one specific dimension — does this comply with the documented standards, what could it break elsewhere, does it implement what the linked issue asks for, are there design choices that fail under pressure, is it deployable and rollback-safe, are the tests meaningful, and a final senior-engineer-style holistic pass. One lane uses Codex as an additional code reviewer, so the system isn't single-vendor on its judgement. The shape of this is inspired by Andrej Karpathy's "council of agents" concept: multiple specialised reviewers each casting one vote beats a single generalist trying to hold everything in its head.

Fresh, isolated context per lane. This follows Anthropic's guidance on harness design for long-running agents: no LLM should evaluate its own output. Each lane runs in its own context so confirmation bias from one stage can't leak into the next.

Model tier matched to the lane's job. Pattern-matching lanes get a faster, cheaper model. Lanes that need to reason about intent or design alternatives get the heavier one. Choosing per-lane was much cheaper than running every lane on the strongest model.
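The fan-out described above can be sketched with plain asyncio. This is an illustration under stated assumptions, not the bot's code: the lane names, the tier labels "fast" and "heavy", and the run_lane callable are placeholders, and the actual agent call is deliberately left out so that no SDK signature is guessed at.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class Lane:
    name: str
    model_tier: str  # placeholder labels: "fast" or "heavy"


# Hypothetical lane table; names paraphrase the dimensions listed above.
LANES = [
    Lane("standards", "fast"),
    Lane("blast_radius", "heavy"),
    Lane("issue_alignment", "heavy"),
    Lane("tests", "fast"),
]

# run_lane stands in for whatever actually starts an agent with a fresh context.
LaneRunner = Callable[[Lane, str], Awaitable[list]]


async def fan_out(diff: str, run_lane: LaneRunner, max_parallel: int = 3) -> dict:
    """Run every lane on the same diff, at most max_parallel at a time."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(lane: Lane) -> list:
        async with semaphore:
            # Each call receives only the lane and the diff: no findings or
            # conversation history from any other lane is passed in.
            return await run_lane(lane, diff)

    results = await asyncio.gather(
        *(guarded(lane) for lane in LANES), return_exceptions=True
    )

    findings_by_lane = {}
    for lane, result in zip(LANES, results):
        # A crashed lane contributes no findings instead of failing the review.
        findings_by_lane[lane.name] = (
            [] if isinstance(result, BaseException) else result
        )
    return findings_by_lane
```

In this sketch the tier travels with the lane definition, so changing a lane's model is a one-line edit to the table.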

The lanes produce a list of findings. The harder problem starts there.

Verify before flagging

Anyone can fan out parallel agents at a diff. Deciding which of their findings deserve a developer's attention is where the real work is.

Everything downstream of the lanes follows one rule: verify before flagging. The biggest source of false positives, every time, was an agent matching on code shape without checking whether the matched shape was actually reachable or exploitable. The fix is structural: before any finding gets posted, it has to be confirmed against the real code.

That happens in stages. A per-finding validator traces the claim back through actual execution paths. For example, a lane might say "N+1 query on this GET path." The validator follows the path view → queryset → serializer → response and checks whether a serializer is actually involved. If the view returns plain dicts, the per-row queries the lane assumed never happen, so there is no N+1 and the finding gets marked invalid.

When a lane and its validator disagree, a second-opinion validator breaks the tie. If both validators say the finding is wrong, it gets dropped. If the second opinion sides with the lane, the body of the finding gets rewritten to reflect only what was confirmed — so a developer never reads a scary title followed by a debunked premise.
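The tie-break rule can be written down as a small decision function. This is a sketch, assuming a two-value verdict and a finding with just a title and a body; the names Verdict, Finding and resolve are hypothetical. The second opinion is passed as a callable so it only runs when the first validator disagrees with the lane.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Callable, Optional, Tuple


class Verdict(Enum):
    VALID = "valid"
    INVALID = "invalid"


@dataclass(frozen=True)
class Finding:
    title: str
    body: str


# Returns the second validator's verdict plus the text it could confirm.
SecondOpinion = Callable[[], Tuple[Verdict, str]]


def resolve(
    finding: Finding, first: Verdict, second_opinion: SecondOpinion
) -> Optional[Finding]:
    """Apply the tie-break rule; None means the finding is dropped."""
    if first is Verdict.VALID:
        # Lane and validator agree, no tie to break.
        return finding

    verdict, confirmed_body = second_opinion()
    if verdict is Verdict.INVALID:
        # Both validators say the finding is wrong.
        return None

    # The second opinion sides with the lane: keep the finding, but replace
    # the body with only what was actually confirmed.
    return replace(finding, body=confirmed_body)
```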

Surviving findings go to a judge that handles the deterministic post-processing: confidence-threshold filtering, semantic dedup so the same conceptual issue from multiple lanes collapses into one annotated finding, severity calibration when the validator's notes contain markers like "harmless" or "theoretical," and a hard cap on findings per review so reviewers don't drown.
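A rough sketch of that post-processing order follows. The threshold, the cap, the severity ladder and the field names are assumptions. The dedup step is also a deliberate simplification: it groups on an exact (file, topic) key as a stand-in, where the bot described here deduplicates semantically.

```python
from dataclasses import dataclass, replace

SEVERITIES = ["info", "minor", "major", "critical"]  # assumed ladder, low to high
DOWNGRADE_MARKERS = ("harmless", "theoretical")


@dataclass(frozen=True)
class Finding:
    lane: str
    file: str
    topic: str
    severity: str
    confidence: float
    validator_notes: str = ""
    also_reported_by: tuple = ()


def judge(findings: list, min_confidence: float = 0.7, max_findings: int = 5) -> list:
    # 1. Confidence-threshold filtering.
    kept = [f for f in findings if f.confidence >= min_confidence]

    # 2. Severity calibration: step down one level when the validator's
    #    notes contain a marker such as "harmless" or "theoretical".
    calibrated = []
    for f in kept:
        notes = f.validator_notes.lower()
        if any(marker in notes for marker in DOWNGRADE_MARKERS):
            lower = max(SEVERITIES.index(f.severity) - 1, 0)
            f = replace(f, severity=SEVERITIES[lower])
        calibrated.append(f)

    # 3. Dedup (exact-key stand-in for semantic dedup): the most confident
    #    report wins and is annotated with the other lanes that raised it.
    merged = {}
    for f in sorted(calibrated, key=lambda f: f.confidence, reverse=True):
        key = (f.file, f.topic)
        if key in merged:
            winner = merged[key]
            merged[key] = replace(
                winner, also_reported_by=winner.also_reported_by + (f.lane,)
            )
        else:
            merged[key] = f

    # 4. Hard cap, most severe and most confident first.
    ranked = sorted(
        merged.values(),
        key=lambda f: (SEVERITIES.index(f.severity), f.confidence),
        reverse=True,
    )
    return ranked[:max_findings]
```

Calibrating before ranking matters here: a downgraded finding competes for the capped slots at its corrected severity, not its original one.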

A final fact-check agent reads each surviving finding next to the code one more time and confirms its claims are still true. That is the last gate before anything reaches the developer.

How I tuned it

The first version of the bot worked in the narrow sense that it fanned out lanes in parallel and posted comments. It also cried wolf constantly. Every round of tuning after that went the same way: the bot posted a comment, a developer correctly disagreed, I traced the disagreement back to the underlying confusion, and I added a counter-example to the lane's prompt or to the validator's checklist. To accelerate this, I ran the bot against previous merge requests and compared its findings to the human review comments those MRs had received. That gave a much larger pool of "the bot said X, the human said Y" data to learn from than waiting for live MRs alone. After a few rounds of iteration, the prompts contained a small library of "this looks like X but isn't X" patterns, and the false-positive rate was low enough that I was willing to ship.

Tuning also went beyond prompts. A lot of the early false positives were really issues that should have been caught earlier in the pipeline: by static analysis, by linters, by CI gates. Where I could, I pushed those checks left into the pipeline itself, so the bot no longer had to flag them and could focus its attention on things deterministic tooling can't catch.

The corrections from this loop don't change the structure of the verification chain; they go into the lane prompts and validator checklists (and, where appropriate, the pipeline). The chain itself is generic. The prompts are where domain knowledge accumulates. That separation is what made it possible to keep iterating: every false positive teaches a prompt one more thing without changing the bot's architecture.

Operational decisions I made

A few choices worth flagging:

  • Wiki-editable agent prompts with a TTL cache. Tuning behaviour means editing a wiki page, not redeploying. The whole team can read the prompts and adjust them, so the bot's behaviour is not hidden behind a deploy cycle. Iteration would not have been feasible at deploy speed.
  • Shadow mode by default. Every new version of the bot ships in shadow first: it logs what it would have posted, but doesn't post. I read the logs for a few days before flipping it live. This caught more behavioural regressions than I expected.
  • Hard cost cap per review. A budget tracker kills any run that goes over a configured ceiling. Removes the worry that a runaway lane burns money.
  • Persistent bare clone with worktrees. Cloning a repo per review is slow when reviews happen on every push. A persistent bare clone plus a per-review worktree shaves seconds off every review and also lets multiple reviews run in parallel against the same repo without stepping on each other.
  • Block web access from agent contexts. Diffs are untrusted input, so this is a prompt-injection mitigation. An agent reviewing a hostile diff with web access is one well-crafted comment away from being instructed to exfiltrate or to post LGTM. Closing that vector is cheap and necessary.
  • Trigger allowlist. Only specific users can ask the bot for a fresh review by mention. Prevents the bot from being weaponised by anyone with repo access.
  • Run only after the CI pipeline completes. The bot triggers after CI has finished, not in parallel with it. If the pipeline fails, there's no point spending tokens reviewing code that's already known to be broken — the author needs to fix CI first.
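Two of these choices, the TTL-cached wiki prompts and the per-review cost cap, are small enough to sketch. Both classes below are illustrative assumptions: the names, the default TTL, the US-dollar unit and the fetch callable (which stands in for the wiki lookup) are hypothetical.

```python
import time
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when a review run goes over its configured ceiling."""


class BudgetTracker:
    def __init__(self, ceiling_usd: float) -> None:
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record the cost of one model call; abort the run if over budget."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.ceiling_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.2f} USD, ceiling is {self.ceiling_usd:.2f} USD"
            )


class TTLPromptCache:
    """Cache prompt text per wiki page and refetch once an entry expires."""

    def __init__(
        self,
        fetch: Callable[[str], str],  # stand-in for the wiki lookup
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # page -> (fetched_at, text)

    def get(self, page: str) -> str:
        now = self._clock()
        entry = self._entries.get(page)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        text = self._fetch(page)
        self._entries[page] = (now, text)
        return text
```

With a cache like this, a prompt edit on the wiki takes effect within one TTL window without a redeploy, and a run whose calls are all routed through charge stops at the first call that crosses the ceiling.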

The honest trade-off

The validator's "trust me" property (when the validator says a finding is invalid, the finding gets dropped) is load-bearing. It's how the bot stays quiet enough to be trusted. But it does mean a confused validator can hide a real issue. The second-opinion stage is the mitigation; it isn't a guarantee. I would not run this system without shadow-mode rollout and ongoing review of its logs. Human review of diffs remains critical; this system helps surface where to focus that human attention.

Technologies and patterns

Python · FastAPI · asyncio (with Semaphore for parallelism control) · Claude Agent SDK (Sonnet for structured analysis lanes, Opus for reasoning-heavy lanes) · cross-provider opt-in lane via CLI · GitLab webhooks + REST API · OAuth2 with PKCE · Docker · persistent bare-clone + worktree pattern · wiki-as-config (TTL cache) · deterministic post-processing (severity calibration, semantic dedup, cost capping) · prompt-injection mitigations (sanitisers, web-access block, trigger allowlist) · shadow-mode rollout