The Wrong Human-in-the-Loop

Here is what I do when I write code these days. I describe the change I want to one agent and let it write the code. I hand the result to a second agent, configured as a reviewer, using Bryan Finster’s Agentic Dev Team plugin for Claude. The reviewer reads what was written and gives me a summary: what it found, what it thinks I should know, what looks risky. I read the summary. I decide what to do with each finding.

Sometimes I fix the issue myself, because the fix is small and the right move is obvious. Sometimes I ask the agent to fix it, because the agent is faster and the fix is mechanical. Sometimes I leave it, because the finding is real but not urgent, and I create a GitHub ticket so it surfaces the next time I sit down to triage. Sometimes I ignore the finding entirely, because the agent has flagged something that doesn’t matter in this codebase or for this change. That sequence is the loop.
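A minimal sketch of that loop in Python, for readers who think in code. Everything here is hypothetical: the `Finding` and `Action` names, the example findings, and the `my_calls` lookup are illustrations, not part of the Agentic Dev Team plugin. The one design point it tries to show is that the decision itself is passed in as `decide`, which stands in for the human. The code only routes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Action(Enum):
    FIX_MYSELF = "fix it myself"   # small fix, obvious move
    ASK_AGENT = "ask the agent"    # mechanical fix, agent is faster
    TICKET = "file a ticket"       # real but not urgent
    IGNORE = "ignore"              # doesn't matter for this codebase or change


@dataclass(frozen=True)
class Finding:
    summary: str
    location: str


def triage(
    findings: Iterable[Finding],
    decide: Callable[[Finding], Action],
) -> dict[Action, list[Finding]]:
    """Route each finding into the bucket the human chose for it."""
    buckets: dict[Action, list[Finding]] = {action: [] for action in Action}
    for finding in findings:
        buckets[decide(finding)].append(finding)
    return buckets


# Hypothetical reviewer output, plus a lookup standing in for my decisions.
findings = [
    Finding("Unclosed file handle", "io.py"),
    Finding("Duplicated validation logic", "forms.py"),
    Finding("Missing retry on flaky call", "client.py"),
    Finding("Naming differs from agent's preference", "utils.py"),
]
my_calls = {
    "Unclosed file handle": Action.FIX_MYSELF,
    "Duplicated validation logic": Action.ASK_AGENT,
    "Missing retry on flaky call": Action.TICKET,
    "Naming differs from agent's preference": Action.IGNORE,
}

buckets = triage(findings, lambda f: my_calls[f.summary])
for action, items in buckets.items():
    print(f"{action.value}: {[f.summary for f in items]}")
```

In practice the `decide` step is me reading the summary, not a dictionary. The sketch only makes the four exits from the loop explicit.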

I am describing the loop first because most of the conversation about AI code quality assumes a different shape. The default assumption is that the human reads every line. I keep seeing it in talks at conferences I’ve spoken at, in workshops I’ve run, and in the threads I read on Ministry of Testing. “Humans need to review all of it,” people say. “That’s how we sift through the slop.” That has the ring of a responsible answer, but it is structurally impossible, and it rests on a misreading of what code review has ever done.

The volume problem is real and measurable

Start with the arithmetic. AI tooling produces code at a rate several times what an unaided engineer can produce, by every measurement that has been published. If you redirected every QA engineer and every software engineer at your company to do nothing but review AI output, the queue would still grow faster than the team could clear it, because the agents keep producing code faster than people can read it. Whatever productivity you bought with the AI evaporates into the bottleneck you’ve created by trying to inspect everything.
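The arithmetic can be sketched as a toy queue model. The numbers below are invented for illustration and are not taken from any of the studies cited here. The only claim is structural: whenever daily output exceeds daily review capacity, the backlog grows by the difference every day.

```python
def review_backlog(
    days: int,
    generated_per_day: int,
    reviewers: int,
    reviewed_per_reviewer_per_day: int,
) -> list[int]:
    """Return the unreviewed backlog (in lines) at the end of each day."""
    capacity = reviewers * reviewed_per_reviewer_per_day
    backlog = 0
    history = []
    for _ in range(days):
        # New code arrives, reviewers clear what they can, never below zero.
        backlog = max(0, backlog + generated_per_day - capacity)
        history.append(backlog)
    return history


# Hypothetical team: 10 reviewers at 800 carefully read lines a day each,
# facing 10,000 generated lines a day.
week = review_backlog(
    days=5,
    generated_per_day=10_000,
    reviewers=10,
    reviewed_per_reviewer_per_day=800,
)
print(week)  # [2000, 4000, 6000, 8000, 10000]
```

Under these assumed numbers the team ends the first week a full day of output behind, and hiring does not help unless capacity overtakes generation outright.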

This is not theoretical. GitClear’s 2025 report on Copilot code quality measured defect rates and code churn in real codebases that adopted AI assistance and found both rising.[1] A February 2026 study analysed 33,596 pull requests authored by AI agents across Copilot, Codex, Claude Code, Devin, and Cursor, and found that reviewer engagement patterns are shifting depending on which agent wrote the code, with merge rates ranging from 43% to 82%.[2] The review process is becoming agent-shaped whether anyone planned for it or not.

Code review was already a weak defect-catcher

The deeper problem is that human code review never reliably did the thing we want it to do now. The empirical research has been consistent for over a decade.

Bacchelli and Bird’s 2013 study at Microsoft found a gap between what developers expected from code review (catching defects) and what it actually delivered (knowledge sharing, code improvement, team alignment).[3] Sadowski and colleagues studied review at Google in 2018 across nine million reviewed changes and found that most comments were about readability and education, not bugs.[4] A 2024 study replicating those findings in a mid-sized company saw the same pattern.[5] The Cohen study at SmartBear and Cisco, still the source of the most-cited number on the topic, showed reviewer effectiveness dropping sharply after about an hour and 400 lines of code.[6]

None of this means review is useless. It means review’s dominant value is in keeping the team aligned and the codebase coherent, not in catching the defects we’re now asking it to catch from AI output. We are asking humans to do more of the very thing that, according to over a decade of research, does not do what we think it does.

So the response of “let’s review every line of AI’s output” is doubly broken. The premise that review capacity can scale to match the volume is wrong. And the premise that line-by-line review reliably catches bugs is wrong on its own, before AI even enters the picture.

What the reviewer agent preserves

The practices with stronger evidence for defect reduction live somewhere else. Test-driven development, when measured carefully, reduces defects substantially: one study of four industrial teams reported defect-density reductions of 40–90%.[7] Trunk-based development with small, continuously-integrated changes outperforms long-running branches with batched review on every operational metric measured by DORA over the last decade.[8] Pair programming sits in the same family and gets honourable mention, though the meta-analyses on it are less conclusive than the popular story suggests.

The reviewer-agent-plus-human-triager pattern fits in this family. What it preserves is the bit of code review that always worked: a second pair of eyes on the code, looking for the obvious problems, before the change is merged. What it abandons is the bit that never worked: a single tired reviewer trying to maintain attention across hundreds or thousands of lines of someone else’s code. The agent does not get tired, and it does not skim past line 380 because it has been reading for 90 minutes. The human enters the loop at the point where judgment is needed, not before.

But isn’t this just stacking AI on AI?

The strongest pushback to all of this is that a human who reads the reviewer agent’s findings is not reading the code, and you have therefore added a layer that can fail silently. The reviewer agent does not know what it missed. A tired human at least knows they are tired. If we accept this pattern, we are conceding that nobody read the code, and that is a step backwards. That is the right objection to raise, and it deserves a careful answer.

The objection assumes a choice between an attentive human reading every line and a reviewer agent reading every line. That is not the choice on offer. The moment you insist on the first option, the volume of AI output turns the attentive human into a half-attentive one, skimming the parts they can stand to read. The reviewer agent is a defence against the world we actually work in, where nobody is going to read every line carefully.

The reviewer agent can also be instrumented in ways the tired human cannot. You can track what it flagged. You can track which findings the triager accepted and which they overruled. You can measure how often something it didn’t flag turned up later as a defect. That feedback loop, applied to the agent, is the version of “more skilled review” that can keep getting better, because the agent’s behaviour can be inspected and tuned. The version that asks a human to attend harder for longer has never worked, in the workshops I’ve run or the codebases I’ve looked at, and the research suggests it never will.
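Here is a rough sketch of what that instrumentation could look like. The `ReviewerStats` class and its field names are hypothetical, and the escape rate rests on a simplifying assumption: every accepted finding counts as a real problem, and every defect that turns up later unflagged counts as a miss.

```python
from dataclasses import dataclass


@dataclass
class ReviewerStats:
    flagged: int = 0     # findings the reviewer agent raised
    accepted: int = 0    # findings the triager acted on
    overruled: int = 0   # findings the triager dismissed
    escaped: int = 0     # later defects the agent never flagged

    def record_finding(self, accepted: bool) -> None:
        self.flagged += 1
        if accepted:
            self.accepted += 1
        else:
            self.overruled += 1

    def record_escaped_defect(self) -> None:
        self.escaped += 1

    def acceptance_rate(self) -> float:
        """Share of the agent's findings the triager agreed with."""
        return self.accepted / self.flagged if self.flagged else 0.0

    def escape_rate(self) -> float:
        """Share of known real problems that the agent did not flag."""
        known = self.accepted + self.escaped
        return self.escaped / known if known else 0.0


# Hypothetical history: four findings triaged, one defect found later.
stats = ReviewerStats()
for was_accepted in (True, True, True, False):
    stats.record_finding(was_accepted)
stats.record_escaped_defect()

print(f"acceptance rate: {stats.acceptance_rate():.2f}")
print(f"escape rate:     {stats.escape_rate():.2f}")
```

A falling acceptance rate suggests the agent is flagging noise. A rising escape rate suggests it is missing a class of problem. Either one is a signal to go and tune the reviewer's configuration, which is exactly the lever a tired human reviewer does not offer.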

Where this applies, and where it doesn’t

This argument is about teams running AI code generation at meaningful volume. If your team has one engineer using Copilot for autocomplete a few times a day, traditional review still covers it, because volume hasn’t broken anything yet. The reviewer-agent pattern becomes load-bearing at the point where every reviewer’s queue is already past what attention can cover.

For most of the people I talk to in workshops and at conferences, that point has already arrived. They are being told by management to be “twice as productive” because AI exists. They are also being told that quality cannot drop. They are responsible for both ends of a contradiction. The reviewer-agent pattern is one of the few moves that gives them a way to honour both, because it puts the human’s time where it matters and lets the agent absorb the part of the work that was always tedious and always unreliable.

The human’s actual job

The human in this loop is not reading every line. The human is making judgments. Which findings matter for this change. Which findings can wait. Which findings reveal a pattern that should be fixed at the source. When to push back on the agent. When to ask a colleague. When to ship.

That is the part of the work that the QA leads and engineers I speak to are best positioned to keep doing. It is the part nobody is automating. It is the part that distinguishes a senior tester from a junior one and a senior engineer from a junior one. The people I meet who are most afraid that AI is being used to make them redundant are the people whose judgment is the most concentrated form of value in any team they work in. Putting them in front of every line of agent output is the surest way to dilute it.

The loop worth putting them in is the one where their judgment does the work: deciding what the agent’s output means, and what to do about it.


  1. GitClear, “AI Copilot Code Quality: Evaluating 2024’s Increased Defect Rate” (2025).
  2. “How AI Coding Agents Communicate: A Study of Pull Request Description Characteristics and Human Review Responses,” arXiv 2602.17084 (February 2026).
  3. Bacchelli & Bird, “Expectations, Outcomes, and Challenges of Modern Code Review,” Microsoft Research (2013).
  4. Sadowski et al., “Modern Code Review: A Case Study at Google” (2018).
  5. “Developer perceptions of modern code review processes in practice: Insights from a case study in a mid-sized company,” ScienceDirect (2024).
  6. Cohen et al., SmartBear / Cisco code review effectiveness study (2006).
  7. Nagappan, Maximilien, Bhat & Williams, “Realizing quality improvement through test driven development: results and experiences of four industrial teams,” Empirical Software Engineering (2008). Reported defect-density reductions of 40–90% across the four teams studied.
  8. DORA / Accelerate, multi-year research on continuous delivery and trunk-based development.
