Fighting AI with AI is the easy part

I’m partnering with SmartBear on a sponsored series of content about tools for the AI era. As always, all opinions are entirely my own.

SmartBear ran a survey in January 2026 of 273 software testing and quality decision-makers, mostly directors and above, and the writeup, their AI Software Quality Gap Report, lets the numbers speak for themselves. 93% of the surveyed organisations have adopted AI coding tools, 40% now generate at least 41% of their code with AI, 68% are worried that faster AI development will create testing bottlenecks, 70% say application quality is already suffering, and 60% have experienced quality issues in the past year because development is outpacing testing. The line that stopped me is the one underneath all of those: 92% still test manually.

Read them together and you have the arithmetic of the gap. AI is writing the code, people are still checking it, the distance between those two facts widens every week, and the decision-makers running the testing function know.

The response circulating in the discipline, the one I keep hearing in workshops and on conference stages and in the Slack channels I sit in, is “humans need to review more carefully”. I wrote last week about why I think that response is structurally broken: code review was already a weak defect-catcher before AI, the volume of AI output makes the line-by-line model impossible, and the move that actually works is one where another agent reads the code first and the human enters the loop at the point where judgment matters.

“Another agent” hides a choice that matters more than it first looks. The agents most teams reach for first work on the code itself: generating unit tests, reviewing diffs, scoring a pull request for risk. That is the familiar shape of quality work, what the discipline has long called code integrity, and AI slots into it naturally because the code is right there to read. The problem is that code integrity was never the whole job. Hand an AI agent an abstract instruction and it can write clean code that passes every unit test you have and still do things nobody asked for, because “does this code run” and “does this software do what the user needed” are different questions, and AI-generated software fails in the space between them. Answering the second question means moving the check off the source and onto the running application as a user actually meets it, which is what SmartBear calls application integrity: a continuous test of whether the running experience does what it was meant to.

This is what SmartBear built BearQ to do. They call it an “agentic QA system with always-on teammates”. Where a code-review bot reads the diff, BearQ works on the running application itself, exploring it the way a user would, learning the workflows, and adapting when the AI-generated UI shifts underneath it. The framing makes sense to me, because the underlying argument matches what the engineers and testers I speak to are already improvising for themselves.

The conversation about whether to use AI to test AI is finished in any team that has run the volume numbers honestly, and the conversation worth having now is about how to set the testing agents up so they help the team instead of multiplying the problem.

What agentic QA actually solves

Three problems are discussed in SmartBear’s report, and each one is a place where the existing toolkit doesn’t help.

The first is volume. The 92%-still-manual statistic is brutal because you cannot manually test 40%+ AI-generated code at AI-generation speed, and writing the arithmetic out on a whiteboard makes that obvious in a way no staffing plan can mask. Either you ship faster than you can verify, which is what most teams are doing whether they admit it or not, or you slow the developers down, which the management above them will not accept. The exit from that trap is a testing surface that scales with throughput, and the only thing that scales with AI-generated throughput is AI-generated checking.

The second is the brittleness of script-based automation against rapidly-changing UIs. This is the bit SmartBear keeps returning to in their messaging, and they are right that traditional automated testing scripts struggle when the application under test is itself being rewritten by AI assistants weekly. Selectors break, object properties shift underneath the assertions, and the maintenance overhead of the test suite climbs faster than the productivity gain you got from generating it in the first place. An agent that understands what the application is supposed to do, rather than which DOM nodes it currently has, is more durable when the application changes underneath it, and SmartBear’s framing for this is “intent over exact match”, which names a real shift when the underlying UI is moving constantly.
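To make the contrast concrete, here is a toy sketch of exact-match lookup against intent-style lookup. Everything in it is an assumption for illustration: the page is modelled as a list of dicts rather than a real DOM, the element ids and labels are invented, and keyword matching is a crude stand-in for whatever model-driven understanding a real agentic system uses. It says nothing about how BearQ itself is implemented.

```python
# Two versions of the same page, modelled as plain dicts (a toy stand-in for a DOM).
PAGE_V1 = [
    {"id": "btn-7f3a", "role": "button", "label": "Place order"},
    {"id": "lnk-cart", "role": "link", "label": "Back to cart"},
]
# The regenerated UI: ids have changed and the wording has been tweaked.
PAGE_V2 = [
    {"id": "cta-primary", "role": "button", "label": "Place your order"},
    {"id": "nav-cart", "role": "link", "label": "Back to cart"},
]


def find_by_id(page, element_id):
    """Script-style lookup: exact match on an implementation detail."""
    return next((e for e in page if e["id"] == element_id), None)


def find_by_intent(page, role, keywords):
    """Intent-style lookup: first element with this role whose label contains every keyword."""
    wanted = {k.lower() for k in keywords}
    for element in page:
        label_words = set(element["label"].lower().split())
        if element["role"] == role and wanted <= label_words:
            return element
    return None
```

Against PAGE_V2 the id-based lookup for "btn-7f3a" returns None, while the intent-based lookup for a button mentioning "place" and "order" still finds the element. The durability comes from describing what the element is for rather than what it is currently called.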

The third is attention decay. Anyone who has reviewed a long pull request knows the experience: the first hundred lines get real attention, the next four hundred get a skim, and the last hundred get a rubber-stamp. The Cohen study I cited last week put numbers on this; reviewer effectiveness drops sharply after about an hour. Manual testing degrades the same way, and the fortieth click-through of a release-candidate dashboard is not as careful as the first. Agents do not degrade like that, and they check the four-hundredth interaction with the same scrutiny they brought to the first.

Those three problems, taken together, are what BearQ-style systems are built to address, and saying so is not marketing. The problems are real, the existing toolkit struggles with them, and an agentic approach is structurally different from anything the script-based world can offer.

The agent doing the checking needs checking too

Most of the teams I talk to are still working this part out as they go. The move is now common enough that the products exist, and the operational discipline around them is being learnt one installation at a time.

A testing agent, in the broad sense I mean here, is any LLM put in a loop to assess software, whether it reviews code or explores a running application, and that category includes BearQ-style application-integrity systems alongside the code-level reviewers. The same family of system that produces the code is being asked to produce the assessment of the code. Anyone who has spent time with these models in evaluation contexts knows what that produces: family bias, rubric drift, brittleness under perturbation, and a confident output regardless of whether the underlying analysis is sound.

I wrote a few months ago about LLM-as-a-judge for AI product evaluation, and the punchline applies here without much modification. The rubric you give the testing agent is itself a prompt. It will be followed probabilistically. Two models from the same family will tend to like each other’s output more than they should. Instacart, in the public writeup of how they use multimodal judges to evaluate product replacements, found that their evaluator needs continuous calibration against real user behaviour to keep its judgments aligned with what users actually want. The evaluator is itself a tool that needs evaluating, and the moment you forget that, the scores it produces start to drift away from anything meaningful.

There is now empirical evidence that this matters specifically for testing agents. A January 2026 paper called ReliabilityBench¹ puts LLM agents through perturbation strategies, things like synonym substitution and distractor injection that any production environment will produce as a matter of course, and finds that agents which look consistent on clean benchmarks become brittle the moment the inputs vary in ways the benchmark didn’t anticipate. A separate study from the same month, looking at the flakiness of LLM-generated tests across four database systems (SAP HANA, DuckDB, MySQL and SQLite)², found that LLM-generated tests have a higher proportion of flaky outcomes than the existing test suites. The agents are not yet as reliable as the press releases suggest, and the gap is measurable.
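A team can run a rough version of this kind of check on its own agent. The sketch below is my own illustration, not the ReliabilityBench harness: the synonym table, the distractor sentences and the `agent` callable (any function that takes a task description and returns a verdict string) are all hypothetical.

```python
import random

# Toy synonym table; a real check would use a much richer source of paraphrases.
SYNONYMS = {"click": "press", "button": "control", "submit": "send"}
DISTRACTORS = [
    "Note: the footer was restyled last week.",
    "Unrelated: the logo is blue.",
]


def substitute_synonyms(text: str) -> str:
    """Swap each known word for its synonym, leaving other words alone."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())


def inject_distractor(text: str, rng: random.Random) -> str:
    """Append an irrelevant sentence that should not change the verdict."""
    return f"{text} {rng.choice(DISTRACTORS)}"


def consistency(agent, task: str, seed: int = 0) -> float:
    """Fraction of perturbed variants where the verdict matches the clean-task verdict."""
    rng = random.Random(seed)
    baseline = agent(task)
    variants = [
        substitute_synonyms(task),
        inject_distractor(task, rng),
        inject_distractor(substitute_synonyms(task), rng),
    ]
    return sum(agent(v) == baseline for v in variants) / len(variants)
```

An agent that keys on the literal word "click" scores well on the clean task and poorly here, which is the pattern the paper describes: consistency on the benchmark, brittleness once the wording moves.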

A more recent paper, “Willful Disobedience” from March 2026³, looks at the same problem from the other end. It shows that the failures of testing and evaluation agents are visible in their reasoning traces, often well before the outputs go wrong. If you instrument the agent properly, you can see when it is about to fail, before the wrong output reaches you.

The agents are brittle, and the agents are also instrumentable, and setting them up well means working with both of those facts at once.

What careful and intentional setup looks like in practice

Take these five questions as the starting point. They are neither exhaustive nor in priority order; they are the five I have arrived at after watching where leaders and teams get stuck setting this up, in workshops and at conferences and in the conversations that follow.

The first question is what metric tells you whether the agent is doing its job. The metrics worth monitoring for AI maturity, things like override rate and impact on P&L, are mostly not being monitored even where the data to compute them already sits in the system. Override rate is the right metric for any agent of this kind: how often does the human who reads the agent’s findings overrule them? A 90% override rate means the agent is producing noise; a 5% override rate means either that the agent is pitched too low or that the human has stopped reading. Without that number you cannot tell which.
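Override rate is cheap to compute once the review outcome is recorded next to each finding. The sketch below assumes a hypothetical `Finding` record with `reviewed` and `overridden` flags; the field names are mine, not those of any particular product. It also reports review coverage, because a low override rate on its own cannot separate a good agent from a human who has stopped reading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """One agent finding and what the human reviewer did with it (hypothetical schema)."""
    finding_id: str
    reviewed: bool    # did a human actually look at it?
    overridden: bool  # did the human overrule the agent?


def override_rate(findings: list[Finding]) -> Optional[float]:
    """Share of human-reviewed findings that the human overruled.

    Returns None when nothing was reviewed: a rate over zero reviews
    is undefined, not zero.
    """
    reviewed = [f for f in findings if f.reviewed]
    if not reviewed:
        return None
    return sum(f.overridden for f in reviewed) / len(reviewed)


def review_coverage(findings: list[Finding]) -> Optional[float]:
    """Share of all findings that a human actually reviewed."""
    if not findings:
        return None
    return sum(f.reviewed for f in findings) / len(findings)
```

Read the two numbers together. A low override rate with high coverage suggests the findings are being read and accepted; a low override rate with collapsing coverage suggests nobody is looking.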

The second question is who calibrates the agent and how often. Instacart’s pattern is the cleanest public version of an answer: the agent runs against production, its judgments are compared against what the team actually wanted, and the disagreements feed back into how the agent is configured. The teams I have seen do this badly treat the agent’s configuration as a one-off setup task, write a rubric in week one, and never look at it again. The agent drifts, the rubric stops matching what the team cares about, and within six months everyone has forgotten that the assertions the agent makes are based on a snapshot of priorities that no longer applies.
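The comparison step of a calibration loop can be very small. The following is a minimal sketch under my own assumptions, not Instacart’s implementation: verdicts are plain strings keyed by a hypothetical case id, and the human verdicts come from whatever periodic sampling the team runs.

```python
from collections import Counter


def calibration_report(agent_verdicts: dict, human_verdicts: dict):
    """Compare agent and human verdicts on the cases both have judged.

    Returns (agreement_rate, disagreements). The disagreements counter is keyed
    by (agent_verdict, human_verdict), so the rubric owner can see which
    direction the agent is drifting in, not just that it has drifted.
    """
    shared = sorted(agent_verdicts.keys() & human_verdicts.keys())
    if not shared:
        return None, Counter()
    disagreements = Counter(
        (agent_verdicts[case], human_verdicts[case])
        for case in shared
        if agent_verdicts[case] != human_verdicts[case]
    )
    agreement = 1 - sum(disagreements.values()) / len(shared)
    return agreement, disagreements
```

If most disagreements are ("pass", "fail") pairs, the agent is waving through things the team now cares about, and the rubric is the first place to look. Running this on a schedule, rather than once at setup, is what turns it into calibration.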

The third question is what you watch when the agent is running. The Willful Disobedience paper makes the case clearly: the failures are visible in the trace before they reach the verdict, and the operational discipline is to instrument the agent’s reasoning because the reasoning often shows the problem first. What did the agent look at, what did it consider and discard, and how confident was it when it told you the change was safe? An agent whose reasoning trace is opaque is an agent you cannot diagnose, and an agent you cannot diagnose is an agent you cannot improve.
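As a rough illustration of what instrumenting the trace can mean, the sketch below records each step the agent takes and applies a few cheap heuristics to the finished trace. It is not the detection method from the Willful Disobedience paper; the field names and thresholds are hypothetical, and the confidence value is assumed to be self-reported by the agent, which is itself a signal to treat with care.

```python
from dataclasses import dataclass, field


@dataclass
class TraceStep:
    action: str             # what the agent looked at or did
    discarded: bool = False  # True if it considered this and set it aside
    note: str = ""


@dataclass
class AgentTrace:
    verdict: str = "unknown"
    confidence: float = 0.0  # agent's self-reported confidence, 0..1
    steps: list[TraceStep] = field(default_factory=list)

    def record(self, action: str, discarded: bool = False, note: str = "") -> None:
        self.steps.append(TraceStep(action, discarded, note))

    def warnings(self, min_confidence: float = 0.7, min_steps: int = 3) -> list[str]:
        """Cheap heuristics that flag a verdict for human attention."""
        out = []
        if self.verdict == "safe" and self.confidence < min_confidence:
            out.append("safe verdict with low confidence")
        if len(self.steps) < min_steps:
            out.append("verdict reached after very little exploration")
        if self.steps and all(step.discarded for step in self.steps):
            out.append("every inspected item was discarded")
        return out
```

The heuristics matter less than the record itself: once the steps are stored, the three questions above have answers you can read back after the fact.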

The fourth question is what authority the agent has. The teams that get into trouble are the ones who let the agent escalate decisions it should not be making, or who let it auto-merge when it should be flagging for review. The reviewer-agent pattern I described last week works precisely because the agent does not have merge authority; it has finding-publication authority, and the human has the merge call. BearQ-shaped systems can be set up either way, and the setup decision matters considerably more than the choice of product itself.
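The authority line is easier to hold when it is written down as policy rather than left to convention. A minimal sketch, assuming a hypothetical set of actions and actor names; no real product’s permission model is implied:

```python
from enum import Enum


class Action(Enum):
    PUBLISH_FINDING = "publish_finding"
    REQUEST_REVIEW = "request_review"
    MERGE = "merge"


# Hypothetical policy: the agent may report and escalate; only humans merge.
ALLOWED = {
    "agent": {Action.PUBLISH_FINDING, Action.REQUEST_REVIEW},
    "human": {Action.PUBLISH_FINDING, Action.REQUEST_REVIEW, Action.MERGE},
}


def authorise(actor: str, action: Action) -> bool:
    """Deny by default: an actor missing from the policy has no authority."""
    return action in ALLOWED.get(actor, set())
```

Here `authorise("agent", Action.MERGE)` returns False and `authorise("human", Action.MERGE)` returns True. Granting the agent merge authority later becomes a visible one-line change to the policy, which is the point: the decision gets made on purpose.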

The fifth question is what “good enough” means for this product at this floor. I wrote about this earlier in the year, that the quality profession’s actual job is to define what working means at the new floor for this kind of software at this kind of scale, and the same logic applies internally to any specific installation. A testing agent for a financial trading system needs a different definition of acceptable than a testing agent for an internal HR portal, and the setup discipline is not generic; it is shaped by what the application is for and what the cost of getting it wrong looks like for the people who use it.

We are early

Anthropic reported in early 2026 that the 99.9th percentile turn duration for their coding agents had nearly doubled in three months, from under 25 minutes in October 2025 to over 45 minutes by January 2026⁴. The agents are getting longer-running, more autonomous, and more capable of operating without human intervention for extended stretches, and the same shift is showing up in testing agents. The gap between what they can do unsupervised and what the discipline has built up to oversee them is widening, not closing.

Simon Wardley’s framework for technological industrialisation is useful here. He has long argued that practices co-evolve with industrialised technology⁵, and that the practices around a newly industrialised technology take years to converge with the technology itself. We are only a few years into the industrialised machine learning curve, which puts us partway through the period in which the practice is supposed to catch up. We have the products, and we have some of the patterns, but the metrics are mostly missing, the instrumentation is improvised, and the institutional knowledge of what to watch is being learnt on the job, which is what early looks like before the practice has set.

Any team setting up an agentic QA system in 2026 is doing two pieces of work, not one. The first is using the agent to test the application. The second is building the practice around the agent itself: the override-rate dashboards, the calibration loops, the trace instrumentation, the escalation rules. The second piece is the one nobody hands you in the box, and it is the one that decides whether the installation pays off.

What the survey actually asks of us

The most uncomfortable line in the SmartBear report is the one about leadership: 65% of the surveyed decision-makers believe their own leadership does not recognise the AI testing risks. The people running QA know the gap is widening, and they know the agents that will close it need careful setup. They also believe the people above them in the organisation see this largely as a tooling problem that can be solved by buying a product.

The product matters, because tools shape what is possible, and a well-built agentic QA platform gives a team something they did not have before, which is a real advance. The product is also the easy part of the answer. The hard part is the discipline of running it well: defining the metric, calibrating the agent, instrumenting the trace, drawing the authority line, deciding what good enough looks like in your context. That work does not come in the box, and it is the part of the answer that nobody bought when they bought the product.

The teams that do this work get application integrity that actually holds at AI speed: a quality surface that scales with the throughput of AI-generated code, with the human kept in the loop exactly where judgment changes the outcome. The teams that skip it will get more code, less assurance, no recovery path, and a leadership team who thought the product was the whole answer.

Fighting AI with AI is the move, and it is also, for the next several years, a discipline that is still being built. Anyone setting it up now is doing some of that building whether they meant to or not, and the careful and intentional version is the one where the team knows that and designs for it. The other version is the one we will be reading about in next year’s “biggest AI fails” list.

References

¹ “ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions”, arXiv:2601.06112, January 2026.
² “On the Flakiness of LLM-Generated Tests for Industrial and Open-Source Database Management Systems”, arXiv:2601.08998, January 2026.
³ “Willful Disobedience: Automatically Detecting Failures in Agentic Traces”, arXiv:2603.23806, March 2026.
⁴ Anthropic, “Measuring AI agent autonomy in practice”, 2026.
⁵ Simon Wardley, “Anticipation. Chapter 9, Wardley Maps”, Medium, on the co-evolution of practice with technological industrialisation.