Someone posted a thinking piece in a Slack channel I’m in last week, long and earnest and well-structured, arguing that quality engineering needs to evolve for the age of agentic AI, that we need to stop thinking about quality as testing and start treating it as a systemic property. I read it and felt a familiar irritation, and the irritation wasn’t with the argument. The irritation was with the fact that I’ve seen it before, in almost identical language, every eighteen months for the past decade. The faces change and the vocabulary updates, but the insight stays the same, and the profession absorbs it, nods along, and keeps doing what it was already doing.
This is the cycle I keep watching from workshops and conference stages: a new technology arrives, it makes our assumptions visible, someone writes up “quality is systemic” as though the thought just occurred to them, and we all agree enthusiastically and change nothing.
The same verb, conjugated differently
Consider the timeline: manual testing in a staging environment, automated regression suites, Shift Left, ephemeral test environments spun up per pull request. Each one is a genuine improvement over what came before. But every single one is an expression of the same underlying activity: verification. Checking whether quality is present. The question is always “does this match what we expected?”, asked sooner or later, by a human or a machine, against a fixed set of criteria.
Shift Left is the clearest case. When I check that acceptance criteria are clear and well-formed before development starts, I am still verifying. I’ve moved the verification earlier in the process, which is useful and sometimes admirable, but the nature of what I’m doing has not changed. The mental model remains: quality is a property of the output, and my job is to confirm its presence.
I want you to try something. Describe your quality practice, the actual activities you do, the conversations you have, the artefacts you produce, without using the words: test, check, verify, validate, assert. Sit with that for a moment. The difficulty of the exercise is not a coincidence. It’s the frame showing itself. The vocabulary we have for quality work is almost entirely the vocabulary of verification, and that shapes what we can imagine the work being.
Where the verb stops working
AI agents break the verification frame, and they do it in ways that compound on each other.
First, stochastic behaviour. The same prompt, run twice, can produce structurally different output. Binary pass/fail assertions assume deterministic systems where the same input reliably produces the same result, and that assumption collapses when the system operates probabilistically.
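To make the collapse concrete: one alternative to a binary assertion is a statistical verdict, where you run the same check many times and compare a confidence interval on the pass rate against a threshold. The sketch below is illustrative, not from any particular framework; the function names, the choice of a Wilson score interval, and the 90%-reliable example check are all assumptions for the sake of the example.

```python
import math
import random

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed pass rate."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, centre - margin), min(1.0, centre + margin))

def statistical_verdict(check, runs: int, threshold: float) -> str:
    """Run a nondeterministic check many times; return pass, fail, or inconclusive."""
    successes = sum(1 for _ in range(runs) if check())
    low, high = wilson_interval(successes, runs)
    if low >= threshold:
        return "pass"          # even the pessimistic bound clears the bar
    if high < threshold:
        return "fail"          # even the optimistic bound falls short
    return "inconclusive"      # the interval straddles the threshold

# Illustrative stand-in for an agent behaviour that succeeds ~90% of the time
random.seed(0)
print(statistical_verdict(lambda: random.random() < 0.9, runs=200, threshold=0.7))
```

The point of the three-way verdict is that "inconclusive" becomes actionable: it tells you to gather more runs or tighten the system, rather than forcing a stochastic behaviour into a pass/fail box it doesn't fit.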
Second, a non-enumerable output space. Agents compose actions across reasoning steps and tool calls in ways that resist pre-specification. You cannot write assertions against a space you cannot describe in advance. The traditional testing model requires you to know what “correct” looks like before you check for it, and agents generate behaviour that sits outside any catalogue of expected outcomes.
Third, scalability inversion. Agents generate code, decisions, and artefacts faster than any human can review them. The gatekeeping model, a human or a test suite standing between the system and production, becomes a bottleneck so severe it defeats the purpose of using agents at all.
These three problems don’t sit neatly beside each other. They compound. A system that is stochastic and non-enumerable and faster than you can review is a system that verification, in any form, cannot keep pace with.
Building the evaluation in
The response to this looks nothing like “verify better” or “verify faster.” It requires different activities entirely:
- Probabilistic test semantics: pass, fail, and inconclusive as first-class outcomes, backed by statistical confidence intervals rather than binary assertions
- Multi-dimensional coverage maps that measure behavioural breadth rather than line execution
- Trajectory evaluation, where you assess the reasoning path an agent took and not just the final output it produced
- Harness-first engineering, where you build the evaluation infrastructure before the system runs
- Deterministic control planes that intercept every agent proposal before it touches the real world
These are design activities. They happen before the system runs, as part of building it. They are closer to architecture than to testing, and they require a different relationship to the system and a different understanding of what “quality work” means.
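The last item on that list, a deterministic control plane, can be sketched briefly. The idea is that every proposal an agent makes passes through a gate whose rules are ordinary deterministic code: same proposal, same verdict, every time, with an audit trail. Everything here is hypothetical for illustration: the `Proposal` shape, the rule names, and the budget figure are invented, not drawn from any real agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Proposal:
    """A single action an agent wants to take (shape is illustrative)."""
    action: str
    payload: dict

@dataclass
class ControlPlane:
    """Deterministic gate: every proposal is checked against fixed rules before execution."""
    rules: list[Callable[[Proposal], Optional[str]]] = field(default_factory=list)
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def review(self, proposal: Proposal) -> bool:
        for rule in self.rules:
            violation = rule(proposal)
            if violation:
                self.audit_log.append((proposal.action, f"blocked: {violation}"))
                return False
        self.audit_log.append((proposal.action, "approved"))
        return True

# Hypothetical rules: plain functions, so the same input always yields the same verdict
def no_destructive_ops(p: Proposal) -> Optional[str]:
    return "destructive action" if p.action in {"delete_table", "drop_env"} else None

def payload_within_budget(p: Proposal) -> Optional[str]:
    return "cost over budget" if p.payload.get("cost", 0) > 100 else None

plane = ControlPlane(rules=[no_destructive_ops, payload_within_budget])
print(plane.review(Proposal("create_branch", {"cost": 5})))   # approved
print(plane.review(Proposal("delete_table", {"cost": 5})))    # blocked
```

The design choice worth noticing is that the stochastic part (the agent) and the deterministic part (the gate) are cleanly separated: the agent can propose anything, but only proposals the fixed rules approve ever touch the real world, and the audit log records both outcomes.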
The pattern underneath the pattern
This is what frustrates me most, and I say this as someone who runs workshops on quality thinking, who speaks at conferences like Agile Testing Days and Motacon (formerly TestBash) and HUSTEF among others, who watches these conversations happen in real time across Slack channels and meetups and post-talk corridors. The profession keeps arriving at “quality is systemic” as though it’s a fresh revelation, every time a new technology makes the verification frame obviously inadequate. It happened with continuous delivery. It happened with microservices. It is happening now with AI agents.
Each time, the insight gets written up and shared, and then we collectively fail to build on it. We don’t accumulate. We rediscover. The next generation of practitioners starts from scratch, makes the same journey from “testing is checking” to “quality is designed in,” and arrives at the same destination their predecessors reached five years earlier without any awareness that the path was already walked.
That pattern, the failure to treat our own professional learning as a systemic concern, is itself a quality problem. We don’t have a new problem with AI agents. We have an old problem that agents have made impossible to avoid. The verification frame has always been too small. The difference now is that “too small” no longer means “limited.” It means “inoperable.” And we are still, in Slack channels and conference talks and earnest blog posts, rediscovering this as though nobody said it before.