Session two closed with a question I couldn’t answer.¹ When a product scores 78% on a given metric, what tells you whether that’s good enough to ship? I flagged it as something session three would probably address, and it did, though not in the way I expected. The question can’t be meaningfully answered until you’ve defined what “good” means, and defining what “good” means is something testers have been doing for their entire careers. They’ve just been calling it something else.
Session three broke a quantitative eval metric down into four components: evaluator, rubric, test set, and pass/fail gate. Once I saw that structure laid out, it was hard to see it as anything other than familiar. It’s asking: who judges, what they’re judging against, what cases they’re working through, and what counts as passing. Anyone who has designed a test suite has operated inside that same shape, even if they’ve never used the word “metric” to describe it.
The session was careful to separate these components because the temptation, when building an eval, is to treat the whole thing as one problem and reach for a tool that handles it automatically. The components pull in different directions and each one needs its own decision. That’s worth taking seriously, because the failure mode the course keeps returning to is teams that skip the design work and go straight to the output.
The rubric is a test oracle
The rubric is the component I found most interesting, partly because of how familiar it felt and partly because of one specific recommendation the session made.
A rubric tells the evaluator, whether that’s a human rater or a model, what “good” looks like for a given output. It sets out the criteria being assessed, the scale scores should fall on, and the format the evaluation should take. The goal is to reduce rater drift: evaluators making slightly different judgements each time they score, or two evaluators scoring the same output differently because they’ve internalised different definitions of quality. A rubric aligns those judgements by making the definition explicit.
That is a test oracle. An oracle in software testing is a mechanism for deciding whether the actual output of a system matches the expected output: the thing that lets you say “this passed” or “this failed” rather than just “this ran.” Testers who have worked in domains where correct answers aren’t self-evident (medical software, financial systems, complex business logic) have spent considerable effort defining oracles that can carry any weight. The rubric in an AI eval is the same object, in a new context.
The specific recommendation that stuck with me was this: always ask the evaluator for a “reason” alongside the score. For a human rater that surfaces the thinking behind the number, and for a model evaluator it anchors the model’s judgement and reduces the drift that comes from treating scoring as a pure classification task. The reason field also gives you something to read when a score looks wrong, because you can see what the evaluator was responding to rather than just seeing a number you can’t interrogate.
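To make the shape of this concrete, here is a minimal sketch of a rubric wired into an LLM-as-a-judge prompt. The criterion, scale, field names, and prompt wording are all my own illustration, not the course’s; the only structural choice taken from the session is asking for a reason alongside the score:

```python
import json

# Illustrative rubric: one criterion, an explicit scale, and a required
# output format that includes the "reason" field the session recommends.
rubric = {
    "criterion": "factual accuracy",
    "definition": "Every claim in the answer is supported by the provided source text.",
    "scale": {
        1: "Multiple unsupported or contradicted claims.",
        3: "Mostly supported; one minor unsupported claim.",
        5: "Every claim is traceable to the source.",
    },
    "output_format": {"reason": "one or two sentences", "score": "integer 1-5"},
}

def judge_prompt(output_to_score: str) -> str:
    """Build the evaluator prompt. Asking for the reason alongside the score
    anchors the judgement rather than treating scoring as pure classification."""
    return (
        "You are scoring a model output against this rubric:\n"
        f"{json.dumps(rubric, indent=2)}\n\n"
        f"Output to score:\n{output_to_score}\n\n"
        'Respond as JSON: {"reason": "...", "score": <1-5>}'
    )
```

Putting the reason before the score in the requested format is a deliberate ordering: the evaluator commits to its reasoning first, then the number follows from it.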
Writing a good rubric is essentially prompt engineering, the session noted, which is another way of saying that the quality of the eval depends heavily on how precisely you’ve defined what you’re measuring. This is not a new lesson for anyone who has tried to write a useful acceptance criterion.
The test set is your test suite
Three options for building a test set came up in the session: open source benchmarks, synthetic data, and curated sets built from historical data.
Open source benchmarks are easy to obtain, but the course has been consistent across sessions about their limitations. Static benchmarks can be contaminated: models train on the test data before being evaluated on it, which means a high score on a public benchmark tells you something about training, not necessarily about performance on your problem. The same logic that makes leaderboard scores unreliable in session one applies here.
Synthetic data is relatively easy to generate but carries a different risk: the distribution of generated scenarios may not match what you’ll actually see in production. Generating test cases from a language model produces a distribution shaped by that model’s priors, not by your users’ behaviour. The session’s point was that representative and exhaustive are not the same thing, and synthetic data can achieve the second while missing the first entirely.
Historical data (transcripts, usage logs, real interactions from production or a legacy system) is the preferred starting point, and the reasoning is exactly what you’d expect from a testing background. Sampled historical data is representative of what the system will face in production. It reflects the actual distribution of user behaviour, including the edge cases that weren’t anticipated during design and the guardrail violations that nobody thought to include in a scenario library.
The course calls a historical set that hasn’t been reviewed yet a “silver set,” and a silver set that has been validated by a subject matter expert a “golden set.” The distinction matters because a golden set is the thing you can make release decisions against. Getting there requires SME time, which circles back to the session two point about eval teams: the SMEs are the hardest people to find and the most critical to have.
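The silver/golden distinction is easy to represent as data. This sketch is mine, not the course’s: the field names are invented, and the only idea carried over is that a case stays silver until a subject matter expert has validated it:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One test case sampled from historical data. Illustrative fields only."""
    input_text: str
    expected_behaviour: str
    sme_validated: bool = False  # silver until an SME signs off

def golden_subset(cases: list[EvalCase]) -> list[EvalCase]:
    """Only SME-validated cases are fit to make release decisions against."""
    return [c for c in cases if c.sme_validated]

cases = [
    EvalCase("refund request for a cancelled order",
             "offer the refund policy link", sme_validated=True),
    EvalCase("angry message containing profanity",
             "de-escalate, no profanity in the reply"),
]
# One case here is golden; the second stays silver until it is reviewed.
```

The point of modelling it this way is that the promotion from silver to golden is an explicit, recorded act, not an assumption, which is exactly where the SME time goes.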
Managing the test set properly turns out to be its own concern. The session was direct about this: the test set is a valuable asset, not a byproduct. It should be version-controlled, kept private, and never overfit against by tuning the product on the same cases you’re evaluating on. These are the same principles that govern any meaningful regression suite, applied to a different artefact.
The threshold question, and why nobody can answer it for you
By the time the session reached evaluation-driven development, I had enough context to understand the 78% question more clearly.
Evaluation-driven development treats every update to an AI product the same way a disciplined team treats any code change: the update should make the product measurably better, not just different, and the release decision should rest on evidence rather than intuition. The quantitative eval metric exists to produce that evidence. The CI/CD integration exists to ensure that hard gates, the things that must never fail, always get checked.
But the threshold for a release decision, the number at which you say “this is good enough to ship,” is risk-based. A product that gives medical guidance operates under different constraints than a product that recommends playlists. A metric that measures safety violations needs a near-zero threshold; a metric that measures tone quality has more room. The right threshold for a given metric in a given context is a product of the rubric, the use case, and the organisation’s risk tolerance.
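One way to see why no single number answers the 78% question is to sketch what a risk-based gate actually looks like. The metric names, thresholds, and the hard/soft split below are all invented for illustration; the shape is the point:

```python
# Illustrative release gates: each metric gets its own threshold, and "hard"
# gates block release outright while soft gates only warn. All values invented.
GATES = {
    "safety_violation_rate": {"max": 0.0, "hard": True},   # must never fail
    "factual_accuracy":      {"min": 0.90, "hard": True},
    "tone_quality":          {"min": 0.75, "hard": False},  # flag, don't block
}

def release_decision(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (ship?, list of failed metrics), hard failures first."""
    blockers, warnings = [], []
    for metric, gate in GATES.items():
        value = scores[metric]
        ok = value <= gate["max"] if "max" in gate else value >= gate["min"]
        if not ok:
            (blockers if gate["hard"] else warnings).append(metric)
    return (not blockers, blockers + warnings)
```

Under these invented gates, a 78% on tone quality ships and the same 78% on factual accuracy does not, which is the whole argument in miniature: the number means nothing until the gate it’s measured against has been chosen.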
This is how testing already works in any complex system. Pass/fail at the unit level is deterministic; at the system level, it involves judgement calls about what matters and how much. Testers who have worked with risk-based testing frameworks have already developed the instincts that evaluation-driven development is asking for. What’s new is the artefact being evaluated, not the way of thinking about it.
The honest answer to “is 78% good enough?” is: it depends on what your rubric was measuring, whether your test set was representative, and what the product is doing in the world. No figure is meaningful outside that context, which means the first job is writing a rubric careful enough that the threshold is actually worth choosing.
Try this
Take one AI tool you work with or have observed in action (a code assistant, a chatbot, an automated test generator). Pick one thing it’s supposed to do well. Write a rubric for evaluating that capability: the criteria you’d assess, the scale you’d score on, the format the evaluation should take, and the “reason” prompt you’d give the evaluator. Don’t try to cover everything; pick one dimension and define it clearly. Cap it at 20 minutes.
The goal is noticing how much you already know about what “good” looks like, and how different that feels from having no definition at all.
—
- This is the third post in a series. The first, What I don’t understand about AI evals (yet), covers the foundations: why great models don’t make great products, why benchmark scores mislead, and the first questions I couldn’t answer about LLM-as-a-judge. The second, The AI evals field chose a flawed tool and stuck with it, covers the sniff test, how eval teams are assembled, and why LLM-as-a-judge remains the dominant approach despite its documented problems. The course is the AI Evals and Analytics Playbook, taught by Stella Liu and Amy Chen on Maven.

