This is part of a series where I write about what I’m learning in an AI evals and analytics course. Earlier posts cover the basics of evals, sniff tests, and quantitative evaluation with LLM-as-a-judge.
In traditional testing, the oracle is trustworthy. You know what the right answer is, or at least you know who decides. A test passes when the output matches the expected result, and the expected result is something a human wrote down, checked, and signed off on.
In AI evals, that foundation starts to shift.
The oracle was supposed to be the settled part
When you run a test suite against a deterministic system, the oracle is the expected output, and the expected output doesn’t have opinions. It doesn’t prefer outputs from certain vendors. It doesn’t drift across repeated evaluations or respond differently depending on how a question is phrased. It just checks.
The moment you introduce an LLM as your evaluator, you’ve replaced that stable oracle with something probabilistic. The evaluator model has been trained on data, shaped by its developers, and it brings tendencies into every judgment it makes. One well-documented tendency: it prefers outputs from models in its own family. An OpenAI evaluator will tend to score OpenAI-generated responses more favourably. This isn’t a bug someone introduced; it’s a property of how these models are trained. The term circulating for this is “family bias,” and it means your eval scores are partly a measure of how much the evaluator likes the evaluated model’s heritage.
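One way to make family bias visible is to score paired responses to the same prompts, then compare the mean score the judge gives its own family against everyone else. A minimal sketch, with hypothetical scores; note that unless the paired responses are genuinely matched in quality, this gap confounds real quality differences with bias:

```python
from statistics import mean

def family_bias_gap(records, judge_family):
    """Mean score the judge gives its own family minus the mean it
    gives everyone else, on paired responses to the same prompts.
    A persistently positive gap is a hint of family bias."""
    own = [s for fam, s in records if fam == judge_family]
    other = [s for fam, s in records if fam != judge_family]
    if not own or not other:
        raise ValueError("need scores for both the judge's family and others")
    return mean(own) - mean(other)

# Hypothetical scores (family, score out of 5) from a GPT-family judge:
scores = [("gpt", 4.4), ("claude", 4.1), ("gpt", 4.6), ("claude", 4.0)]
print(family_bias_gap(scores, "gpt"))  # a positive gap is worth investigating
```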
The rubric is a prompt, which means it’s also uncertain
The LLM-as-a-judge setup works like this: you give the evaluator a query, the AI product’s response, and a rubric. The rubric defines what a good answer looks like. The evaluator scores the response against that rubric. Those scores become your metrics, and your metrics drive decisions about whether to release.
The rubric is where the complexity lands. Writing a good eval rubric is, as the course I’m working through made clear, essentially prompt engineering. You’re writing instructions to a model that will follow them probabilistically, and like all prompts, the rubric shapes the output in ways that aren’t fully predictable. Ask the evaluator to score for “helpfulness” without defining what helpfulness means in your specific context, and you’ll get scores that reflect the evaluator’s interpretation of helpfulness, which may or may not match your users’. Add a “reasoning” field, and the scores become more stable, because having to explain each judgment forces a kind of consistency. But you’re still writing prompts and hoping they land the way you intended.
There are approaches designed to reduce this. One, called AdaRubric, generates evaluation dimensions dynamically for each task rather than asking you to define them upfront, on the argument that evaluation criteria should be a function of the task rather than fixed properties of the evaluator. I haven’t tested it, but if writing the rubric is already prompt engineering, anything that reduces how much of that engineering you do by hand is worth knowing about. Read: Stop manually defining your evaluation criteria and rubrics
The course notes this plainly: writing the evaluation prompt is prompt engineering. Which means the same uncertainty that surrounds your AI product’s outputs surrounds your evaluator’s outputs too.
Instacart uses production experiments to tune the evaluator, not just test the product
This is where one example from the course stopped me.
Instacart uses a multimodal LLM judge to evaluate product replacement recommendations. They run this judge in parallel with production experiments before releases, and what they’ve found is that these experiments are useful not just for assessing whether the product is good, but for calibrating the evaluator itself. They use production data to fine-tune the LLM-as-a-judge.
Which means the experiment isn’t the final gate before release. It’s the mechanism that tells you whether your gate was measuring the right things.
Sit with that for a moment. The evaluator you’ve been using to decide whether the product is ready for users turns out to need calibration against real user behaviour to work properly. Your pre-release quality signal depends on a tool that you can only fully validate in production.
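The mechanics of that calibration can be sketched simply. This is my own illustration of the idea, not Instacart's pipeline: pair each judge score with what the user actually did in production, measure agreement, and keep the disagreements as the raw material for re-tuning the judge.

```python
def calibration_report(pairs, threshold=4):
    """pairs: (judge_score, user_accepted) per production example.
    Treats judge_score >= threshold as the judge calling the output
    'good' and measures agreement with real user behaviour.
    Disagreements are candidate fine-tuning data for the judge."""
    agree = 0
    disagreements = []
    for i, (score, accepted) in enumerate(pairs):
        judged_good = score >= threshold
        if judged_good == accepted:
            agree += 1
        else:
            disagreements.append(i)
    return agree / len(pairs), disagreements

# Hypothetical run: the judge loved example 2, users rejected it.
rate, to_review = calibration_report([(5, True), (2, False), (5, False), (1, True)])
```

A low agreement rate here is a statement about the evaluator, not the product, which is the whole point of running the judge alongside the experiment.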
Your pre-release scores tell you something, just less than they appear to

So what are your pre-release eval scores actually telling you?
Not nothing. A low score on a well-designed metric is still a signal worth acting on. A rubric built carefully by the right people, tested against a curated test set, and reviewed for drift is better than guessing.
But the scores are incomplete. They’re telling you how your AI product performs against an evaluator’s interpretation of a rubric, tested on a dataset that was built before any real users arrived. The distribution of real user queries will differ from your test set. The evaluator’s calibration will drift. The only way to find out whether the evaluator’s judgment aligned with actual user experience is to run production experiments and check.
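One cheap way to catch evaluator drift is to re-score a fixed anchor set of outputs on every run: the outputs never change, so any movement in the scores is the evaluator moving. A minimal sketch, with a hypothetical tolerance:

```python
def score_drift(anchor_runs, tolerance=0.3):
    """anchor_runs: mean score per run on the SAME frozen anchor set
    of outputs. Returns the indices of runs whose mean moved more
    than `tolerance` from the first run. Since the outputs are
    fixed, any flagged shift is in the evaluator, not the product."""
    baseline = anchor_runs[0]
    return [i for i, m in enumerate(anchor_runs[1:], start=1)
            if abs(m - baseline) > tolerance]

# Hypothetical means across three eval runs on the frozen set:
drifted = score_drift([4.2, 4.1, 3.7])  # run 2 has moved
```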
The framing the course eventually lands on is that AI evals are a feedback loop rather than a gate. You evaluate before release to catch obvious failures and build evidence for go/no-go decisions. You evaluate in production to find out what the pre-release evals missed, and to calibrate the tools you’ll use for the next round. Neither phase is optional, and neither gives you certainty.
I’ve just finished the course, and I loved it. It has left me with a framework, and with the insight that which parts of that framework I lean on will produce different outcomes; that’s fine, because different product contexts call for different trade-offs. It has also left me with a different relationship with the scores themselves. Evals are the single most underused tool in AI-enabled application testing right now, and nothing I’ve learned here has changed that view; it has made me want to hold the scores with more care, not with less confidence in the practice. A passing eval suite for a deterministic system means something clean. A passing eval suite for an AI product means: the evaluator, given this rubric, scored these outputs above the threshold we set, against this test set, as of this run. That’s worth knowing. It’s just not the same thing.
The meta-loop, evaluating the evaluator, isn’t a flaw in the process. It is the process.