The AI evals field chose a flawed tool and stuck with it

Session one left me with two things I hadn’t resolved.[1] The first was a line the instructor said almost in passing: “the hard part is scalability, not automation.” I wrote it down because it piqued something, but I couldn’t quite work out what problem it was pointing at. The second was a question I kept turning over: if LLM-as-a-judge carries the documented flaws session one described (evaluator models preferring outputs from their own family, scores drifting without careful rubric design), why is it the dominant approach in every eval tool I looked at after that session?

Session two addressed both, one more thoroughly than the other.

You have done this before

The first substantive topic in session two was the sniff test: the recommended first step before any formal evaluation, designed to screen for obvious problems quickly before committing to the heavier quantitative work that follows.

The way the course described it was immediately familiar. You identify your most critical user paths by frequency multiplied by impact, run a timeboxed exercise that stays narrow rather than trying to cover everything, and define clear pass/fail gates with a particular focus on what must not fail. Those are your showstoppers. If the product violates any of them, you stop and fix before going further.
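To make the shape concrete, here is a minimal sketch of that prioritisation and gating in Python. The frequency-times-impact heuristic is the course’s; the paths, numbers, and function names are my own invented illustration, not anything the course supplied:

```python
from dataclasses import dataclass

@dataclass
class UserPath:
    name: str
    frequency: int     # rough relative frequency of use
    impact: int        # 1-5 rating of how bad a failure would be
    showstopper: bool  # must never fail, regardless of other scores

    @property
    def priority(self) -> int:
        # The course's heuristic: critical paths = frequency x impact
        return self.frequency * self.impact

# Hypothetical paths for an imagined AI support assistant
paths = [
    UserPath("answer billing question", frequency=500, impact=3, showstopper=False),
    UserPath("refuse to give legal advice", frequency=20, impact=5, showstopper=True),
    UserPath("summarise ticket history", frequency=200, impact=2, showstopper=False),
]

# Timebox the sniff test: take only the top few paths, not everything
for path in sorted(paths, key=lambda p: p.priority, reverse=True)[:3]:
    print(path.name, path.priority)

def go_no_go(results: dict[str, bool]) -> bool:
    # If any showstopper fails, stop and fix before going further
    return all(results.get(p.name, False) for p in paths if p.showstopper)
```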

This is exploratory testing with a specific scope and a go/no-go gate at the end. If you have run smoke tests before a release, you have done a version of this. The course gave it a new name and positioned it at the start of a formal evaluation process, but the underlying shape is the same. The goal is to “fail early and de-risk,” which is also just how most experienced testers already think about where to put effort first.

The overwhelm I had felt about where to start with AI evals shifted when I recognised that pattern. The terminology around AI evals can make the discipline sound like a new specialism requiring new skills from the ground up. Session two suggested that the entry point, at least, is closer to existing testing knowledge than the framing implies.

The course also makes clear that if you are involved in planning an AI product early enough, you may not need a sniff test at all. The eval framework can drive the product design rather than trailing behind it. The sniff test is what you do when you are handed something that already exists and asked whether it is ready. That is, unfortunately, probably how most testers will first encounter AI evals.

Most AI initiatives fail before the first eval runs

Session two opened with a claim that grounded everything that followed: many AI initiatives fail because nobody wrote down what the product was supposed to do, regardless of whether the underlying model was suitable for the task. The 70-to-90% failure-to-scale figure from session one reappeared here, and this time the explanation was a missing product requirements document. Many AI products skip the requirements step entirely and go straight to building.

The evals conversation, if you have it early enough, forces the requirements conversation: what are the critical scenarios, what must the product never do, what does “working” look like and against what baseline. These are the questions you have to answer before you can design a meaningful eval, and they are also the questions a product team has to answer before they can build something coherent.
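Here is one hypothetical shape a requirements-driven eval spec could take, with fields that mirror those questions. The product, scenarios, threshold, and baseline are all invented for illustration; the course did not prescribe a format:

```python
# A sketch of an eval spec whose fields mirror the requirements questions:
# critical scenarios, hard "never" constraints, and a definition of
# "working" measured against a named baseline.
eval_spec = {
    "product": "support-assistant",          # invented example
    "critical_scenarios": [
        "customer asks for a refund",
        "customer reports a security issue",
    ],
    "must_never": [
        "quote a price that does not exist",
        "share another customer's data",
    ],
    "definition_of_working": {
        "metric": "task_completion_rate",
        "threshold": 0.90,                   # placeholder, not a recommendation
        "baseline": "current human-agent workflow",
    },
}
```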

This connects to another thread in the session: who owns the eval, and who is on the eval team. The course’s answer is that AI evals are not a single person’s job, and assembling the right team turns out to be harder than it sounds. You need product, UX, engineering, and subject matter experts. The SMEs are consistently the hardest to find, particularly in regulated domains where the definition of “correct” or “safe” requires domain knowledge that most engineering teams do not have in-house.

For testers who have spent time trying to get access to the right people in order to do meaningful work, this will sound familiar. The eval team composition problem is the requirements access problem, in a new setting.

The field picked LLM-as-a-judge knowing it’s flawed

Session one documented the problems with LLM-as-a-judge clearly. “Preference leakage” is the documented tendency for evaluator models to favour outputs from their own model family. Rubric quality matters enormously, and scores drift when rubrics are loosely defined. If the judge is unreliable, the metrics you derive from it are built on unstable ground.

I left that session expecting session two to complicate the picture further or propose an alternative. Instead, session two confirmed that LLM-as-a-judge is the approach the field has converged on, across Braintrust, Arize, LangSmith, and the other tools that all ended up looking similar to each other.

The reason, once stated, is obvious: human evaluation at scale is harder. The automation itself is not the hard part; maintaining meaningful evaluation signal as volume grows is. LLM-as-a-judge at least gives you something that runs quickly and cheaply at scale, even if it introduces its own problems. The note from session one that had confused me, “the hard part is scalability, not automation,” made sense once session two explained why.

The way the field handles this is designing around the bias rather than solving it: using multiple evaluators from different model families, and asking for a “reason” alongside each score to catch drift and anchor the model’s judgement. It is an imperfect system the field has learned to manage carefully.
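A minimal sketch of that mitigation, assuming a placeholder `judge` function rather than any real tool’s API. The model names and rubric are invented; the point is the shape the session described: several judges from different model families, each returning a reason alongside its score:

```python
import statistics

RUBRIC = (
    "Score the answer from 1 to 5 for factual accuracy. "
    'Return JSON: {"score": <int>, "reason": "<one sentence>"}'
)

def judge(model: str, question: str, answer: str) -> dict:
    # Placeholder: replace with a real call to your model provider.
    # The returned shape matches what the rubric asks the judge to produce.
    return {"score": 3, "reason": f"placeholder verdict from {model}"}

def evaluate(question: str, answer: str) -> dict:
    # Judges drawn from different families, to dilute preference leakage
    judges = ["family-a-model", "family-b-model", "family-c-model"]  # invented names
    verdicts = [judge(m, question, answer) for m in judges]
    return {
        # Aggregate across families rather than trusting any single judge
        "score": statistics.median(v["score"] for v in verdicts),
        # Keep the reasons: they are what you read to catch rubric drift
        "reasons": [v["reason"] for v in verdicts],
    }
```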

That is a more honest answer than I expected, and probably a more practically useful one.

Session two closed with an introduction to quantitative evals, the more systematic measurement work that follows a clean sniff test. I will write that up next. What I am still working out is how to set go/no-go thresholds with any rigour: when a product scores 78% on a given metric, what tells you whether that is good enough to ship?


  1. This is the second post in a series. The first, What I don’t understand about AI evals (yet), covers the foundations from session one: why great models don’t automatically make great products, why benchmark scores are less reliable than they look, and the first questions I couldn’t quite answer about LLM-as-a-judge. ↩︎
