The Hard Part of AI Evals Isn’t the Tooling

Additional thoughts on session three of the AI Evals and Analytics Playbook¹. The first part is here².

Every major shift in how we build and ship software has been followed by a wave of tooling that automates the tractable part and leaves the actual problem to the practitioner. Agile gave us story pointing ceremonies and JIRA boards. DevOps gave us pipelines and dashboards. Neither solved the harder question underneath, the one about what we’re building and how we’ll know when it’s good enough, but both produced thriving ecosystems of tools that made it look like we were addressing it.

AI evals is doing the same thing. Braintrust, Arize, LangSmith: they all look similar because they’re solving the same manageable problem. Running evaluations at scale is a tractable engineering challenge, and the tools handle it well. The problem they don’t solve is deciding what you’re measuring, and whether your measurement means anything.

The scaffolding problem

Session three covered the four components of a quantitative eval metric: evaluator, rubric, test set, pass/fail gate. I wrote about the rubric as a test oracle in the [main post](/output/blog/ai-evals-session-3.md). What I didn’t get into is the gap between having a rubric and having a rubric that means something for your specific product.
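To make the four components concrete, here is a minimal sketch of how they fit together. All names, the rubric check, and the threshold are illustrative assumptions, not a real eval framework's API:

```python
# Illustrative sketch of the four components of a quantitative eval metric.
# Every name and value here is a placeholder, not a real tool's API.

# 1. Test set: inputs plus the context needed to judge each output.
TEST_SET = [
    {"input": "Summarise our refund policy", "must_mention": ["14 days", "receipt"]},
    {"input": "Summarise our refund policy for gifts", "must_mention": ["gift receipt"]},
]

# 2. Rubric: what "good" means for this product, expressed as a check.
def rubric(case: dict, output: str) -> bool:
    return all(term.lower() in output.lower() for term in case["must_mention"])

# 3. Evaluator: applies the rubric across the test set and returns a score.
def evaluate(model, test_set) -> float:
    passes = sum(rubric(case, model(case["input"])) for case in test_set)
    return passes / len(test_set)

# 4. Pass/fail gate: the threshold that turns a score into a ship decision.
PASS_THRESHOLD = 0.9

def gate(score: float) -> bool:
    return score >= PASS_THRESHOLD
```

The code is trivial; the work is deciding what goes in `TEST_SET`, what `rubric` should check, and where `PASS_THRESHOLD` sits, which is exactly the part no template can supply.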

The tools offer scaffolding: rubric templates, guided setup flows, rubric generators. The scaffolding is a reasonable offer, and it has the same problem agile “best practices” have always had: it ignores the individual context of your situation. A sprint ceremony template ignores what your team actually needs to coordinate. A rubric template ignores what your users actually care about. The template gives you a starting point. Whether it fits is a judgment call, and that judgment call is the work the tool can’t do for you.

The same applies to test data. Historical data is the gold standard because it reflects what users actually do. Synthetic data is easier to generate and tends to have a different distribution from production. Most teams reach for synthetic because it’s faster. That’s the same impulse that produces evals that pass and products that fail in the wild.
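A cheap sanity check can at least flag when synthetic data has drifted from production before you trust eval numbers built on it. This sketch compares a single illustrative feature (prompt length in words); the feature and tolerance are assumptions, and a real check would look at more than one dimension:

```python
# Hedged sketch: flag synthetic test data whose distribution has drifted
# from production. Feature choice and tolerance are illustrative.
from statistics import mean, stdev

def length_profile(prompts):
    """Mean and standard deviation of prompt length, in words."""
    lengths = [len(p.split()) for p in prompts]
    return mean(lengths), stdev(lengths)

def distribution_warning(production, synthetic, tolerance=0.5):
    """True if the synthetic mean drifts more than `tolerance`
    production standard deviations from the production mean."""
    prod_mean, prod_sd = length_profile(production)
    syn_mean, _ = length_profile(synthetic)
    return abs(syn_mean - prod_mean) > tolerance * prod_sd
```

Passing a check like this doesn't make synthetic data safe; failing it tells you the eval is measuring a population your users don't belong to.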

The judge isn’t neutral

There’s something the tooling conversation mostly skips over, and it came up briefly in the session in a way that has stayed with me since.

LLM-as-a-judge, meaning using one model to evaluate the outputs of another, is a reasonable solution to a real problem. Human evaluation is slow and expensive, and at scale you need automation. The tools that have converged on this approach aren’t wrong.

What most teams using LLM-as-a-judge haven’t asked is: what was the evaluator trained on, and whose definition of good has it been optimised for?

This isn’t speculative. Researchers studying LLM evaluation have documented what they call “preference leakage,” or family bias: evaluator models systematically prefer outputs from models in the same family. A GPT-4-class judge rates GPT-4-class outputs higher. A Claude-based evaluator shows similar patterns with Claude outputs. The effect is consistent enough to have earned a nickname, “nepotism”, and it means the number your evaluation pipeline produces is partly a reflection of which company trained your judge.

The “Justice or Prejudice?” paper³ identified this as one of twelve documented biases in LLM-as-a-judge systems. The tools don’t surface it. Most practitioners never think to ask about it.
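One way to surface family bias before trusting a single judge is to score the same outputs with judges from different model families and flag large disagreement. This is a sketch of that idea, not a method from the paper; the judge callables, score scale, and threshold are all assumptions:

```python
# Hedged sketch: cross-family judging to surface preference leakage.
# `judges` maps a family name to a callable(output) -> score in [0, 1];
# the disagreement threshold is an illustrative choice.

def cross_family_check(outputs, judges, disagreement_threshold=0.15):
    """Return per-family mean scores and whether the spread between
    the most and least generous judge exceeds the threshold."""
    means = {
        family: sum(judge(o) for o in outputs) / len(outputs)
        for family, judge in judges.items()
    }
    spread = max(means.values()) - min(means.values())
    return means, spread > disagreement_threshold
```

A flagged spread doesn't tell you which judge is right, only that "the score" depends on who is scoring, which is the question the single-judge pipeline hides.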

The tools are fine

None of this means the tooling is bad. LLM-as-a-judge scales, and at the volumes involved, automated pipelines are genuinely better than manual review. The tools solve the tractable problem well.

The issue is that the tractable problem, running evaluations, is not the same as the hard problem, which is knowing whether your evaluations mean anything. The industry has built excellent tooling for the first while leaving the second mostly to chance.

I hear versions of the harder questions in workshops all the time, though rarely in the context of evals: what is the acceptance criterion, who owns it, and what does the evidence look like? The AI evals context doesn’t change the structure of the problem. It just adds a layer of tooling that makes it easier to skip past the questions and get straight to the numbers.

The numbers are easy. The questions are the work.

  1. AI eval and analytics playbook, Maven
  2. Anatomy of a metric
  3. Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge (2024)
