What I don’t understand about AI evals (yet)

If you’ve been following me on LinkedIn, or indeed have been reading my blog for a while, you’ll know that I’m interested in AI applications. So it probably won’t come as a total surprise that my main learning goals this year revolve around AI evals: “systematic tests and measurements that assess an AI system’s performance, quality, safety, and reliability against defined criteria or benchmarks”, to quote the course I started this past weekend.

The instructor of the AI evals and analytics course¹ opened with a set of figures I was not prepared for: 95% of AI initiatives produce zero measurable results, 70 to 90% fail to scale into recurring operations, 42% of organisations have already cut most of their AI programmes, and just 10 to 15% of AI pilots ever successfully scale to production. The 2025 McKinsey State of AI² report reaches similar conclusions: of the organisations that report any positive financial impact from AI, most say it accounts for less than 5% of their total earnings, and only around 6% report a meaningful return.

I spend a lot of my time helping QA teams get better results from AI tools, and what I see again and again is teams adopting a tool without stopping to define what “working” would look like. The outputs are noisy, the correction loop eats more time than the tool was supposed to save, and eventually the conclusion is that AI is the problem. Session one of this course made me wonder whether the problem runs deeper than how teams use the tools. What if nobody in those companies had defined what success looked like before they started?

Great model, broken product

The first thing that settled clearly in session one was a distinction that sounds obvious when you say it out loud and yet goes unspoken in most conversations about AI: a great model does not make a great product. A state-of-the-art model plugged into the wrong integration, with the wrong prompts and no defined success criteria, will fail in production, and no benchmark score will warn you it is about to happen.

QA professionals already know this logic from software, where a codebase that passes all its automated tests is not necessarily a product that users are happy with, because tests can only verify what they were designed to verify. Passing tests are not quality; they are evidence, and evidence is only as good as the questions you asked. I had just never thought to apply that frame to AI integrations, and it turns out most of the teams building those integrations have not either.

Don’t trust the leaderboard

Benchmark datasets are static, and models can be trained on the test data before they are evaluated on it, which means a leaderboard score can be as much a product of test contamination as it is of genuine capability. Different leaderboards reach different, often contradictory conclusions about which model performs best, so there is no neutral ground to stand on. I have seen the same dynamic in software testing, where 100% code coverage gets treated as a signal of quality when it is really a signal that someone wrote tests for everything, not that those tests mean anything. The metric tells you something, but not what you think it tells you.

The thing I wrote down and do not yet understand

Most AI evals tools are built around a concept called LLM-as-a-judge, which means using one AI model to evaluate the outputs of another. The appeal is obvious: if evaluation is the bottleneck, let the machine do it. The tooling has gone heavily in this direction, with Braintrust, Arize, and LangSmith all converging on essentially the same approach.
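To make the concept concrete, here is a minimal sketch of the LLM-as-a-judge pattern. The rubric prompt and the `call_model` function are invented for illustration, and the model call is stubbed out with a toy word-overlap heuristic so the example runs offline; in a real pipeline it would be an API call to an actual model, as in the tools named above.

```python
# LLM-as-a-judge, sketched: one "model" scores another model's output
# against a rubric. JUDGE_PROMPT and call_model are made up for this
# demo; call_model is a toy heuristic standing in for a real LLM call.

JUDGE_PROMPT = """You are an evaluator. Score the answer from 1 to 5 for
faithfulness to the question, and reply with just the number.

Question: {question}
Answer: {answer}
Score:"""

def call_model(prompt: str) -> str:
    """Stand-in for a real LLM API call. Toy heuristic: answers that
    share more words with the question get a higher score."""
    # Parse the filled-in template back out (only needed in this offline demo).
    question = prompt.split("Question: ")[1].split("\n")[0]
    answer = prompt.split("Answer: ")[1].split("\n")[0]
    overlap = len(set(question.lower().split()) & set(answer.lower().split()))
    return str(min(5, 1 + overlap))

def judge(question: str, answer: str) -> int:
    """Ask the judge model for a 1-to-5 score and parse the reply."""
    raw = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(raw.strip())

score = judge("What does RAG stand for?",
              "RAG stands for Retrieval Augmented Generation.")
print(score)
```

The structure is the interesting part, not the heuristic: the evaluated output goes into a prompt, a second model returns a verdict, and that verdict becomes your metric, which is exactly why the question of who judges the judge matters.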

But the instructor’s point was that automation is not the hard part. Scalability is. I wrote that down because it piqued my curiosity, and I am not yet sure what it means. My instinct is that it points to something about maintaining meaningful signal at volume rather than just running more evaluations, but I am going to sit with that until session two tonight and see if I am right.

There is a related problem I am rather concerned about. If you use a model to judge another model’s outputs, and that evaluator tends to prefer outputs from models in the same family, a bias documented in a recent paper as “preference leakage”³, then who judges the judge? Our instructor Stella likened it to a Russian doll, and I think that’s a great analogy.

I also did not know what RAG stood for in this context, so I looked it up. RAG is Retrieval Augmented Generation, a system architecture that searches an indexed knowledge base at query time and pulls in only the most relevant passages, rather than loading everything into the context window at once.

My first instinct was that this sounded a lot like Claude Projects, where you add documents and instructions that Claude can draw on, but they work differently. A Claude Project loads everything you have added into the context for every message, whether or not it is relevant to that particular exchange, which is closer to stuffing the context window than retrieval. A true RAG system retrieves on demand: the query triggers a search, and only the matching chunks come in. The practical difference is that RAG can scale to much larger document sets but risks missing relevant passages, while a project gives the model full visibility over everything you have added, limited only by the size of the context window.

Reading Stella Liu’s piece on why and what to evaluate⁴ made the implication clear: even a frontier model can fail in a RAG application if the knowledge base it is searching is outdated or biased, which circles back to the same point the session opened with. A great model does not automatically make a great product.
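The retrieval-versus-stuffing distinction fits in a few lines of Python. This is a sketch under stated assumptions: the documents are invented, and plain word overlap stands in for the embedding or BM25 search a real RAG system would use.

```python
# Retrieval vs. context stuffing, sketched. The docs are made up, and
# word-overlap scoring is a toy stand-in for a real vector/BM25 search.

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The API rate limit is 100 requests per minute per key.",
    "Offices are closed on public holidays in the Netherlands.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Score each chunk by word overlap with the query, return the top k."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

# Context stuffing (Claude Projects style): everything, every message.
stuffed_context = "\n".join(docs)

# RAG style: a search runs first, and only the matching chunks go in.
rag_context = "\n".join(retrieve("what is the api rate limit", docs))

print(rag_context)
```

The failure mode from the paragraph above lives entirely inside `retrieve`: if the knowledge base is outdated or the scoring misses the relevant chunk, the model never sees the right passage, no matter how capable it is.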

Three sessions left. I will write up what I learn, and what still confuses me, after each one. If you work in QA and have been wondering whether AI evals are relevant to your world, follow along.

  1. AI Evals and Analytics Playbook, taught by Stella Liu and Amy Chen on Maven.
  2. 2025 McKinsey State of AI report.
  3. Li et al., 2025.
  4. Why and what to evaluate, Stella Liu.
