Evaluating LLMs beyond the benchmark

A model can lead a public leaderboard and fail on your most important workflow. The right benchmark is a representative sample of the decisions your product must make.

Evaluate the task, not the reputation

Build a small set of real cases, include the uncomfortable edge cases and define which errors are acceptable. Measure precision, latency, cost and consistency.
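As a minimal sketch of what such a case set and its measurements might look like, the harness below runs each case several times against a model callable. Everything in it is an assumption made for illustration: the `Case` and `Report` types, the `evaluate` function, exact string match as the correctness check, and a flat `cost_per_call`. A real task will usually need its own scoring rule and pricing model.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str
    tag: str = "typical"  # hypothetical label, e.g. "typical" or "edge"


@dataclass
class Report:
    accuracy: float        # share of individual runs matching the expected answer
    mean_latency_s: float  # average wall-clock seconds per call
    consistency: float     # share of cases where every repeat gave the same output
    total_cost: float      # calls * cost_per_call (flat pricing is an assumption)


def evaluate(model: Callable[[str], str], cases: list[Case],
             repeats: int = 3, cost_per_call: float = 0.0) -> Report:
    if not cases or repeats < 1:
        raise ValueError("need at least one case and one repeat")
    hits, latencies, stable = [], [], []
    for case in cases:
        outputs = []
        for _ in range(repeats):
            start = time.perf_counter()
            outputs.append(model(case.prompt).strip())
            latencies.append(time.perf_counter() - start)
        # Exact match is a placeholder scoring rule, not a recommendation.
        hits.extend(out == case.expected for out in outputs)
        stable.append(len(set(outputs)) == 1)
    calls = len(cases) * repeats
    return Report(
        accuracy=sum(hits) / calls,
        mean_latency_s=sum(latencies) / calls,
        consistency=sum(stable) / len(cases),
        total_cost=calls * cost_per_call,
    )
```

Repeating each case is what makes consistency measurable: a model that is right two times out of three on the same input looks fine in a single pass.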

Then review failures with people who understand the domain. An aggregate metric can hide exactly the error that causes the most damage.
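To illustrate how an aggregate can hide the damaging error, the sketch below breaks a pass rate down by case tag. The tags, the `accuracy_by_tag` helper and the result data are all invented for the example.

```python
from collections import defaultdict


def accuracy_by_tag(results):
    """Return the pass rate per tag from (tag, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for tag, passed in results:
        totals[tag][0] += int(passed)
        totals[tag][1] += 1
    return {tag: p / n for tag, (p, n) in totals.items()}


# Invented data: 18 routine cases pass, both high-stakes cases fail.
results = [("typical", True)] * 18 + [("refund_dispute", False)] * 2

overall = sum(passed for _, passed in results) / len(results)
by_tag = accuracy_by_tag(results)
```

Here `overall` is 0.9, which reads as a healthy score, while `by_tag["refund_dispute"]` is 0.0. The breakdown only tells a reviewer where to look; deciding whether those failures matter still takes someone who knows the domain.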

Evaluation is part of the product

Every change in prompt, model or tool must be comparable against a baseline. Without that discipline, optimizing AI becomes a collection of impressions.
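One way to make a change comparable, sketched under assumptions: store a pass/fail result per case ID for the baseline, rerun the same cases after the change, and diff the two. The `compare_to_baseline` function and the case IDs are hypothetical.

```python
def compare_to_baseline(baseline: dict[str, bool],
                        candidate: dict[str, bool]) -> dict:
    """Compare per-case pass/fail results on the cases both runs share."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("baseline and candidate share no cases")
    return {
        "baseline_pass_rate": sum(baseline[c] for c in shared) / len(shared),
        "candidate_pass_rate": sum(candidate[c] for c in shared) / len(shared),
        # Cases that passed before the change and fail after it.
        "regressions": [c for c in shared if baseline[c] and not candidate[c]],
        # Cases that failed before the change and pass after it.
        "fixes": [c for c in shared if not baseline[c] and candidate[c]],
    }


# Invented results for three cases, before and after a prompt change.
before = {"case_a": True, "case_b": True, "case_c": False}
after = {"case_a": True, "case_b": False, "case_c": True}
diff = compare_to_baseline(before, after)
```

In this example both pass rates are identical, yet `diff["regressions"]` is `["case_b"]` and `diff["fixes"]` is `["case_c"]`. Listing regressions case by case, rather than comparing two averages, is the design choice that keeps a change from trading one failure for another unnoticed.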

Steven Vallejo
