A model can lead a public leaderboard and fail on your most important workflow. The right benchmark is a representative sample of the decisions your product must make.

Evaluate the task, not the reputation

Build a small set of real cases, include the uncomfortable edge cases, and decide up front which errors are acceptable. Measure precision, latency, cost, and consistency, meaning whether the same input yields the same answer across repeated runs.
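As a rough illustration, a harness for such a case set might look like the sketch below. Everything in it is an assumption made for the example: the `Case` and `evaluate` names, the use of exact string match as a stand-in for correctness, and the flat `cost_per_call` figure. A real product would swap in its own scoring rule and pricing.

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str
    tag: str = "typical"  # hypothetical label, e.g. "typical" or "edge"


def evaluate(
    model: Callable[[str], str],
    cases: list[Case],
    runs: int = 3,
    cost_per_call: float = 0.0,  # assumed flat price per call
) -> dict[str, float]:
    """Run every case `runs` times and summarize four simple metrics."""
    correct = 0
    consistent = 0
    latencies: list[float] = []
    for case in cases:
        outputs = []
        for _ in range(runs):
            start = time.perf_counter()
            outputs.append(model(case.prompt))
            latencies.append(time.perf_counter() - start)
        # A case counts as correct only if every run matches exactly.
        correct += all(out == case.expected for out in outputs)
        # A case counts as consistent if all runs gave the same output.
        consistent += len(set(outputs)) == 1
    n = len(cases)
    return {
        "correct_rate": correct / n,
        "consistency": consistent / n,
        "mean_latency_s": mean(latencies),
        "total_cost": cost_per_call * n * runs,
    }


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; a naive keyword classifier."""
    return "refund" if "refund" in prompt else "other"


cases = [
    Case("I want a refund", "refund"),
    Case("Where is my order?", "other"),
    Case("Cancel and give my money back", "refund", tag="edge"),
]
report = evaluate(stub_model, cases, runs=2, cost_per_call=0.001)
```

Here the stub answers the two typical cases correctly and misses the edge case, which is the kind of gap a case set like this is meant to surface.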

Then review failures with people who understand the domain. An aggregate metric can hide exactly the error that causes the most damage.
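To make the point about aggregates concrete, here is a minimal sketch that breaks an error rate down by category. The category names and the results are invented for illustration, and `error_rate_by_category` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict


def error_rate_by_category(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each category to the fraction of its cases that failed.

    `results` holds (category, passed) pairs.
    """
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        if not passed:
            errors[category] += 1
    return {category: errors[category] / totals[category] for category in totals}


# Made-up results: 95 routine cases pass, 5 high-stakes cases all fail.
results = [("billing", True)] * 95 + [("account_deletion", False)] * 5

overall_error = sum(not passed for _, passed in results) / len(results)
by_category = error_rate_by_category(results)
```

In this invented data the overall error rate is 5%, yet every single `account_deletion` case fails. Only the per-category view, read by someone who knows what a failed deletion costs, shows that.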

Evaluation is part of the product

Every change to a prompt, model, or tool must be comparable against a baseline measured on the same cases. Without that discipline, optimizing AI becomes a collection of impressions.
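One way to sketch that comparison is a function that flags every metric on which a candidate is worse than the stored baseline. The name `find_regressions`, the metric names, the numbers, and the single absolute `tolerance` are all assumptions for this example; in practice each metric would likely need its own threshold.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    higher_is_better: dict[str, bool],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return {metric: candidate - baseline} for metrics that got worse.

    A metric is flagged when it moved in the wrong direction by more
    than `tolerance` (absolute units, same value for all metrics).
    """
    regressions: dict[str, float] = {}
    for metric, base_value in baseline.items():
        delta = candidate[metric] - base_value
        worsening = -delta if higher_is_better[metric] else delta
        if worsening > tolerance:
            regressions[metric] = delta
    return regressions


# Hypothetical numbers: the new prompt is more accurate but slower.
baseline = {"correct_rate": 0.90, "mean_latency_s": 1.2}
candidate = {"correct_rate": 0.93, "mean_latency_s": 1.8}
direction = {"correct_rate": True, "mean_latency_s": False}

regressions = find_regressions(baseline, candidate, direction, tolerance=0.1)
```

With these made-up values only `mean_latency_s` is flagged, so the trade-off between accuracy and speed becomes an explicit decision instead of an impression.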