MIT's NANDA report found that 95% of GenAI pilots deliver no measurable impact. METR's randomized controlled trial found that developers can't even tell when AI slows them down. Both point to the same missing discipline: evaluation.
Why evals matter more for AI
Traditional code is deterministic — a test passes or fails. AI output is probabilistic; "is it good enough?" becomes an argument unless you make it a number. Without evals, you can't tell if a prompt change helped, if a model upgrade regressed you, or if quality is drifting.
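To make "did the prompt change help?" concrete, here is a minimal sketch that reduces two eval runs to pass rates and a verdict. The function names, the `tolerance` parameter, and the sample results are illustrative assumptions, not part of any particular eval library.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of golden-set examples that passed (0.0 for an empty run)."""
    return sum(results) / len(results) if results else 0.0


def compare(baseline: list[bool], candidate: list[bool], tolerance: float = 0.0) -> dict:
    """Compare two runs over the same golden set.

    `tolerance` is an assumed noise band: deltas inside it count as "unchanged".
    """
    base = pass_rate(baseline)
    cand = pass_rate(candidate)
    delta = cand - base
    if delta > tolerance:
        verdict = "improved"
    elif delta < -tolerance:
        verdict = "regressed"
    else:
        verdict = "unchanged"
    return {"baseline": base, "candidate": cand, "delta": delta, "verdict": verdict}


# Made-up per-example pass/fail results for an old and a new prompt.
old_prompt = [True, True, False, False]
new_prompt = [True, True, True, False]
report = compare(old_prompt, new_prompt)
print(report)
```

Because probabilistic output varies run to run, a nonzero `tolerance` (or repeated runs) is usually a safer basis for a verdict than a single comparison.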
Building a harness
1. Curate a golden set of representative inputs with known-good outputs (and known-hard edge cases).
2. Pick metrics that match the task: exact match, rubric scoring, an LLM-as-judge with a clear rubric, or human review for the ambiguous slice.
3. Set a target before you build. "90% on the golden set" turns opinion into a finish line.
4. Run evals in CI so every prompt or model change is scored automatically.
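The four steps above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not a definitive harness: `fake_model` is a hypothetical stand-in for a real model call, the golden set is made up, and exact match is only one of the metrics listed in step 2.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Step 2: a simple metric, case- and whitespace-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()


def run_evals(
    model: Callable[[str], str],
    golden_set: list[Example],
    metric: Callable[[str, str], bool] = exact_match,
) -> float:
    """Score a model over the golden set and return the pass rate."""
    passed = sum(metric(model(ex.prompt), ex.expected) for ex in golden_set)
    return passed / len(golden_set)


def gate(score: float, target: float) -> int:
    """Step 4: map the score to a process exit code (0 = pass, 1 = fail)."""
    return 0 if score >= target else 1


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned answers."""
    canned = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "")


# Step 1: a tiny, made-up golden set (the last item plays the hard case).
GOLDEN_SET = [
    Example("2 + 2 =", "4"),
    Example("Capital of France?", "paris"),
    Example("Largest planet?", "Jupiter"),
]

# Step 3: the target is fixed before any prompt tuning starts.
TARGET = 0.90

score = run_evals(fake_model, GOLDEN_SET)
exit_code = gate(score, TARGET)
print(f"score={score:.2f} target={TARGET:.2f} exit_code={exit_code}")
# In a CI job, finish with sys.exit(exit_code) so a miss fails the build.
```

Swapping `fake_model` for a real client and `exact_match` for a rubric or judge function leaves the rest of the harness unchanged, which is the point of keeping the metric and the model as plain callables.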
Evals are to AI features what tests are to software. Shipping without them is shipping blind — and blind is how pilots become the 95%.
Sources
- MIT NANDA — The GenAI Divide
- METR — Developer productivity RCT
Written by ivector