If you can’t measure it, you can’t maintain it: evals for AI features

MIT's NANDA report found that 95% of GenAI pilots deliver no measurable impact. METR's randomized controlled trial found that experienced developers can't even tell when AI slows them down. Both point to the same missing discipline: evaluation.

Why evals matter more for AI

Traditional code is deterministic — a test passes or fails. AI output is probabilistic; "is it good enough?" becomes an argument unless you make it a number. Without evals, you can't tell if a prompt change helped, if a model upgrade regressed you, or if quality is drifting.
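
To make "did the prompt change help?" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the three cases are made up, and `make_fake_model` is a hypothetical stand-in for a real model call that answers correctly with a chosen probability. Because output is probabilistic, each case is run several times and the result is a pass rate rather than a single pass or fail.

```python
import random
from typing import Callable

Case = tuple[str, str]  # (input, expected output)

# Illustrative cases only; a real set would come from your own feature.
CASES: list[Case] = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("opposite of hot", "cold"),
]


def make_fake_model(accuracy: float, seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: right answer with probability `accuracy`."""
    rng = random.Random(seed)
    answers = dict(CASES)

    def generate(prompt: str) -> str:
        return answers[prompt] if rng.random() < accuracy else "not sure"

    return generate


def pass_rate(generate: Callable[[str], str], cases: list[Case], trials: int = 5) -> float:
    """Fraction of (case, trial) pairs whose output matches the expected answer."""
    passed = 0
    for prompt, expected in cases:
        for _ in range(trials):
            if generate(prompt).strip().lower() == expected.strip().lower():
                passed += 1
    return passed / (len(cases) * trials)


def score_delta(baseline: Callable[[str], str], candidate: Callable[[str], str],
                cases: list[Case]) -> float:
    """Positive when the candidate scores higher than the baseline on the same cases."""
    return pass_rate(candidate, cases) - pass_rate(baseline, cases)
```

With the same cases scored before and after a change, "it feels better" becomes a signed number. With only a handful of cases and trials, small deltas are mostly noise, so treat them with suspicion until the set is larger.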

Building a harness

1. Curate a golden set of representative inputs with known-good outputs (and known-hard edge cases).
2. Pick metrics that match the task: exact match, rubric scoring, an LLM-as-judge with a clear rubric, or human review for the ambiguous slice.
3. Set a target before you build. "90% on the golden set" turns opinion into a finish line.
4. Run evals in CI so every prompt or model change is scored automatically.
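
The four steps above can be sketched in a few dozen lines of Python. This is one possible shape under stated assumptions, not a reference implementation: the golden cases are invented, `fake_generate` is a hypothetical placeholder for the real feature call, and exact match stands in for whichever metric fits your task.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    expected: str
    hard: bool = False  # marks a known-hard edge case


# Step 1: a golden set (made-up examples for illustration).
GOLDEN_SET = [
    GoldenCase("Refund policy for opened items?", "30 days with receipt"),
    GoldenCase("Do you ship to Canada?", "yes"),
    GoldenCase("refund policy opened items??", "30 days with receipt", hard=True),
]

# Step 3: the target, fixed before any prompt tuning starts.
TARGET = 0.90


# Step 2: the metric. Exact match here; a rubric scorer or an LLM-as-judge
# would plug in with the same (output, expected) -> bool signature.
def exact_match(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()


def run_evals(generate: Callable[[str], str], cases: list[GoldenCase],
              metric: Callable[[str, str], bool] = exact_match) -> float:
    """Return the fraction of golden cases the feature gets right."""
    passed = sum(metric(generate(case.prompt), case.expected) for case in cases)
    return passed / len(cases)


# Step 4: a gate CI can act on.
def gate(score: float, target: float = TARGET) -> int:
    """Exit code for CI: 0 when the score meets the target, 1 otherwise."""
    return 0 if score >= target else 1


def fake_generate(prompt: str) -> str:
    """Hypothetical placeholder: answers the easy cases and misses the hard one."""
    lookup = {case.prompt: case.expected for case in GOLDEN_SET if not case.hard}
    return lookup.get(prompt, "I don't know")


def main() -> int:
    score = run_evals(fake_generate, GOLDEN_SET)
    print(f"golden-set score: {score:.0%} (target {TARGET:.0%})")
    return gate(score)
```

A CI step would end the script with `sys.exit(main())`, so a score below the target fails the build the same way a failing unit test does. Here the placeholder misses the hard edge case, scores two out of three, and the gate returns 1.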

Evals are to AI features what tests are to software. Shipping without them is shipping blind — and blind is how pilots become the 95%.

Sources

MIT NANDA — The GenAI Divide
METR — Developer productivity RCT

Written by ivector
