Engineering·May 26, 2026·5 min read

If you can’t measure it, you can’t maintain it: evals for AI features

The single discipline that separates AI pilots that reach production from the 95% that don’t is evaluation. Here’s how to build it.

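MIT's 2025 report on enterprise GenAI found that 95% of pilots deliver no measurable impact. METR's 2025 study of experienced open-source developers found they took about 19% longer with AI tools while believing the tools had sped them up. Both point to the same missing discipline: evaluation.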

Why evals matter more for AI

Traditional code is deterministic — a test passes or fails. AI output is probabilistic; "is it good enough?" becomes an argument unless you make it a number. Without evals, you can't tell if a prompt change helped, if a model upgrade regressed you, or if quality is drifting.
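
To make "turn it into a number" concrete, here is a minimal sketch. It assumes you have already sampled the same input several times under two prompt versions. The sample outputs, the `mentions_paris` check, and the `pass_rate` helper are all hypothetical names chosen for illustration, not part of any particular library.

```python
from typing import Callable


def pass_rate(outputs: list[str], is_acceptable: Callable[[str], bool]) -> float:
    """Fraction of sampled outputs that meet the acceptance check."""
    if not outputs:
        raise ValueError("need at least one output to score")
    return sum(is_acceptable(o) for o in outputs) / len(outputs)


def mentions_paris(output: str) -> bool:
    # Hypothetical acceptance check for the question "What is the capital of France?"
    return "paris" in output.lower()


# Hypothetical samples: the same question answered five times under each prompt.
baseline_outputs = ["Paris", "Paris.", "It is Lyon", "Paris", "paris"]
candidate_outputs = ["Paris", "Paris", "Paris", "paris", "Paris."]

baseline = pass_rate(baseline_outputs, mentions_paris)
candidate = pass_rate(candidate_outputs, mentions_paris)
print(f"baseline={baseline:.2f} candidate={candidate:.2f} delta={candidate - baseline:+.2f}")
# prints: baseline=0.80 candidate=1.00 delta=+0.20
```

With a number per version, "did the prompt change help?" becomes a comparison rather than a debate. Five samples is far too few for a real decision; the point is only the shape of the measurement.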

Building a harness

  1. Curate a golden set of representative inputs with known-good outputs (and known-hard edge cases).
  2. Pick metrics that match the task: exact match, rubric scoring, an LLM-as-judge with a clear rubric, or human review for the ambiguous slice.
  3. Set a target before you build. "90% on the golden set" turns opinion into a finish line.
  4. Run evals in CI so every prompt or model change is scored automatically.
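
The four steps above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not a definitive harness: the golden set, the `Case` type, `stub_model`, and the 0.90 target are all hypothetical, and exact match stands in for whichever metric fits your task. In practice `stub_model` would be replaced by a call to your real model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str
    hard: bool = False  # marks a known-hard edge case


# Step 1: a (hypothetical) golden set with known-good outputs.
GOLDEN_SET = [
    Case("2 + 2", "4"),
    Case("10 / 4", "2.5"),
    Case("7 * 6", "42"),
    Case("2 ** 10", "1024", hard=True),
]

# Step 3: the target, fixed before any prompt tuning starts.
TARGET = 0.90


# Step 2: the metric. Exact match is the simplest choice; swap in rubric
# scoring or a judge model where the task needs it.
def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()


def run_eval(model: Callable[[str], str], cases: list[Case]) -> tuple[float, list[Case]]:
    """Score a model on the cases; return (fraction passed, failing cases)."""
    if not cases:
        raise ValueError("golden set is empty")
    failures = [c for c in cases if not exact_match(model(c.prompt), c.expected)]
    passed = len(cases) - len(failures)
    return passed / len(cases), failures


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; deliberately wrong on the hard case."""
    canned = {"2 + 2": "4", "10 / 4": "2.5", "7 * 6": "42", "2 ** 10": "1000"}
    return canned.get(prompt, "")


def main() -> int:
    """Step 4: return a process exit code so CI can gate on the score."""
    score, failures = run_eval(stub_model, GOLDEN_SET)
    print(f"score={score:.2f} target={TARGET:.2f}")
    for case in failures:
        print(f"FAIL {case.prompt!r}: expected {case.expected!r}")
    return 0 if score >= TARGET else 1

# In CI, call sys.exit(main()) from the entry point so that a score
# below TARGET fails the build.
```

Here the stub scores 3 of 4, so `main()` returns 1 and a CI job wired to that exit code would fail. Returning the failing cases alongside the score keeps the number honest: you can see which inputs dragged it down.
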

Evals are to AI features what tests are to software. Shipping without them is shipping blind, and blind is how pilots become the 95%.

Written by ivector