Evals · Quality · 7 min

Evals That Actually Catch Regressions

Why golden datasets, LLM-as-judge, and production traces are three different evals, and why you need all three.

May 30, 2026

A passing eval suite means nothing if it doesn't model the failure modes your users actually hit.

The three layers

  • Unit evals — small golden set, run on every PR, must stay green.
  • LLM-as-judge — broader coverage, scored by a stronger model with a rubric.
  • Production sampling — replay real traces nightly, flag drift.
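
To make the first layer concrete, here is a minimal sketch of a golden-set check that could gate a PR. Everything in it is an assumption for illustration: `GoldenCase`, `run_golden_set`, the substring match, and the `fake_generate` stand-in are hypothetical names, not part of any particular framework, and a real suite would call your actual model and likely use a richer comparison.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GoldenCase:
    """One hand-checked example (hypothetical shape)."""
    prompt: str
    expected: str  # substring the answer must contain, compared case-insensitively


def run_golden_set(
    generate: Callable[[str], str], cases: List[GoldenCase]
) -> List[GoldenCase]:
    """Return the cases that failed. An empty list means the suite is green."""
    return [
        case
        for case in cases
        if case.expected.lower() not in generate(case.prompt).lower()
    ]


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    canned = {
        "Capital of France?": "The capital is Paris.",
        "2 + 2?": "4",
    }
    return canned.get(prompt, "")


GOLDEN = [
    GoldenCase("Capital of France?", "paris"),
    GoldenCase("2 + 2?", "4"),
]

failures = run_golden_set(fake_generate, GOLDEN)
print(f"{len(GOLDEN) - len(failures)}/{len(GOLDEN)} golden cases passed")
```

In CI, a non-empty `failures` list would fail the build. Returning the failing cases rather than a bare boolean keeps the report actionable, since the PR author sees exactly which prompt regressed.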

Pitfalls

  • Judges that grade outputs from their own model family. A judge can favor text that resembles its own, so use a model from a different family.
  • Golden sets that never grow. Pipe escalated tickets straight in.
  • Pass/fail without confidence intervals. 18/20 is a 90% pass rate, but its 95% Wilson interval runs from about 70% to 97%. That is noise, not a win.
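
The last pitfall is easy to check numerically. Below is a small sketch that puts an interval around a pass rate using the Wilson score interval. The choice of Wilson (over, say, a bootstrap) and the function name `wilson_interval` are assumptions made for this example; the default `z = 1.96` corresponds to a 95% interval.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passes <= total:
        raise ValueError("passes must be between 0 and total")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(18, 20)
print(f"18/20 -> {low:.2f} to {high:.2f}")  # prints: 18/20 -> 0.70 to 0.97
```

With only 20 cases, the true pass rate could plausibly sit anywhere from roughly 70% to 97%, so a move from 17/20 to 18/20 says very little. Growing the set narrows the interval.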