A passing eval suite means nothing if it doesn't model the failure modes your users actually hit.
The three layers
- Unit evals — small golden set, run on every PR, must stay green.
- LLM-as-judge — broader coverage, scored by a stronger model with a rubric.
- Production sampling — replay real traces nightly, flag drift.
Pitfalls
- Judges that grade their own output family. Use a different model.
- Golden sets that never grow. Pipe escalated tickets straight in.
- Pass/fail without confidence intervals. 18/20 is noise, not a win.