If your evals live in a Jupyter notebook on someone's laptop, you do not have evals. You have a vibe.
What "as code" actually means
- Eval cases live in version control next to the prompt and tool definitions.
- Each case has an
id, aninput, anexpected, and agrader. - The eval runner is a CLI you can run locally and in CI.
- Results are written to a structured store, not a printed table.
A minimal case format
- id: refund_policy_basic
input: "Can I get a refund after 30 days?"
expected:
must_contain: ["30-day window", "exceptions"]
must_not_contain: ["yes, anytime"]
grader: rubric_v2
Graders, ranked by trust
- Deterministic: exact match, regex, JSON schema. Cheap, reliable, narrow.
- Programmatic: run the model's code, check the output.
- LLM-as-judge: a second model grades the first. Useful, but calibrate it against humans first.
- Human: slow, expensive, the ground truth everything else is calibrated to.
Mix them. Don't lean on a single grader.
In CI
Block the merge if pass rate on the frozen set drops more than 2 points, or if cost-per-case rises more than 20%.
Two numbers, one gate. That is the whole policy.