Treat your evals as code

If your evals live in a Jupyter notebook on someone's laptop, you do not have evals. You have a vibe.

What "as code" actually means

Eval cases live in version control next to the prompt and tool definitions.
Each case has an id, an input, a set of expectations for the output, and a grader.
The eval runner is a CLI you can run locally and in CI.
Results are written to a structured store, not a printed table.
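As a rough illustration of the last two points, here is a minimal runner sketch. Everything in it is an assumption made for the example: the names `run_cases`, `write_jsonl`, and `main`, the choice of a JSON cases file, and JSON Lines as the "structured store". The model call and the grader are passed in as plain functions so the sketch stays independent of any particular provider.

```python
import argparse
import json
import time
from pathlib import Path
from typing import Callable


def run_cases(cases: list[dict],
              model_fn: Callable[[str], str],
              grade_fn: Callable[[dict, str], bool]) -> list[dict]:
    """Run every case through the model and the grader; one record per case."""
    records = []
    for case in cases:
        started = time.perf_counter()
        output = model_fn(case["input"])
        records.append({
            "id": case["id"],
            "output": output,
            "passed": grade_fn(case, output),
            "latency_s": round(time.perf_counter() - started, 4),
        })
    return records


def write_jsonl(records: list[dict], path: Path) -> None:
    """Write one JSON object per line so results can be diffed and queried."""
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def main(argv: list[str], model_fn, grade_fn) -> int:
    """Hypothetical CLI entry point: parse args, run cases, store results."""
    parser = argparse.ArgumentParser(description="Run eval cases (sketch).")
    parser.add_argument("cases", type=Path, help="JSON file holding a list of cases")
    parser.add_argument("--out", type=Path, default=Path("results.jsonl"))
    args = parser.parse_args(argv)

    cases = json.loads(args.cases.read_text(encoding="utf-8"))
    records = run_cases(cases, model_fn, grade_fn)
    write_jsonl(records, args.out)

    passed = sum(r["passed"] for r in records)
    print(f"{passed}/{len(records)} passed, results in {args.out}")
    return 0  # gating on the numbers is left to a separate step
```

To use it as a real command, one would call `main(sys.argv[1:], ...)` under an `if __name__ == "__main__":` guard with a concrete model function and grader. The same entry point then runs unchanged on a laptop and in CI.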

A minimal case format

- id: refund_policy_basic
  input: "Can I get a refund after 30 days?"
  expected:
    must_contain: ["30-day window", "exceptions"]
    must_not_contain: ["yes, anytime"]
  grader: rubric_v2
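
To make the format concrete, the sketch below validates a case and applies the `must_contain` / `must_not_contain` checks to a model output. It is an illustration, not the behavior of `rubric_v2`, which the case format does not spell out. The function names and the case-insensitive substring matching are assumptions; the case is written as the dict you would get after loading the YAML above.

```python
REQUIRED_KEYS = {"id", "input", "expected", "grader"}


def validate_case(case: dict) -> None:
    """Fail loudly if a case is missing any of the four required fields."""
    missing = REQUIRED_KEYS - case.keys()
    if missing:
        raise ValueError(f"case {case.get('id', '<no id>')} is missing: {sorted(missing)}")


def check_expected(expected: dict, output: str) -> bool:
    """Substring checks only (assumed semantics): every must_contain phrase
    is present and no must_not_contain phrase is. Matching ignores case."""
    text = output.lower()
    has_all = all(p.lower() in text for p in expected.get("must_contain", []))
    has_none = not any(p.lower() in text for p in expected.get("must_not_contain", []))
    return has_all and has_none


# The case from the text, as it would look after loading the YAML.
case = {
    "id": "refund_policy_basic",
    "input": "Can I get a refund after 30 days?",
    "expected": {
        "must_contain": ["30-day window", "exceptions"],
        "must_not_contain": ["yes, anytime"],
    },
    "grader": "rubric_v2",
}

validate_case(case)
good = "Refunds are limited to the 30-day window, with some exceptions."
bad = "Yes, anytime! Just ask."
print(check_expected(case["expected"], good))  # True
print(check_expected(case["expected"], bad))   # False
```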

Graders, ranked by trust

Deterministic: exact match, regex, JSON schema. Cheap, reliable, narrow.
Programmatic: run the model's code, check the output.
LLM-as-judge: a second model grades the first. Useful, but calibrate it against humans first.
Human: slow, expensive, the ground truth everything else is calibrated to.

Mix them. Don't lean on a single grader.
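
One possible way to calibrate an LLM judge against humans is to grade the same sample with both and measure how often they agree. The sketch below computes raw agreement and Cohen's kappa, which discounts the agreement expected by chance. The choice of kappa as the metric, the function names, and the sample labels are assumptions for illustration; what counts as "good enough" agreement is a judgment call the text does not fix.

```python
from collections import Counter


def agreement_rate(human: list, judge: list) -> float:
    """Fraction of cases where the judge gave the same label as the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list, judge: list) -> float:
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement if both raters labeled independently at their own rates.
    expected = sum(h_counts[l] * j_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for four cases: True = pass, False = fail.
human_labels = [True, True, False, False]
judge_labels = [True, True, False, True]

print(agreement_rate(human_labels, judge_labels))  # 0.75
print(cohens_kappa(human_labels, judge_labels))    # 0.5
```

Raw agreement can look high simply because most cases pass, which is why a chance-corrected number is worth tracking alongside it.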

In CI

Block the merge if the pass rate on the frozen set drops by more than 2 percentage points, or if cost-per-case rises by more than 20%, relative to the baseline run.

Two numbers, one gate. That is the whole policy.
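
A minimal sketch of that gate, under two assumptions: "2 points" is read as percentage points of pass rate, and both numbers are compared against a stored baseline run. `RunSummary` and `gate` are hypothetical names, and the thresholds are the ones from the policy above.

```python
from dataclasses import dataclass

MAX_PASS_RATE_DROP_POINTS = 2.0  # percentage points (assumed reading of "2 points")
MAX_COST_INCREASE = 0.20         # relative increase, i.e. 20%


@dataclass(frozen=True)
class RunSummary:
    pass_rate: float      # fraction in [0, 1] on the frozen set
    cost_per_case: float  # any unit, as long as both runs use the same one


def gate(baseline: RunSummary, candidate: RunSummary) -> list[str]:
    """Return the reasons to block the merge; an empty list means it may proceed."""
    reasons = []

    # Rounding keeps float noise from tripping the gate exactly at the threshold.
    drop_points = round((baseline.pass_rate - candidate.pass_rate) * 100, 6)
    if drop_points > MAX_PASS_RATE_DROP_POINTS:
        reasons.append(f"pass rate dropped {drop_points:.1f} points")

    if baseline.cost_per_case > 0:
        increase = round(candidate.cost_per_case / baseline.cost_per_case - 1, 6)
        if increase > MAX_COST_INCREASE:
            reasons.append(f"cost per case rose {increase:.0%}")

    return reasons


baseline = RunSummary(pass_rate=0.90, cost_per_case=0.010)
print(gate(baseline, RunSummary(pass_rate=0.89, cost_per_case=0.011)))  # []
print(gate(baseline, RunSummary(pass_rate=0.87, cost_per_case=0.013)))
```

In CI, a non-empty list would map to a non-zero exit code, and the reasons would go into the job log so the author sees which of the two numbers moved.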