Treat your evals as code

Evals are not a notebook you run once before launch. They are a test suite. Here is what that looks like in practice.

May 30, 2026

If your evals live in a Jupyter notebook on someone's laptop, you do not have evals. You have a vibe.

What "as code" actually means

  • Eval cases live in version control next to the prompt and tool definitions.
  • Each case has an id, an input, an expected block, and a grader.
  • The eval runner is a CLI you can run locally and in CI.
  • Results are written to a structured store, not a printed table.
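To make the list above concrete, here is a minimal sketch of a runner in Python. Everything named here is an assumption for illustration, not a prescribed layout: the `EvalCase` and `EvalResult` shapes, the `contains_all` grader, the `placeholder_model` stand-in, and the choice of JSON Lines as the structured store.

```python
import argparse
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable


@dataclass
class EvalCase:
    id: str
    input: str
    expected: dict
    grader: str  # key into the GRADERS registry below


@dataclass
class EvalResult:
    case_id: str
    grader: str
    passed: bool
    output: str


def contains_all(output: str, expected: dict) -> bool:
    """Hypothetical deterministic grader: every required phrase must appear."""
    text = output.lower()
    return all(p.lower() in text for p in expected.get("must_contain", []))


# Registry mapping grader names (as written in case files) to functions.
GRADERS: dict[str, Callable[[str, dict], bool]] = {"contains_all": contains_all}


def run_cases(cases, model, graders=GRADERS):
    """Call the model on each case and grade the output."""
    results = []
    for case in cases:
        output = model(case.input)
        passed = graders[case.grader](output, case.expected)
        results.append(EvalResult(case.id, case.grader, passed, output))
    return results


def write_jsonl(results, path: Path) -> None:
    """One JSON object per line, so results can be diffed and queried later."""
    with path.open("w", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps(asdict(r)) + "\n")


def placeholder_model(prompt: str) -> str:
    """Stand-in for a real model call; swap in your own client."""
    return f"echo: {prompt}"


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(description="Run eval cases and store results.")
    parser.add_argument("cases", type=Path, help="JSON file holding a list of cases")
    parser.add_argument("--out", type=Path, default=Path("results.jsonl"))
    args = parser.parse_args(argv)

    raw = json.loads(args.cases.read_text(encoding="utf-8"))
    results = run_cases([EvalCase(**c) for c in raw], placeholder_model)
    write_jsonl(results, args.out)
    return 0
```

The same `main` can be called from a local shell and from a CI job. Deciding whether a run is good enough is left to a separate step that reads the stored results.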

A minimal case format

- id: refund_policy_basic
  input: "Can I get a refund after 30 days?"
  expected:
    must_contain: ["30-day window", "exceptions"]
    must_not_contain: ["yes, anytime"]
  grader: rubric_v2
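
The article does not say how `rubric_v2` works, so the sketch below does not try to reproduce it. It shows one plausible deterministic reading of the `expected` block, assuming `must_contain` and `must_not_contain` mean case-insensitive substring checks. The `Verdict` type and `check_expected` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    passed: bool
    reasons: list[str] = field(default_factory=list)


def check_expected(output: str, expected: dict) -> Verdict:
    """Assumed semantics: case-insensitive substring checks on the output."""
    text = output.lower()
    reasons = []
    for phrase in expected.get("must_contain", []):
        if phrase.lower() not in text:
            reasons.append(f"missing required phrase: {phrase!r}")
    for phrase in expected.get("must_not_contain", []):
        if phrase.lower() in text:
            reasons.append(f"found forbidden phrase: {phrase!r}")
    # The case passes only when no check produced a complaint.
    return Verdict(passed=not reasons, reasons=reasons)


# The case above, written as the dict a YAML loader would produce.
case = {
    "id": "refund_policy_basic",
    "input": "Can I get a refund after 30 days?",
    "expected": {
        "must_contain": ["30-day window", "exceptions"],
        "must_not_contain": ["yes, anytime"],
    },
    "grader": "rubric_v2",
}

good = check_expected(
    "Refunds fall outside the 30-day window, but there are exceptions.",
    case["expected"],
)
bad = check_expected("Yes, anytime!", case["expected"])
```

Returning reasons alongside the pass/fail flag makes a failing case readable in the stored results without rerunning it.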

Graders, ranked by trust

  1. Deterministic: exact match, regex, JSON schema. Cheap, reliable, narrow.
  2. Programmatic: run the model's code, check the output.
  3. LLM-as-judge: a second model grades the first. Useful, but calibrate it against humans first.
  4. Human: slow, expensive, the ground truth everything else is calibrated to.

Mix them. Don't lean on a single grader.
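
One way to calibrate an LLM judge against humans is to have both label the same sample of outputs and measure how often they agree. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The function names and sample labels are made up for illustration, and what counts as "good enough" agreement is a call for your team.

```python
from collections import Counter


def agreement_rate(judge: list, human: list) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list, human: list) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_o = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both raters used one identical label;
        # this sketch reports that as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for four outputs: True means "pass".
judge_labels = [True, True, False, False]
human_labels = [True, False, False, False]

rate = agreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

If kappa is low, the judge prompt or rubric likely needs work before its scores are trusted in a gate.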

In CI

Block the merge if the pass rate on the frozen set drops by more than 2 percentage points, or if cost per case rises by more than 20%.

Two numbers, one gate. That is the whole policy.
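
As a sketch, the gate fits in one function. It assumes "2 points" means percentage points and that both numbers are compared against a baseline run, such as the latest run on the main branch. `RunSummary` and `gate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    pass_rate: float      # fraction of cases passed, in [0, 1]
    cost_per_case: float  # any consistent unit, e.g. dollars


def gate(
    baseline: RunSummary,
    candidate: RunSummary,
    max_pass_drop_points: float = 2.0,
    max_cost_increase: float = 0.20,
) -> list[str]:
    """Return the reasons to block the merge; an empty list means it may proceed."""
    eps = 1e-9  # tolerance so float rounding does not trip an exact-threshold case
    failures = []

    drop_points = (baseline.pass_rate - candidate.pass_rate) * 100
    if drop_points > max_pass_drop_points + eps:
        failures.append(f"pass rate dropped {drop_points:.1f} points")

    # A zero-cost baseline has no meaningful relative increase, so skip the check.
    if baseline.cost_per_case > 0:
        increase = (candidate.cost_per_case - baseline.cost_per_case) / baseline.cost_per_case
        if increase > max_cost_increase + eps:
            failures.append(f"cost per case rose {increase:.0%}")

    return failures


baseline = RunSummary(pass_rate=0.90, cost_per_case=1.00)
ok = gate(baseline, RunSummary(pass_rate=0.89, cost_per_case=1.10))
blocked = gate(baseline, RunSummary(pass_rate=0.87, cost_per_case=1.25))
```

In CI, a non-empty list would translate to a nonzero exit code, with the reasons printed in the job log.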