Most agent demos die the moment a real user touches them. Here is the short list we run through before any agent goes near production traffic.
1. Define the job, narrowly
- Write the agent's job description in one sentence.
- Write three concrete tasks it must do, and three it must refuse.
- If you cannot do both, the scope is too big; split it.
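
One way to keep this honest is to write the scope down as data and check it mechanically. The sketch below is only an illustration: `AgentScope`, `scopeProblems`, and the refund example are hypothetical names, and the one-sentence check is deliberately crude.

```typescript
// Hypothetical shape for a written-down agent scope.
interface AgentScope {
  job: string;          // the one-sentence job description
  mustDo: string[];     // concrete tasks the agent must handle
  mustRefuse: string[]; // concrete tasks the agent must turn down
}

// Returns a list of problems; an empty list means the scope passes the checklist.
function scopeProblems(scope: AgentScope): string[] {
  const problems: string[] = [];
  // Crude check: split on sentence-ending punctuation and expect one non-empty piece.
  const sentences = scope.job.split(/[.!?]+/).filter((s) => s.trim().length > 0);
  if (sentences.length !== 1) problems.push("job must be exactly one sentence");
  if (scope.mustDo.length < 3) problems.push("need at least three must-do tasks");
  if (scope.mustRefuse.length < 3) problems.push("need at least three must-refuse tasks");
  return problems;
}

// Illustrative scope for an imaginary refund-support agent.
const refundAgent: AgentScope = {
  job: "Answer refund questions for existing orders.",
  mustDo: ["look up order status", "explain the refund policy", "open a refund ticket"],
  mustRefuse: ["issue refunds directly", "change shipping addresses", "give legal advice"],
};
```

The `mustDo` and `mustRefuse` entries can later seed the first cases of an eval set.
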
2. Tools before prompts
Agents are mostly tools with a language model glued on top.
- Each tool has a typed schema and a single responsibility.
- Tools validate their own inputs and return structured errors, not stack traces.
- Side-effectful tools (write, send, pay) require an explicit `confirm: true` argument.
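
The three rules above can be sketched in one small tool. This is a minimal illustration under stated assumptions, not a reference implementation: `sendEmail`, `SendEmailArgs`, `ToolResult`, the error codes, and the injected `deliver` callback are all hypothetical names, and the email check is intentionally simplistic.

```typescript
// Structured result: callers branch on `ok` instead of catching exceptions.
type ToolResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: { code: string; message: string } };

// Hypothetical input schema for a side-effectful "send email" tool.
interface SendEmailArgs {
  to: string;
  subject: string;
  body: string;
  confirm?: boolean; // must be explicitly true before anything is sent
}

// Type guard: the tool validates its own input rather than trusting the model.
function isSendEmailArgs(x: unknown): x is SendEmailArgs {
  if (typeof x !== "object" || x === null) return false;
  const a = x as Record<string, unknown>;
  return (
    typeof a["to"] === "string" &&
    typeof a["subject"] === "string" &&
    typeof a["body"] === "string" &&
    (a["confirm"] === undefined || typeof a["confirm"] === "boolean")
  );
}

// `deliver` stands in for a real mail client, which this sketch does not assume.
function sendEmail(
  rawArgs: unknown,
  deliver: (args: SendEmailArgs) => string,
): ToolResult<{ messageId: string }> {
  if (!isSendEmailArgs(rawArgs)) {
    return { ok: false, error: { code: "invalid_args", message: "expected string to, subject, body and optional boolean confirm" } };
  }
  if (!rawArgs.to.includes("@")) {
    return { ok: false, error: { code: "invalid_recipient", message: "to must be an email address" } };
  }
  if (rawArgs.confirm !== true) {
    return { ok: false, error: { code: "confirmation_required", message: "call again with confirm: true to send" } };
  }
  try {
    return { ok: true, value: { messageId: deliver(rawArgs) } };
  } catch (e) {
    // Return a short structured error; the stack trace stays out of the model's context.
    const message = e instanceof Error ? e.message : "unknown failure";
    return { ok: false, error: { code: "delivery_failed", message } };
  }
}
```

Because the confirmation check sits inside the tool, a prompt change cannot silently remove it.
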
3. Guardrails you can point at
| Layer | What it catches |
|---|---|
| Input filter | Prompt injection, PII you don't want logged |
| Tool allowlist | Model trying to call something it shouldn't |
| Output filter | Leaked secrets, unsafe content |
| Spend cap | Runaway loops |
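
Two of these layers, the tool allowlist and the spend cap, are simple enough to sketch as a per-run guard. The names (`makeRunGuard`, `GuardDecision`) and the dollar limits are assumptions made for illustration; input and output filters are left out because they depend on the classifier or pattern set you choose.

```typescript
// Every decision names the layer that fired, so a blocked step can be traced to it.
type GuardDecision =
  | { allowed: true }
  | { allowed: false; layer: "tool_allowlist" | "spend_cap"; reason: string };

// Hypothetical per-run guard: create one per agent run.
function makeRunGuard(allowedTools: string[], maxCostUsd: number) {
  const allowlist = new Set(allowedTools);
  let spentUsd = 0;
  return {
    // Call before executing any tool the model asked for.
    checkTool(tool: string): GuardDecision {
      if (!allowlist.has(tool)) {
        return { allowed: false, layer: "tool_allowlist", reason: `tool "${tool}" is not on the allowlist` };
      }
      return { allowed: true };
    },
    // Call after each model or tool call with its cost; denies once the cap is exceeded.
    recordSpend(costUsd: number): GuardDecision {
      spentUsd += costUsd;
      if (spentUsd > maxCostUsd) {
        return {
          allowed: false,
          layer: "spend_cap",
          reason: `run spent $${spentUsd.toFixed(2)}, cap is $${maxCostUsd.toFixed(2)}`,
        };
      }
      return { allowed: true };
    },
  };
}
```

The agent loop would stop the run on the first denied decision and log its `layer`, which is what makes the guardrail something you can point at.
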
4. Observability that survives 3am
```typescript
// One structured record per agent step.
logger.info("agent.step", {
  run_id, step, tool, latency_ms, tokens_in, tokens_out, cost_usd
});
```
If you can't answer "what did the agent do for user X at 02:14?" in under a minute, you are not ready.
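
Answering that question requires each record to carry a user and a timestamp, which the log call above does not show. The sketch below adds `user_id` and `ts` as assumed fields and filters an in-memory array. In practice the same query would run against your log store, and `StepRecord` and `stepsForUser` are hypothetical names.

```typescript
// One record per agent step. Fields mirror the log call above, plus
// user_id and ts, which are assumptions added so the lookup is possible.
interface StepRecord {
  run_id: string;
  user_id: string;
  ts: string; // ISO 8601 timestamp, UTC
  step: number;
  tool: string;
  latency_ms: number;
  tokens_in: number;
  tokens_out: number;
  cost_usd: number;
}

// Returns the steps for one user inside [from, to), oldest first.
function stepsForUser(
  records: StepRecord[],
  userId: string,
  from: string,
  to: string,
): StepRecord[] {
  const start = Date.parse(from);
  const end = Date.parse(to);
  return records
    .filter((r) => r.user_id === userId)
    .filter((r) => {
      const t = Date.parse(r.ts);
      return t >= start && t < end;
    })
    .sort((a, b) => Date.parse(a.ts) - Date.parse(b.ts));
}
```
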
5. Evals before launch, evals after launch
- A frozen offline eval set that runs on every PR.
- A sampled online eval that grades real production runs daily.
- A rollback plan if either drops.
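
The offline half can be sketched as a small gate that CI runs on every PR. Everything here is an assumption made for illustration: the names `EvalCase`, `runOfflineEval`, and `shouldBlock`, the exact-match grading, the synchronous `agent` callback, and the 0.02 tolerance. Real agents are usually asynchronous and often need a fuzzier grader.

```typescript
interface EvalCase {
  id: string;
  input: string;
  expected: string;
}

interface EvalReport {
  passRate: number;    // fraction of cases passed, between 0 and 1
  failedIds: string[]; // ids of the cases that failed, for the PR comment
}

// Runs every frozen case through `agent` and grades by exact match after trimming.
function runOfflineEval(cases: EvalCase[], agent: (input: string) => string): EvalReport {
  const failedIds = cases
    .filter((c) => agent(c.input).trim() !== c.expected)
    .map((c) => c.id);
  // An empty eval set counts as failing rather than vacuously passing.
  const passRate = cases.length === 0 ? 0 : (cases.length - failedIds.length) / cases.length;
  return { passRate, failedIds };
}

// Gate for CI: block when the pass rate falls more than `tolerance` below the baseline.
function shouldBlock(report: EvalReport, baselinePassRate: number, tolerance = 0.02): boolean {
  return report.passRate < baselinePassRate - tolerance;
}
```

The same `shouldBlock` comparison can drive the rollback decision for the sampled online eval, with the previous day's score as the baseline.
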
Ship small. Watch closely. Iterate.