Most agent failures aren't reasoning failures — they're scope failures. The planner over-commits, the executor improvises, and three tool calls later you're paying for a 40k-token loop that produces nothing.
The split
- Planner: pure reasoning, no side effects, emits a typed plan.
- Executor: deterministic dispatcher, runs one step, returns a result.
- Critic: validates each step against the plan and short-circuits drift.
What we learned shipping this
- Make the plan a schema, not prose. JSON schemas with
enumstep types cut hallucinated tools to near-zero. - Cap retries per step, not per run. Runs that loop forever almost always loop on the same broken step.
- Log the delta between plan and execution. That's your eval gold.