/lab · what's on the bench

What is on the bench.

Technical work in progress — across agentic systems, retrieval pipelines, evaluation harnesses and the model layer underneath. Some of it ships into apps. Some of it doesn't.

/active areas

Where we spend our research time.

Agentic systems · ongoing

Planner-executor agents with explicit tool budgets.

Most agent demos collapse the moment they meet real users. Our current architecture splits planning and execution, budgets every tool call, and surfaces a transcript the user can step through.
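The split described above can be sketched in a few lines. This is a minimal illustration, not our production code: `Step`, `Executor`, `BudgetExceeded` and the transcript fields are hypothetical names, and the planner is assumed to hand over an ordered list of steps.

```python
from dataclasses import dataclass, field
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a plan step would exceed a tool's call budget."""


@dataclass
class Step:
    tool: str   # name of the tool the planner wants to call
    arg: str    # single string argument, kept simple for the sketch


@dataclass
class Executor:
    tools: dict[str, Callable[[str], str]]
    budgets: dict[str, int]                      # max calls allowed per tool
    transcript: list[dict] = field(default_factory=list)

    def run(self, plan: list[Step]) -> list[dict]:
        used: dict[str, int] = {}
        for index, step in enumerate(plan):
            count = used.get(step.tool, 0)
            # A tool with no budget entry is treated as having a budget of zero.
            if count >= self.budgets.get(step.tool, 0):
                self.transcript.append(
                    {"step": index, "tool": step.tool, "status": "budget_exceeded"}
                )
                raise BudgetExceeded(step.tool)
            used[step.tool] = count + 1
            output = self.tools[step.tool](step.arg)
            # Every call is recorded so a user can step through the run later.
            self.transcript.append(
                {"step": index, "tool": step.tool, "input": step.arg,
                 "output": output, "status": "ok"}
            )
        return self.transcript
```

The planner never touches a tool directly; it only produces `Step` objects, so the budget check and the transcript live in one place.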

Planning · Tool use · Traces

Retrieval · ongoing

Hybrid retrieval with semantic-boundary chunking.

Vector search plus BM25, reranked with a cross-encoder. Chunks respect document structure — sections, headings, list boundaries. Boring. Effective.
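One way to combine the two result lists before reranking is reciprocal rank fusion. The sketch below assumes that technique (the text above does not name a fusion method); `hybrid_search` and the `rerank` callable, which stands in for a cross-encoder score, are hypothetical.

```python
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids; higher fused score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def hybrid_search(
    vector_hits: list[str],
    bm25_hits: list[str],
    rerank: Callable[[str], float],
    top_n: int = 3,
) -> list[str]:
    """Fuse both rankings, then rerank a shortlist with the supplied scorer."""
    candidates = reciprocal_rank_fusion([vector_hits, bm25_hits])
    # Only the fused shortlist is reranked, since cross-encoders are expensive.
    shortlist = candidates[: top_n * 2]
    return sorted(shortlist, key=rerank, reverse=True)[:top_n]
```

Fusing on ranks rather than raw scores avoids having to calibrate vector similarities against BM25 scores, which live on different scales.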

RAG · Reranking · Chunking

Evaluation · ongoing

Golden sets and regression alarms per app.

Each app maintains ~500 golden examples with expected behaviour. Every release runs the full set, and any drift from the expected behaviour triggers an alarm. No silent regressions.
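A regression check of this kind can be as small as the sketch below. It assumes exact-match scoring and a fixed tolerance against a stored baseline pass rate; `GoldenExample`, `run_golden_set`, `regression_alarm` and the `tolerance` default are illustrative, not a description of our harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    prompt: str
    expected: str


def run_golden_set(app: Callable[[str], str], examples: list[GoldenExample]) -> float:
    """Return the fraction of golden examples the app answers exactly as expected."""
    if not examples:
        raise ValueError("golden set is empty")
    passed = sum(1 for example in examples if app(example.prompt) == example.expected)
    return passed / len(examples)


def regression_alarm(current: float, baseline: float, tolerance: float = 0.01) -> bool:
    """Return True when the pass rate dropped more than `tolerance` below baseline."""
    return (baseline - current) > tolerance
```

In practice the comparison is rarely exact string equality; a scorer per example type would slot in where the `==` sits.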

Evals · Regression

Model layer · ongoing

Per-request routing across frontier and open-weight models.

A routing layer picks the right frontier or small open-weight model per request based on task complexity, latency target and cost budget.
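As a rough sketch of that decision, a router can filter models by the three constraints and take the cheapest survivor. The `ModelOption` fields, the 1 to 5 capability scale and the tie-breaking rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelOption:
    name: str
    capability: int          # 1 (simple tasks) to 5 (hardest tasks), assumed scale
    p95_latency_ms: int
    cost_per_request: float


def route(
    options: list[ModelOption],
    complexity: int,
    latency_target_ms: int,
    cost_budget: float,
) -> Optional[ModelOption]:
    """Pick the cheapest model that meets all three constraints, or None."""
    eligible = [
        model for model in options
        if model.capability >= complexity
        and model.p95_latency_ms <= latency_target_ms
        and model.cost_per_request <= cost_budget
    ]
    if not eligible:
        return None  # caller decides whether to relax a constraint or fail
    # Cheapest first; latency breaks ties between equally priced models.
    return min(eligible, key=lambda m: (m.cost_per_request, m.p95_latency_ms))
```

Returning `None` instead of silently falling back keeps the trade-off explicit: the caller chooses which constraint to relax.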

Routing · Cost

/bets · before they are obvious

Things we believe before they are obvious.

Opinionated views shaping what we build next. Each one is a working hypothesis, and we are happy to be proven wrong.

bet.01

Agents will work — narrowly

General-purpose agents will continue to disappoint. Tightly scoped agents with tool budgets, retrieval grounding and human-in-the-loop fallback will quietly ship.

bet.02

Retrieval is forever

Context windows keep growing. Retrieval doesn't go away — it just moves up a layer. Knowing which 1% of the corpus to feed the model stays the central problem.

bet.03

Evals are the moat

Anyone can call an API. What separates production-grade work is the willingness to evaluate it. Eval harnesses age better than prompts.

bet.04

Multimodal is table stakes

Voice, image, structured data: users won't think of these as separate input modes for much longer. We design every app to be multimodal from day one.

bet.05

Latency is a feature

A model that answers in 800 ms beats one that takes 4 s for marginally better output. We optimise for felt speed.

bet.06

Privacy will compound

Apps that handle PHI, financials and personal documents will need a real privacy posture. We are building toward residency-friendly, audit-ready defaults.

/talk to us

Working on something similar?

We trade notes with engineers and founders working on agents, retrieval and applied AI. Drop us a line; we are happy to compare playbooks.

Start a conversation →

Read the blog