Retrieval quality is mostly a chunking problem in disguise. Embeddings and rerankers can only work with the pieces you hand them.
The three questions
- What is one answer-sized unit in this corpus? A paragraph? A function? A row?
- What context does the reader need around that unit to make sense of it?
- What is the smallest chunk that still contains both?
Your chunk size is the answer to the third question.
Defaults that usually work
- Prose docs: 500–800 tokens, 10–15% overlap, split on headings first.
- Code: split by symbol (function, class), never by line count.
- Tables: keep the header row with every chunk of the body.
- Chat logs: chunk by conversation turn, not by token window.
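As an illustration of the "split by symbol" default for code, here is a minimal sketch that assumes the corpus is Python source and uses the standard-library `ast` module. The function name `split_by_symbol` and the chunk fields are made up for this example; other languages would need their own parser.

```python
import ast

def split_by_symbol(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in a Python module."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "symbol": node.name,            # e.g. "add" or "Greeter"
                "kind": type(node).__name__,    # "FunctionDef", "ClassDef", ...
                # Exact source text of the symbol (decorators are not included).
                "text": ast.get_source_segment(source, node),
            })
    return chunks

# Hypothetical sample module, only for demonstration.
SAMPLE = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''

chunks = split_by_symbol(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["symbol"])
```

Each chunk is a whole function or class, so a retrieved result is never cut off halfway through a body. Very large classes may still need a second split by method.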
Things that quietly break retrieval
- Splitting mid-sentence. Always snap to a sentence or block boundary.
- Throwing away the document title. Prepend it to every chunk.
- Mixing languages in one index without a language tag in metadata.
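The first two fixes can be combined in one small chunker. This is a rough sketch under two assumptions: word count stands in for token count, and a simple regex stands in for a real sentence splitter. The name `chunk_sentences` and the `max_words` parameter are illustrative, and overlap is left out to keep the example short.

```python
import re

def chunk_sentences(title: str, text: str, max_words: int = 120) -> list[str]:
    """Pack whole sentences into chunks and prepend the document title to each."""
    # Naive splitter: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # assumption: words approximate tokens
        # Close the chunk before it would overflow, never in mid-sentence.
        if current and count + n > max_words:
            chunks.append(f"{title}\n\n" + " ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(f"{title}\n\n" + " ".join(current))
    return chunks

out = chunk_sentences(
    "Handbook",
    "One two three. Four five six. Seven eight nine.",
    max_words=6,
)
print(out)
```

A single sentence longer than `max_words` becomes its own oversized chunk here rather than being cut, which matches the rule of always snapping to a boundary.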
Metadata is half the system
{
  "doc_id": "handbook-2026",
  "section": "Engineering > On-call",
  "heading_path": ["On-call", "Escalation"],
  "updated_at": "2026-04-12"
}
Filter on metadata first, then run the vector search over what remains. A smaller candidate set is faster to search and almost always returns more accurate results.
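The filter-then-rank order can be sketched in plain Python. Everything here is a toy assumption: the two-dimensional vectors are hand-written rather than produced by an embedding model, and `search`, `section_prefix`, and the chunk layout are hypothetical names. A real vector store would apply the filter inside its own query API.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(chunks: list[dict], query_vec: list[float], section_prefix: str, k: int = 3) -> list[dict]:
    # Step 1: cheap metadata filter narrows the candidate set.
    candidates = [c for c in chunks if c["meta"]["section"].startswith(section_prefix)]
    # Step 2: similarity ranking runs only over the survivors.
    ranked = sorted(candidates, key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    return ranked[:k]

# Toy index with made-up vectors.
CHUNKS = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"section": "Engineering > On-call"}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"section": "Sales > Pricing"}},
    {"id": "c", "vec": [0.0, 1.0], "meta": {"section": "Engineering > On-call"}},
]

hits = search(CHUNKS, [1.0, 0.0], "Engineering")
print([h["id"] for h in hits])
```

Chunk "b" is close to the query vector but never competes, because it was removed by the section filter before any similarity was computed.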
Measure, don't vibe
Build a small eval set of 50 real questions with known correct passages. Re-run it every time you change chunking. If recall@5 doesn't move, neither should your config.
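One way to compute the metric is sketched below. It assumes a particular definition of recall@k, namely the fraction of questions for which at least one known correct passage appears in the top k results; other definitions exist. The question and passage IDs are placeholders.

```python
def recall_at_k(results: dict[str, list[str]], gold: dict[str, set[str]], k: int = 5) -> float:
    """Fraction of questions with at least one gold passage in the top k results."""
    if not gold:
        return 0.0
    hits = sum(
        1
        for question, relevant in gold.items()
        if relevant & set(results.get(question, [])[:k])
    )
    return hits / len(gold)

# Placeholder eval data: gold passages per question, and ranked retriever output.
gold = {"q1": {"p1"}, "q2": {"p7"}}
results = {"q1": ["p3", "p1"], "q2": ["p2", "p4"]}

score_at_5 = recall_at_k(results, gold, k=5)
score_at_1 = recall_at_k(results, gold, k=1)
print(score_at_5, score_at_1)
```

Storing `gold` alongside the corpus and running this after every chunking change gives a single number to compare between configs.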