Chunking Strategies for RAG: Why Chunk Size Makes or Breaks Retrieval
by Andergrove Software
When a retrieval-augmented generation (RAG) system gives a bad answer, the model usually takes the blame. But trace the failure back and the story is almost always the same: the right information existed, and retrieval never surfaced it — because of how the documents were cut into chunks months earlier. Chunking is the least glamorous stage of the pipeline and the highest-leverage knob in it. Here is how the main strategies compare, and how to pick sizes with evidence instead of folklore.
Why chunking exists at all
Three constraints force it. Embeddings average meaning over their input: one vector for a 40-page manual is "about" everything in it and therefore close to nothing in particular, so long documents must be split before they can be found (see "embeddings and cosine similarity explained" for why). The model’s context budget is finite and billed: you can only stuff so many retrieved passages into a prompt, so each one needs to earn its tokens. And answers live at paragraph scale: the user asked about one refund rule, not the whole policy manual. Chunking is how you make the unit of retrieval match the unit of answering.
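To make the "close to nothing in particular" point concrete, here is a toy calculation in Python. It assumes a deliberately simplified picture that real embedding models only approximate: each topic is its own orthogonal unit vector, and a document's embedding is the plain average of its topics. The function names are hypothetical.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise average, standing in for 'one vector for the whole document'."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def similarity_after_pooling(n_topics: int) -> float:
    """Similarity between a single-topic query and a document averaging n_topics topics."""
    # Toy assumption: topic i is the i-th unit vector, so topics are orthogonal.
    topics = [[1.0 if i == j else 0.0 for j in range(n_topics)] for i in range(n_topics)]
    query = topics[0]
    whole_document = mean_vector(topics)
    return cosine(query, whole_document)
```

Under these assumptions the similarity works out to 1/sqrt(n): a chunk covering one topic matches the query perfectly, while a document averaging 16 topics scores 0.25 against the same query.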
The core trade-off: precision vs. context
Every chunking decision is the same tension viewed from different angles. Small chunks produce sharp, specific embeddings that match queries precisely — and then arrive at the prompt as orphans, a sentence stripped of the caveat two lines above it. Large chunks carry their own context — and their embeddings blur into topic soup, they match queries loosely, and they burn prompt budget on padding. Retrieval quality pulls toward small; answer quality pulls toward big. Every strategy below is an attempt to cheat that trade-off rather than pick a side.
Fixed-size with overlap: the baseline that mostly works
The default recipe: split into windows of roughly 200–500 tokens with 10–20% overlap between consecutive chunks. Overlap exists because ideas straddle boundaries — without it, a definition that ends one chunk and a rule that starts the next are each half-useless; with it, at least one chunk holds the complete thought. Two details matter more than they look. Measure in tokens, not characters — token counts are what the embedding model and the prompt budget actually see. And split on natural boundaries (sentence ends, paragraph breaks) rather than mid-word at exactly N tokens; a chunk that starts mid-sentence embeds noise. The text chunker lets you preview exactly how a document splits under different sizes and overlaps before you embed anything.
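The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the text chunker's implementation: it assumes a whitespace word count as a stand-in for a real tokenizer and a naive regex sentence splitter, and `count_tokens`, `split_sentences` and `chunk_with_overlap` are hypothetical names.

```python
import re

def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace-separated words.
    # Swap in your embedding model's tokenizer for real counts.
    return len(text.split())

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_with_overlap(text: str, max_tokens: int = 400,
                       overlap_ratio: float = 0.15) -> list[str]:
    """Pack whole sentences into chunks, carrying trailing sentences forward as overlap."""
    overlap_budget = int(max_tokens * overlap_ratio)
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Start the next chunk with as many trailing sentences as fit the overlap budget.
            carried: list[str] = []
            carried_tokens = 0
            for previous in reversed(current):
                t = count_tokens(previous)
                if carried_tokens + t > overlap_budget:
                    break
                carried.insert(0, previous)
                carried_tokens += t
            current, current_tokens = carried, carried_tokens
        # A sentence is never cut, so a chunk can exceed max_tokens
        # when a single sentence (plus carried overlap) is too long.
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the overlap is measured in whole sentences, the effective overlap is at most the requested ratio and sometimes zero when the trailing sentence is longer than the budget. That is the price of never starting a chunk mid-sentence.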
Structure-aware chunking: let the document tell you where to cut
Real documents already have semantic boundaries — headings, sections, list items — and cutting along them beats any fixed window, because the author already grouped related ideas for you. For markdown or HTML, split on heading levels and keep each section (or subsection, if sections run long) as a chunk. Two hard rules: never split a table (half a table embeds as gibberish and answers nothing) and never split a code block — treat both as atomic, even when that makes an oversized chunk. Structure-aware chunking shines on documentation, wikis, policies and contracts; it has nothing to grab onto in transcripts, chat logs or OCR soup, which is where the fixed-size baseline remains the honest choice.
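For markdown, heading-based splitting might look like the sketch below. It assumes ATX-style headings (lines starting with `#`) and triple-backtick code fences, and `split_markdown_sections` is a hypothetical helper rather than a library function. Because it cuts only at headings, tables and code blocks inside a section stay whole.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # triple backtick, built this way to keep the example easy to embed

def split_markdown_sections(markdown: str) -> list[dict]:
    """Return one chunk per heading section, each with its heading breadcrumb."""
    sections: list[dict] = []
    path: list[str] = []   # headings from the top level down to the current section
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({"breadcrumb": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggles on both the opening and the closing fence
        # Inside a code fence, a line starting with '#' is a comment, not a heading.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]     # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections
```

Sections that come out too long can then be passed to a fixed-size splitter, with the code block or table pulled out first so it remains atomic.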
Semantic chunking and retrieve-small, feed-big
Two fancier techniques earn their complexity in the right circumstances. Semantic chunking embeds each sentence, walks the sequence, and cuts wherever similarity between neighbours drops — letting topic shifts, not token counts, place the boundaries. It costs an embedding pass at indexing time and mostly pays off on long unstructured prose. Parent-document retrieval attacks the trade-off directly: index small chunks (sharp matching), but when one is retrieved, hand the model its surrounding section or page (full context). You get small-chunk precision and big-chunk answers at the price of storing the mapping. If your system’s failure mode is "found the right place, answered from a fragment", this is the fix.
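Both ideas fit in a short Python sketch. Everything here is illustrative: `toy_embed` is a bag-of-words stand-in for a real embedding model, the 0.2 threshold is an arbitrary assumption you would tune, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model (assumption): lowercase word counts.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2,
                    embed=toy_embed) -> list[list[str]]:
    """Start a new chunk wherever similarity between neighbouring sentences drops."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(previous, current) < threshold:
            chunks.append([])          # topic shift: cut here
        chunks[-1].append(sentence)
    return chunks

def index_children(sections: dict[str, str], split) -> list[tuple[str, str]]:
    """Pair every small chunk with the id of the section it came from."""
    return [(child, parent_id)
            for parent_id, text in sections.items()
            for child in split(text)]

def retrieve_parent(query: str, children: list[tuple[str, str]],
                    sections: dict[str, str], embed=toy_embed) -> str:
    """Match on the small chunk, but return its whole parent section."""
    q = embed(query)
    _, parent_id = max(children, key=lambda pair: cosine(q, embed(pair[0])))
    return sections[parent_id]
```

In a real system the child vectors would be computed once at indexing time and stored alongside the parent id; re-embedding on every query, as done here, only keeps the sketch short.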
Metadata: the cheap upgrade everyone skips
A chunk that reads "the limit is 30 days from delivery" is unfindable: the words that would match a query ("refund", "digital purchases") live in the heading three levels up. The fix costs one string concatenation: prepend the breadcrumb to the chunk text before embedding, e.g. "Refund policy > Digital purchases: the limit is 30 days…". Retrieval quality jumps, especially on section-heavy documents. While you’re at it: store source, position and heading as metadata for citations and filtering, and strip repeated boilerplate (headers, footers, cookie banners) before chunking; otherwise your index fills with near-identical chunks of navigation that match everything weakly.
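A minimal Python sketch of both steps follows. The `Chunk` fields mirror the metadata named above; the class itself, `strip_repeated_lines`, and the 80% cutoff are illustrative assumptions rather than a fixed recipe.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str           # e.g. file name or URL, kept for citations
    position: int         # order of the chunk within its source
    headings: list[str]   # breadcrumb from the top-level heading down

    def embedding_text(self) -> str:
        """Text to embed: the heading breadcrumb prepended to the chunk body."""
        breadcrumb = " > ".join(self.headings)
        return f"{breadcrumb}: {self.text}" if breadcrumb else self.text

def strip_repeated_lines(pages: list[str], min_share: float = 0.8) -> list[str]:
    """Drop lines that recur on most pages (headers, footers, banners)."""
    # Count each distinct non-blank line once per page.
    counts = Counter(line for page in pages
                     for line in set(page.splitlines()) if line.strip())
    # Never treat a line as boilerplate unless it appears on at least two pages.
    cutoff = max(2, min_share * len(pages))
    return ["\n".join(line for line in page.splitlines() if counts[line] < cutoff)
            for page in pages]
```

Only `embedding_text()` goes to the embedding model; the raw `text` is what you would show in a citation, so the breadcrumb is not displayed twice.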
Evaluate it like the hyperparameter it is
Chunk size is a tunable, and tunables deserve measurement, not vibes. Build a small evaluation set — 20–30 real user queries, each labelled with the passage that best answers it — and measure how often the right chunk lands in the top 3–5 results. Then read the failures diagnostically. Right document but wrong section retrieved: chunks are too big. Retrieved chunk is on-topic but missing the answer’s other half: too small, or overlap too thin. Correct chunk exists and ranks fifth behind boilerplate: a metadata or deduplication problem, not a size problem. One afternoon of this converts chunking from folklore into engineering — and the cosine similarity calculator is handy for autopsying any individual surprising match.
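The measurement itself is only a few lines. The sketch below assumes each evaluation item is a (query, expected chunk id) pair and that your retriever can be wrapped as a `search(query, k)` function returning ranked chunk ids; both the shape of the data and the name `hit_rate_at_k` are assumptions for illustration.

```python
from typing import Callable

def hit_rate_at_k(eval_set: list[tuple[str, str]],
                  search: Callable[[str, int], list[str]],
                  k: int = 5) -> tuple[float, list[tuple[str, str, list[str]]]]:
    """Share of queries whose labelled chunk appears in the top k, plus the misses."""
    hits = 0
    misses: list[tuple[str, str, list[str]]] = []
    for query, expected_id in eval_set:
        results = search(query, k)[:k]
        if expected_id in results:
            hits += 1
        else:
            # Keep what was retrieved instead, for the diagnostic read-through.
            misses.append((query, expected_id, results))
    rate = hits / len(eval_set) if eval_set else 0.0
    return rate, misses
```

Run it once per candidate chunk size and overlap, compare the rates, and read the `misses` list by hand: the retrieved ids next to the expected one are where the diagnoses above come from.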
A sensible starting recipe
For a typical documentation- or policy-style corpus: split structure-aware on headings; within long sections, fall back to ~400 tokens with 15% overlap; keep tables and code blocks atomic; prepend the heading breadcrumb; strip boilerplate; store source metadata. Preview the cut in the text chunker, embed, then spend the afternoon on the evaluation set before touching anything else — and if you want to see whether your chunks form sensible clusters, drop a sample into the embedding projector. Chunking decided at the start and never revisited is how RAG systems quietly rot; chunking treated as a measured, tunable stage is how they get good.