Chunking Strategies — RAG Engineering: Building AI That Knows Your Data

Why Chunking Matters

You cannot embed an entire document as a single vector and expect good retrieval. A 50-page PDF compressed into one embedding loses the specificity needed to match precise questions. Conversely, embedding individual sentences loses the context needed for the LLM to generate a coherent answer.

Chunking is the process of splitting documents into smaller segments that are each embedded independently. The goal is to create chunks that are small enough to be semantically focused (one topic per chunk) but large enough to carry sufficient context for the LLM to use them effectively.

Getting chunking right is one of the highest-leverage optimizations in any RAG system. A poorly chunked knowledge base will consistently retrieve irrelevant or incomplete information, no matter how good your embedding model or retrieval algorithm is.

Fixed-Size Chunking

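The simplest approach: split text into chunks of a fixed number of characters, with some overlap between consecutive chunks. LangChain's CharacterTextSplitter approximates this by splitting on a single separator and merging the pieces until they approach the size limit.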

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,      # Target maximum characters per chunk (a longer unbroken line is kept whole)
    chunk_overlap=200,    # Characters shared between chunks
    length_function=len,
)

text = "Your long document text here..."
chunks = splitter.split_text(text)

When to use: Quick prototyping, uniform documents with no clear structure, initial baseline before trying more sophisticated methods.

Limitations: Chunks can split mid-sentence or mid-paragraph, breaking semantic coherence. A chunk about "employee benefits" might end halfway through the vacation policy, making it useless for answering questions about PTO.
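
To make the boundary problem concrete, here is a minimal plain-Python sketch of a pure sliding-window chunker. It is not LangChain's implementation; the function name `fixed_size_chunks` and the sample text are assumptions made for illustration.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Slice text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample text
sample = "Employees accrue vacation monthly. Unused days roll over once."
fixed_chunks = fixed_size_chunks(sample, chunk_size=30, overlap=10)
for c in fixed_chunks:
    print(repr(c))
```

With these toy values the first chunk ends partway through the word "monthly", which is exactly the kind of cut that hurts retrieval.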

Recursive Character Splitting

This is LangChain's recommended splitter for generic text and a sensible default for most RAG systems. It tries to split on paragraph boundaries first, then sentences, then words, recursively falling back to smaller separators.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],  # Priority order
    length_function=len,
)

document_text = "Your long document text here..."  # Placeholder input
chunks = splitter.split_text(document_text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars")
    print(chunk[:100])
    print("---")

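The separators list defines the priority: first try to split on double newlines (paragraph boundaries), then single newlines, then sentence ends (". "), then spaces between words, and finally, with the empty string, between any two characters. A finer separator is only used for pieces that are still larger than chunk_size, which preserves semantic boundaries as much as possible.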

When to use: General-purpose RAG systems, most document types, when you want a reliable default.
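
The fallback logic can be sketched in a few lines of plain Python. This is a simplified illustration rather than LangChain's actual algorithm: it ignores overlap and drops the separator at each chunk boundary, and the name `recursive_split` and the sample text are assumptions.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Nothing coarser left to try: fall back to hard character slicing
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbors while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text with three paragraphs
policy_text = (
    "First paragraph about benefits.\n\n"
    "Second paragraph. It has two sentences that run long.\n\n"
    "Third."
)
recursive_chunks = recursive_split(policy_text, 40, ["\n\n", "\n", ". ", " ", ""])
for c in recursive_chunks:
    print(repr(c))
```

Only the second paragraph exceeds the 40-character limit, so only that paragraph is broken down further, at its sentence boundary.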

Semantic Chunking

Instead of splitting by character count, semantic chunking splits based on meaning. It computes embeddings for sentences and places chunk boundaries where the semantic similarity between consecutive sentences drops significantly.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # Split at the top 5% dissimilarity
)

chunks = splitter.split_text(document_text)

When to use: Documents that cover multiple topics with no clear formatting, conversational transcripts, long-form articles where topic shifts are not marked by headers.

Limitations: Slower than character-based splitting (requires embedding every sentence), more expensive (API calls for embeddings), and the results can be unpredictable for short documents.
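
To show the mechanics without calling an embedding API, the sketch below swaps in a toy bag-of-words vector for the real embedding model. That substitution, the helper names, and the sample sentences are all assumptions; only the idea of breaking where the distance between consecutive sentences exceeds a percentile threshold comes from the text above.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile_value(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ranked = sorted(values)
    pos = (len(ranked) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ranked[lo] + (ranked[hi] - ranked[lo]) * (pos - lo)


def semantic_chunks(sentences: list[str], percentile: float = 95) -> list[str]:
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [toy_embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile_value(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large jump in meaning: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Hypothetical sample sentences covering two topics
sentences = [
    "Vacation days accrue every month.",
    "Unused vacation days roll over each year.",
    "The API returns JSON responses.",
    "Each API response includes a status code.",
]
topic_chunks = semantic_chunks(sentences, percentile=95)
for c in topic_chunks:
    print(repr(c))
```

The only break lands between the vacation sentences and the API sentences, where the two neighbors share no words at all.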

Sentence-Based Splitting

Split text into individual sentences or groups of sentences. This produces very precise chunks but may lack context.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure to split primarily on sentence boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", "? ", "! ", " ", ""],
)

chunks = splitter.split_text(document_text)

When to use: FAQ databases (each question-answer pair is naturally a chunk), legal documents where individual clauses matter, any content where precision is more important than context.
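
If you want chunk boundaries to fall only at sentence ends, one option is to split with a regular expression and then group whole sentences up to a size limit. The sketch below assumes simple punctuation rules (abbreviations such as "e.g." would be split incorrectly), and the function names and FAQ text are made up for illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence  # a single long sentence is kept whole
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical FAQ text
faq = "How much PTO do I get? You get 20 days per year. Can I carry it over? Yes, up to 5 days."
faq_chunks = sentence_chunks(faq, max_chars=50)
for c in faq_chunks:
    print(repr(c))
```

With a 50-character limit, each question ends up in the same chunk as its answer, and no sentence is cut.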

Parent-Child Chunking

This is one of the most powerful strategies for production RAG. The idea: use small chunks for retrieval (they match queries precisely) but return larger parent chunks for context (they give the LLM enough information to generate a good answer).

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Small chunks for precise retrieval
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
)

# Larger chunks for context
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)

# Set up the retriever
vectorstore = Chroma(
    collection_name="split_parents",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents (a list of Document objects loaded elsewhere) -- children are embedded, parents are stored
retriever.add_documents(documents)

# Query -- retrieves by child similarity, returns parent chunks
results = retriever.invoke("What is the vacation policy?")

When to use: Production systems where retrieval precision and answer quality both matter, documents with sections that have clear parent-child relationships (chapters containing subsections).
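
The underlying idea does not depend on any particular library. The following sketch illustrates it with a word-overlap score standing in for vector similarity; the function names, the scoring method, and the sample policies are assumptions, not how ParentDocumentRetriever is implemented.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def build_index(parents: list[str], child_size: int):
    """Store parents by id and index small child slices that point back to them."""
    docstore = dict(enumerate(parents))
    child_index = []
    for parent_id, parent in docstore.items():
        for start in range(0, len(parent), child_size):
            child_index.append((parent[start:start + child_size], parent_id))
    return docstore, child_index


def retrieve_parents(query: str, docstore: dict, child_index: list, k: int = 1) -> list[str]:
    """Rank children by word overlap (a stand-in for similarity), return their parents."""
    q = words(query)
    ranked = sorted(child_index, key=lambda item: len(q & words(item[0])), reverse=True)
    parent_ids = []
    for _, parent_id in ranked:
        if parent_id not in parent_ids:  # several children can share one parent
            parent_ids.append(parent_id)
        if len(parent_ids) == k:
            break
    return [docstore[pid] for pid in parent_ids]


# Hypothetical parent chunks
parents = [
    "Vacation policy. Employees accrue 1.5 vacation days per month. Unused days roll over for one year.",
    "Expense policy. Submit receipts within 30 days. Meals are capped at a daily limit.",
]
docstore, child_index = build_index(parents, child_size=40)
matched = retrieve_parents("How many vacation days do employees accrue?", docstore, child_index)
print(matched[0])
```

The query matches a 40-character child, but the caller receives the full vacation policy, including the roll-over rule that the matching child did not contain.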

Markdown and Header-Based Splitting

For structured documents (Markdown, HTML, documentation), splitting on headers preserves the natural structure of the content.

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "header_1"),
    ("##", "header_2"),
    ("###", "header_3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,  # Keep headers in chunk text
)

chunks = splitter.split_text(markdown_text)

for chunk in chunks:
    print(f"Headers: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:100]}")
    print("---")

When to use: Documentation, knowledge base articles, any content where headings reliably indicate topic boundaries.
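
For readers who want to see what header-based splitting does under the hood, here is a rough plain-Python sketch. It assumes ATX-style headers ("#", "##", "###") and does not handle "#" lines inside fenced code blocks; the function name and the output dictionary layout are illustrative, not LangChain's.

```python
import re


def split_by_headers(markdown_text: str, max_level: int = 3) -> list[dict]:
    """Split at headers up to max_level and record the header trail as metadata."""
    header_re = re.compile(r"^(#{1,6})\s+(.*)$")
    chunks = []
    path = {}   # current header trail, e.g. {1: "Handbook", 2: "Benefits"}
    body = []   # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            metadata = {f"header_{level}": title for level, title in sorted(path.items())}
            chunks.append({"metadata": metadata, "content": content})
        body.clear()

    for line in markdown_text.splitlines():
        match = header_re.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level
            for stale in [lvl for lvl in path if lvl >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical Markdown document
handbook_md = (
    "# Handbook\nIntro text.\n"
    "## Benefits\nHealth and dental.\n"
    "### PTO\n20 days per year.\n"
    "## Expenses\nSubmit receipts."
)
header_chunks = split_by_headers(handbook_md)
for c in header_chunks:
    print(c["metadata"], "->", c["content"])
```

Each chunk carries its full header trail, so the PTO chunk still knows it belongs to Benefits in the Handbook even though its text never says so.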

Chunk Overlap: How Much?

Overlap reduces the chance that information near a chunk boundary is lost. If a key sentence falls right at the split point, a large enough overlap lets it appear whole in at least one of the two neighboring chunks.

Rules of thumb:

10-20% of chunk size is a common starting point. For 1000-character chunks, use 100-200 characters of overlap.
Too little overlap risks losing context at boundaries.
Too much overlap wastes storage and can cause duplicate retrieval results.
Zero overlap is acceptable for naturally bounded chunks (like FAQ entries or code functions) where boundaries are clean.
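
The storage cost of overlap is easy to estimate. The arithmetic below assumes a pure sliding window over a single document, so real splitters that respect separators will give slightly different counts; the function name `chunk_count` and the 10,000-character document are illustrative.

```python
import math


def chunk_count(doc_len: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding windows needed to cover doc_len characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if doc_len <= chunk_size:
        return 1
    step = chunk_size - overlap  # each new chunk adds only `step` new characters
    return 1 + math.ceil((doc_len - chunk_size) / step)


doc_len, chunk_size = 10_000, 1000
baseline = chunk_count(doc_len, chunk_size, 0)
for overlap in (0, 100, 200, 500):
    n = chunk_count(doc_len, chunk_size, overlap)
    print(f"overlap={overlap:>3}: {n} chunks ({n / baseline:.1f}x the zero-overlap count)")
```

For this document, 20% overlap raises the count from 10 to 13 chunks, while 50% overlap nearly doubles it to 19, which is where the storage and duplicate-result costs start to bite.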

Optimal Chunk Sizes

There is no universal best chunk size -- it depends on your use case, embedding model, and question types.

General guidelines:

| Use Case | Chunk Size | Rationale |
|----------|-----------|-----------|
| FAQ / short answers | 200-500 chars | Questions map to specific, concise answers |
| Documentation | 500-1000 chars | Sections need enough context for explanation |
| Legal / policy | 1000-2000 chars | Clauses and policies need full context |
| Code | By function/class | Natural boundaries define chunks |
| Long-form articles | 800-1500 chars | Balance between specificity and context |

Tip: The best way to find your optimal chunk size is to experiment. Create chunk sets at 500, 1000, and 1500 characters, run the same set of test queries against each, and compare a concrete metric, such as how often the top results contain the expected answer. The differences between sizes can be large.
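
Such an experiment can be sketched as a small evaluation loop. Everything here is a toy stand-in: the corpus is tiny (so the sizes are scaled down), retrieval uses word overlap instead of your real embedding model, and the test cases are invented. The structure of the loop is the part worth reusing.

```python
import re


def chunk(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunk(query: str, chunks: list[str]) -> str:
    """Toy retriever: return the chunk sharing the most words with the query."""
    q = tokens(query)
    return max(chunks, key=lambda c: len(q & tokens(c)))


def hit_rate(text: str, size: int, test_cases: list[tuple[str, str]]) -> float:
    """Fraction of queries whose top chunk contains the expected answer text."""
    chunks = chunk(text, size)
    hits = sum(expected in top_chunk(query, chunks) for query, expected in test_cases)
    return hits / len(test_cases)


# Hypothetical corpus and test set
corpus = (
    "Employees accrue 1.5 vacation days per month. "
    "Unused vacation days roll over for one year. "
    "Expense reports are due within 30 days of purchase. "
    "Remote employees receive a one-time home office stipend."
)
test_cases = [
    ("How many vacation days do employees accrue?", "1.5 vacation days"),
    ("When are expense reports due?", "within 30 days"),
]
results = {size: hit_rate(corpus, size, test_cases) for size in (50, 100, len(corpus))}
for size, rate in results.items():
    print(f"chunk_size={size}: hit rate {rate:.2f}")
```

Note that a hit-rate metric like this one always favors larger chunks (a single chunk holding the whole corpus scores perfectly), so in practice it should be paired with a measure of answer quality or of how much irrelevant text each chunk carries.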

Common Chunking Mistakes

Splitting code mid-function. Character-based splitters do not understand code structure. A function split across two chunks loses most of its meaning for both the embedding model and the LLM. Use AST-based or function-boundary splitting for code.
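
For Python sources, the standard library's `ast` module is enough for a first version of function-boundary splitting. This sketch keeps only top-level functions and classes and, as written, does not include decorators or module-level statements; the function name and sample source are illustrative.

```python
import ast


def split_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "code": ast.get_source_segment(source, node),
            })
    return chunks


# Hypothetical module source
source = (
    "import os\n\n"
    "def add(a, b):\n"
    "    return a + b\n\n"
    "class Greeter:\n"
    "    def hello(self):\n"
    '        return "hi"\n'
)
code_chunks = split_python_source(source)
for c in code_chunks:
    print(c["name"])
    print(c["code"])
    print("---")
```

Each chunk is a complete definition, so the embedding model never sees half a function.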

Ignoring document structure. If your documents have headers, use them. Splitting a well-structured document on character count alone throws away valuable structural information.

Chunks that are too small. A tiny fragment like "See section 4.2 for details" is useless: it carries no information on its own. Set minimum chunk sizes and filter out fragments.
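
A post-processing pass can enforce the minimum. The sketch below shows two options, dropping short fragments or merging them into the preceding chunk; the 40-character threshold, the function names, and the sample chunks are assumptions chosen for the example.

```python
def drop_small_chunks(chunks: list[str], min_chars: int) -> list[str]:
    """Discard fragments shorter than min_chars."""
    return [c for c in chunks if len(c.strip()) >= min_chars]


def merge_small_chunks(chunks: list[str], min_chars: int) -> list[str]:
    """Attach short fragments to the previous chunk instead of discarding them."""
    merged = []
    for c in chunks:
        if merged and len(c.strip()) < min_chars:
            merged[-1] = merged[-1] + " " + c.strip()
        else:
            merged.append(c)
    return merged


# Hypothetical splitter output
raw_chunks = [
    "Employees accrue 1.5 vacation days per month of service.",
    "See section 4.2 for details.",
    "Expenses must be submitted within 30 days of purchase.",
]
print(drop_small_chunks(raw_chunks, min_chars=40))
print(merge_small_chunks(raw_chunks, min_chars=40))
```

Merging is the safer default when a fragment might still matter in context, since nothing is lost; dropping keeps the index smaller.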

No metadata propagation. When you split a document into 50 chunks, each chunk should carry the original document's metadata (source, title, date). Without this, you cannot filter by source or cite properly.
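
If you work with LangChain Document objects, `split_documents` copies each document's metadata onto its chunks. When you write your own splitter, the same behavior takes only a few lines. In this sketch the `Chunk` class, the extra `chunk_index` and `start_char` fields, and the sample metadata values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_with_metadata(text: str, doc_metadata: dict, size: int) -> list[Chunk]:
    """Split text and copy the document's metadata onto every chunk."""
    chunks = []
    for index, start in enumerate(range(0, len(text), size)):
        chunks.append(Chunk(
            text=text[start:start + size],
            # Copy rather than share the dict so chunks can be edited independently
            metadata={**doc_metadata, "chunk_index": index, "start_char": start},
        ))
    return chunks


# Hypothetical document metadata
doc_metadata = {"source": "handbook.pdf", "title": "Employee Handbook", "date": "2024-01-15"}
meta_chunks = chunk_with_metadata("a" * 25, doc_metadata, size=10)

# Filtering by source now works at the chunk level
from_handbook = [c for c in meta_chunks if c.metadata["source"] == "handbook.pdf"]
print(len(from_handbook), meta_chunks[-1].metadata)
```

Recording the chunk's position as well as the source makes it possible to cite the exact location of an answer later.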

Using one strategy for everything. Different content types need different chunking strategies. Code should be split by function. FAQs should be split by question-answer pair. Documentation should be split by section. Build your pipeline to handle each type appropriately.

In the next lesson, you will learn how to search these chunks effectively using advanced retrieval techniques that go far beyond basic similarity search.