Chunking — The Art of Splitting Text
Up to this point in the series, we’ve taken something important for granted: that our documents have been split into smaller pieces and each piece has been embedded. But the key question is: how do we split documents?
This question seems simple but it’s one of the most important decisions you’ll make when designing a RAG system. Bad chunking = bad search results = bad answers. It’s that simple.
Why Is Chunking Even Necessary?
Three main reasons:
1. Context Window Limitation: If you feed a complete 50-page document to an LLM, it might not fit in the context window. Even if it does, the model tends to ignore information in the middle (Lost in the Middle).
2. Search Accuracy: If you embed a 50-page document as a single unit, its vector becomes an “average” of all the content. This vector isn’t very close to any specific topic in the document. It’s like asking someone “What’s your specialty?” and they say “Everything.” Someone who clearly states their specialty is more useful.
3. Cost Efficiency: When you only send relevant chunks to the LLM (not the entire document), you consume fewer tokens and API costs decrease.
Chunking is like chopping ingredients before cooking. If you throw a whole potato into the pot, it cooks slowly and unevenly. But if you chop it up, it cooks faster and absorbs flavors better. Of course, if you mince it too finely, it turns to mush and loses its form.
The Core Trade-off: Chunk Size
Chunk size is one of the most important parameters. Let’s look at two extremes:
Very Small Chunks (e.g., 50 words):
- Advantage: Very precise search. Each chunk’s vector represents a specific topic.
- Problem: Context Loss. A single sentence might be meaningless without the preceding paragraph.
- Problem: Too many chunks. Search becomes slower and memory usage increases.
Very Large Chunks (e.g., 2000 words):
- Advantage: Context is preserved. Each chunk has complete information.
- Problem: Search accuracy drops. The vector of a large chunk is an average of different topics.
- Problem: More tokens consumed.
Rule of Thumb: For most use cases, chunks of 200 to 500 words (roughly 250 to 1000 tokens) are a good starting point. But you must test and tune with your actual data.
Chunking Strategies
Now let’s explore different strategies. We’ll start simple and move toward more complex ones:
1. Fixed-Size Chunking
The simplest method: split text into pieces with a fixed number of characters or tokens.
def fixed_size_chunk(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks"""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap  # Overlap to preserve context
    return chunks
text = "Your long text..."
chunks = fixed_size_chunk(text, chunk_size=500, overlap=50)
Pros: Simple, fast, predictable.
Cons: May cut in the middle of a sentence or even a word! Ignores text structure.
2. Sentence-Based Chunking
Split text into sentences, then group sentences until reaching the desired size.
import nltk
nltk.download('punkt')  # one-time download of the sentence tokenizer models
def sentence_chunk(text, max_chunk_size=500):
    """Split based on sentences"""
    sentences = nltk.sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_size = 0
    for sentence in sentences:
        if current_size + len(sentence) > max_chunk_size and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_size = 0
        current_chunk.append(sentence)
        current_size += len(sentence)
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
Pros: Never cuts mid-sentence. More natural.
Cons: Variable chunk sizes. Sentence tokenizers may not work well for all languages.
3. Recursive Character Splitting
This approach was popularized by LangChain. The idea: first try splitting with large separators (like double newlines). If chunks are still too large, continue with smaller separators (single newline, period, space).
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", "?", "!", " ", ""]
)
chunks = splitter.split_text(long_text)
Pros: Preserves text structure as much as possible. Very flexible.
Cons: Requires tuning separators for each language.
Recommendation: RecursiveCharacterTextSplitter is the best starting point for most projects. It's simple, performs well, and is easy to configure.
4. Semantic Chunking
The most advanced method: use embeddings to detect where the topic changes. Wherever the semantic similarity between consecutive sentences drops sharply, that's where you split.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = chunker.split_text(long_text)
Pros: Each chunk has a coherent topic. Best search quality.
Cons: Slow (requires embedding each sentence). Costly. More complex.
Overlap — Why It Matters
Overlap means some content at the end of each chunk is shared with the beginning of the next chunk. Why?
Suppose a paragraph is split like this:
Chunk 1: "... GPT-4 has many capabilities. One of the most important"
Chunk 2: "is image processing which wasn't in the previous version. This feature ..."
Without overlap, the information "GPT-4 has image processing capability" isn't complete in either chunk. But with overlap:
Chunk 1: "... GPT-4 has many capabilities. One of the most important is image processing"
Chunk 2: "One of the most important is image processing which wasn't in the previous version. This feature ..."
Now the complete information exists in at least one chunk.
How much overlap? Usually 10 to 20 percent of the chunk size. So if a chunk is 500 characters, 50 to 100 characters of overlap is appropriate.
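As a quick sanity check, you can verify this behavior with the fixed_size_chunk function from earlier: with overlap=50, the last 50 characters of each chunk reappear at the start of the next one (assuming the text is longer than a single chunk).
text = " ".join(f"This is sentence number {i}." for i in range(300))
chunks = fixed_size_chunk(text, chunk_size=500, overlap=50)
# The tail of each chunk is repeated at the head of the next chunk
assert chunks[0][-50:] == chunks[1][:50]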
Metadata — The Hidden Power of Chunking
Should each chunk contain only text? No! Adding metadata to each chunk is very important:
chunk = {
    "text": "Chunk text...",
    "metadata": {
        "source": "installation_guide.pdf",
        "page": 5,
        "section": "Prerequisites",
        "chunk_index": 3,
        "total_chunks": 12,
        "title": "Software Installation Prerequisites",
        "date": "2024-01-15",
        "language": "en"
    }
}
Metadata helps in several ways:
- Filtering: Search only chunks from a specific document or category (see the sketch after this list)
- Attribution: Tell users which page of which document the answer came from
- Context reconstruction: If a chunk is small, add the section title to preserve context
- Ranking: Give higher priority to newer documents
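As a minimal illustration of the filtering idea, here is a plain-Python sketch that narrows the candidate set by metadata before any vector search runs. It assumes chunks shaped like the example above; filter_chunks is a hypothetical helper, not a library API:
def filter_chunks(chunks, **conditions):
    """Keep only chunks whose metadata matches every given condition."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in conditions.items())
    ]

# Search only within a specific document and language
candidates = filter_chunks(all_chunks, source="installation_guide.pdf", language="en")
In practice, most vector databases expose the same idea natively as a metadata filter on the query, so you usually don't have to write this loop yourself.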
Advanced Techniques
Parent-Child Chunking:
The idea: Small chunks for searching, but larger chunks for sending to the LLM. When a small, relevant chunk is found, send the larger chunk that contains it (the Parent) to the model. This way, search accuracy is high and the model has sufficient context.
# Parent Chunk (large, for LLM)
parent = "Complete paragraph with all details..."
# Child Chunks (small, for search)
children = [
    {"text": "First sentence...", "parent_id": "parent_1"},
    {"text": "Second sentence...", "parent_id": "parent_1"},
    {"text": "Third sentence...", "parent_id": "parent_1"},
]
# When searching: find the child, send the parent to LLM
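A minimal sketch of that retrieval step, assuming the child chunks above have already been embedded and searched (best_child here simply stands in for the top search result):
# Index parents by id so a child hit can be expanded to its parent
parents = {"parent_1": parent}

# Suppose vector search over the child chunks returned this hit
best_child = children[1]

# Send the surrounding parent chunk to the LLM instead of the tiny child
context_for_llm = parents[best_child["parent_id"]]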
Contextual Chunking:
Add the section title or document summary to the beginning of each chunk:
# Without Context
chunk = "This feature was added in version 3.2 and ..."
# With Context (better!)
chunk = "[Document: Software Guide | Section: New Features] This feature was added in version 3.2 and ..."
This way, even if the chunk stands alone, its context is clear.
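A small helper can build such prefixed chunks from the metadata you already attach to each chunk (a sketch; the prefix format is simply the one used in the example above):
def add_context_prefix(chunk_text, doc_title, section):
    """Prepend document and section info so the chunk is understandable on its own."""
    return f"[Document: {doc_title} | Section: {section}] {chunk_text}"

chunk = add_context_prefix(
    "This feature was added in version 3.2 and ...",
    doc_title="Software Guide",
    section="New Features",
)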
Practical Chunking Checklist
Before finalizing your chunking strategy, check this list:
- Is the chunk size appropriate for the content type? (shorter for FAQs, longer for articles)
- Do you have sufficient overlap? (10-20%)
- Is any chunk cut mid-sentence?
- Is sufficient metadata added to each chunk?
- Have you tested with a few real examples?
- Are very small chunks (under 50 words) filtered out?
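That last check is easy to automate; for example, assuming chunks is a list of plain strings:
# Drop fragments that are too short to be meaningful on their own
chunks = [c for c in chunks if len(c.split()) >= 50]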
Testing and Tuning
No chunking strategy is "the best." Everything depends on your data and use case. So:
1. Test multiple strategies: Try several different strategies with your actual data.
2. Have evaluation criteria: For example, prepare 50 questions with correct answers and measure how many of them each strategy answers correctly (a minimal harness is sketched after this list).
3. Visual inspection: Read a few chunks manually. Is each chunk meaningful and understandable on its own?
4. Iterate: Chunking is an iterative process. Start with something simple, then improve based on results.
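Here is a minimal sketch of such an evaluation, assuming you already have a search(question, top_k) function for each strategy and a small test set where every question is paired with a keyword that the correct chunk must contain. (Both the test set and the keyword criterion are simplifications; in practice you would label expected chunk IDs or grade full answers.)
test_set = [
    {"question": "Which version added image processing?", "expected_keyword": "3.2"},
    # ... around 50 of these
]

def hit_rate(search, test_set, top_k=5):
    """Fraction of questions for which a retrieved chunk contains the expected keyword."""
    hits = 0
    for item in test_set:
        results = search(item["question"], top_k=top_k)
        if any(item["expected_keyword"] in chunk for chunk in results):
            hits += 1
    return hits / len(test_set)

# Compare strategies by their hit rate, e.g. hit_rate(search_fixed, test_set) vs. hit_rate(search_semantic, test_set)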
Common Mistake: Many people spend hours tuning the LLM (Prompt Engineering, Temperature, ...) but rush through Chunking. Yet most RAG problems are rooted in bad chunking. If relevant information isn't found, even the best LLM in the world can't give a good answer.
Summary
In this episode you learned:
- Chunking is essential for search accuracy, context window limitations, and cost efficiency
- Chunk size is a trade-off between precision and context (200-500 words is a good starting point)
- Four main strategies: Fixed-Size, Sentence-Based, Recursive, and Semantic
- Overlap prevents important information from being cut off
- Add metadata to each chunk (source, page, section)
- Always test and tune with real data
These first five episodes have built the foundations of RAG for you. Now you know why RAG is needed, how it works, what Embedding is, how Vector Databases search, and how to split text. In upcoming episodes, we'll move on to practical implementation and more advanced techniques. Get ready!