Why This Is Asked
Chunking looks simple but is deceptively easy to get wrong. Bad chunking splits coherent concepts apart, drops important context, or produces chunks too large to retrieve precisely. Interviewers use this question to see whether you think carefully about the data, not just the algorithm.
Key Concepts to Cover
- Fixed-size chunking — splitting by character or token count
- Sentence/paragraph chunking — splitting on natural boundaries
- Semantic chunking — splitting when topic changes
- Document-structure-aware chunking — using headings, sections, tables
- Chunk overlap — preserving context across chunk boundaries
- Metadata enrichment — attaching source info to each chunk
- Code-specific chunking — function/class boundaries
How to Approach This
1. Why Chunking Matters
The goal of chunking is to create units that are:
- Small enough to be retrieved precisely
- Large enough to contain meaningful context
- Self-contained enough to be understood without surrounding text
There is no universal right chunk size — it depends on the document type and retrieval task.
2. Strategy by Document Type
Plain text / articles:
- Recursive character splitting: try paragraphs first, then sentences, then words
- Typical chunk size: 256-512 tokens with 50-token overlap
- Include document title and section heading as metadata
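The recursive splitting idea above can be sketched as follows. This is a minimal illustration, not a production splitter: it approximates token count by whitespace word count, whereas a real pipeline would use the embedding model's tokenizer.

```python
def recursive_split(text, max_tokens=256, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first (paragraphs), falling back to
    finer ones (sentences, words) only for pieces that are still too big.
    Token count is approximated by whitespace word count."""
    def n_tokens(s):
        return len(s.split())

    if n_tokens(text) <= max_tokens or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if n_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if n_tokens(piece) > max_tokens:
                # Piece alone is still too large: recurse with finer separators
                chunks.extend(recursive_split(piece, max_tokens, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

The key design choice is the separator hierarchy: paragraph breaks are tried before sentence breaks, so natural boundaries are preserved whenever the budget allows.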
PDFs:
- Use a PDF parser that preserves layout (PyMuPDF, pdfplumber)
- Split on page boundaries first, then recursively
- Handle tables separately — extract as structured data
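A sketch of page-first PDF chunking along these lines, assuming the third-party pdfplumber package is available (`pip install pdfplumber`); the import is deferred into the function so the sketch can be defined without it:

```python
def chunk_pdf(path):
    """Page-first PDF chunking; tables are kept as structured rows
    rather than flattened into the page text.

    Assumption: the third-party pdfplumber library is installed.
    """
    import pdfplumber  # deferred: only needed when actually parsing

    chunks = []
    with pdfplumber.open(path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            if text.strip():
                chunks.append({"text": text, "page": page_number, "type": "text"})
            for table in page.extract_tables():
                chunks.append({"rows": table, "page": page_number, "type": "table"})
    return chunks
```

Each page's text would then be split further with a recursive splitter, while table chunks can be rendered as natural-language rows before embedding.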
Code:
- Never split mid-function or mid-class
- Use AST-aware chunking: split at function, class, or module boundaries
- Include file path and language as metadata
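For Python source specifically, the standard-library `ast` module gives these boundaries directly. A minimal sketch (top-level definitions only; it ignores decorator lines, which start before `node.lineno`):

```python
import ast

def chunk_python_source(source, file_path="unknown.py"):
    """Split Python source at top-level function/class boundaries,
    attaching file path and language as metadata."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+)
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({
                "text": text,
                "name": node.name,
                "source": file_path,
                "language": "python",
            })
    return chunks
```

For other languages the same idea applies with a parser such as tree-sitter in place of `ast`.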
HTML / web pages:
- Strip navigation, ads, boilerplate before chunking
- Split on heading tags (h1, h2, h3) to preserve structure
- Include breadcrumb/URL path as metadata
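A minimal heading-based HTML chunker can be built on the standard-library `html.parser`; this sketch strips a few common boilerplate tags and starts a fresh chunk at each h1/h2/h3 (a real pipeline would usually use a dedicated extractor for boilerplate removal):

```python
from html.parser import HTMLParser

class HeadingChunker(HTMLParser):
    """Start a new chunk at each h1/h2/h3; skip common boilerplate tags."""
    HEADINGS = {"h1", "h2", "h3"}
    SKIP = {"nav", "script", "style", "footer"}

    def __init__(self):
        super().__init__()
        self.chunks = [{"heading": None, "text": ""}]
        self._skip_depth = 0
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self.chunks.append({"heading": "", "text": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_heading:
            self.chunks[-1]["heading"] += data
        else:
            self.chunks[-1]["text"] += data
```

Usage: `p = HeadingChunker(); p.feed(html)`, then drop chunks whose text is empty and attach the page URL as metadata.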
Structured data (CSV, JSON):
- Embed as natural language descriptions
- "Row 42: Customer John Smith, order total $245.00, status: shipped"
3. Chunk Overlap
Add overlap between adjacent chunks (a common default is 10-20% of the chunk size) so that information spanning a boundary appears intact in at least one chunk.
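The overlap mechanics reduce to a sliding window over the token sequence, where the step is the chunk size minus the overlap. A minimal sketch:

```python
def window_chunks(tokens, size=256, overlap=50):
    """Fixed-size token windows with overlap; each chunk repeats the last
    `overlap` tokens of the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Stop once the remaining tail is fully covered by the previous window
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```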
4. Metadata is Part of the Chunk
Every chunk should carry metadata:
```json
{
  "text": "chunk content here...",
  "source": "product-docs/api-reference.pdf",
  "page": 12,
  "section": "Authentication",
  "document_id": "abc123"
}
```
Common Follow-ups
- "How do you choose the right chunk size?" Empirically: test different sizes on your eval dataset and measure retrieval precision@k. Common starting points: 256 tokens for precise retrieval, 512 for richer context.
- "What is late chunking and when would you use it?" In standard RAG, you split a document into chunks and embed each chunk independently — so each chunk's embedding has no awareness of the rest of the document. Late chunking (introduced by JinaAI, 2024) instead runs the full document through the embedding model in a single forward pass, producing contextually-aware token-level embeddings where each token has attended to the whole document. You then apply mean-pooling within your chunk boundaries to produce chunk-level vectors. Because the token representations were shaped by full-document context before pooling, the resulting chunk embeddings carry more global context than independently embedded chunks. Useful when document-wide context significantly affects the meaning of individual sections — such as scientific papers where the abstract and conclusions inform the methods section.
- "How would you handle documents with very short sections?" Combine adjacent short sections into larger chunks until you hit the target token count. For FAQ-style documents, embed each Q&A pair individually and attach the question as metadata on the answer chunk.