Why This Is Asked
Chunking looks simple but is deceptively easy to get wrong. Bad chunking splits coherent concepts apart, drops important context, or produces chunks too large to retrieve precisely. Interviewers use this question to see whether you think carefully about the data, not just the algorithm.
Key Concepts to Cover
- Fixed-size chunking — splitting by character or token count
- Sentence/paragraph chunking — splitting on natural boundaries
- Semantic chunking — splitting when topic changes
- Document-structure-aware chunking — using headings, sections, tables
- Chunk overlap — preserving context across chunk boundaries
- Metadata enrichment — attaching source info to each chunk
- Code-specific chunking — function/class boundaries
How to Approach This
1. Why Chunking Matters
The goal of chunking is to create units that are:
- Small enough to be retrieved precisely
- Large enough to contain meaningful context
- Self-contained enough to be understood without surrounding text
There is no universal right chunk size — it depends on the document type and retrieval task.
2. Strategy by Document Type
Plain text / articles:
- Recursive character splitting: try paragraphs first, then sentences, then words
- Typical chunk size: 256-512 tokens with 50-token overlap
- Include document title and section heading as metadata
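The recursive splitting idea above can be sketched as follows. This is a minimal illustration, not a production splitter: it approximates token count by whitespace word count, whereas a real pipeline would use the embedding model's tokenizer.

```python
def recursive_split(text, max_tokens=256, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first (paragraphs), falling back to
    finer ones (sentences, words) only for pieces that are still too big.
    Token count is approximated by whitespace word count."""
    def n_tokens(s):
        return len(s.split())

    if n_tokens(text) <= max_tokens or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if n_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if n_tokens(piece) > max_tokens:
                # Piece alone is still too large: recurse with finer separators
                chunks.extend(recursive_split(piece, max_tokens, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

The key design choice is the separator hierarchy: paragraph breaks are tried before sentence breaks, so natural boundaries are preserved whenever the budget allows.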
PDFs:
- Use a PDF parser that preserves layout (PyMuPDF, pdfplumber)
- Split on page boundaries first, then recursively
- Handle tables separately — extract as structured data
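A sketch of page-first PDF chunking along these lines, assuming the third-party pdfplumber package is available (`pip install pdfplumber`); the import is deferred into the function so the sketch can be defined without it:

```python
def chunk_pdf(path):
    """Page-first PDF chunking; tables are kept as structured rows
    rather than flattened into the page text.

    Assumption: the third-party pdfplumber library is installed.
    """
    import pdfplumber  # deferred: only needed when actually parsing

    chunks = []
    with pdfplumber.open(path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            if text.strip():
                chunks.append({"text": text, "page": page_number, "type": "text"})
            for table in page.extract_tables():
                chunks.append({"rows": table, "page": page_number, "type": "table"})
    return chunks
```

Each page's text would then be split further with a recursive splitter, while table chunks can be rendered as natural-language rows before embedding.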
Code:
- Never split mid-function or mid-class
- Use AST-aware chunking: split at function, class, or module boundaries
- Include file path and language as metadata
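For Python source specifically, the standard-library `ast` module gives these boundaries directly. A minimal sketch (top-level definitions only; it ignores decorator lines, which start before `node.lineno`):

```python
import ast

def chunk_python_source(source, file_path="unknown.py"):
    """Split Python source at top-level function/class boundaries,
    attaching file path and language as metadata."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+)
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({
                "text": text,
                "name": node.name,
                "source": file_path,
                "language": "python",
            })
    return chunks
```

For other languages the same idea applies with a parser such as tree-sitter in place of `ast`.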
HTML / web pages:
- Strip navigation, ads, boilerplate before chunking
- Split on heading tags (h1, h2, h3) to preserve structure
- Include breadcrumb/URL path as metadata
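A minimal heading-based HTML chunker can be built on the standard-library `html.parser`; this sketch strips a few common boilerplate tags and starts a fresh chunk at each h1/h2/h3 (a real pipeline would usually use a dedicated extractor for boilerplate removal):

```python
from html.parser import HTMLParser

class HeadingChunker(HTMLParser):
    """Start a new chunk at each h1/h2/h3; skip common boilerplate tags."""
    HEADINGS = {"h1", "h2", "h3"}
    SKIP = {"nav", "script", "style", "footer"}

    def __init__(self):
        super().__init__()
        self.chunks = [{"heading": None, "text": ""}]
        self._skip_depth = 0
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self.chunks.append({"heading": "", "text": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_heading:
            self.chunks[-1]["heading"] += data
        else:
            self.chunks[-1]["text"] += data
```

Usage: `p = HeadingChunker(); p.feed(html)`, then drop chunks whose text is empty and attach the page URL as metadata.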
Structured data (CSV, JSON):
- Embed as natural language descriptions
- "Row 42: Customer John Smith, order total $245.00, status: shipped"
3. Chunk Overlap
Add overlap between adjacent chunks (a common default is 10-20% of the chunk size) so that information spanning a boundary appears intact in at least one chunk.
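The overlap mechanics reduce to a sliding window over the token sequence, where the step is the chunk size minus the overlap. A minimal sketch:

```python
def window_chunks(tokens, size=256, overlap=50):
    """Fixed-size token windows with overlap; each chunk repeats the last
    `overlap` tokens of the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Stop once the remaining tail is fully covered by the previous window
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```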
4. Metadata is Part of the Chunk
Every chunk should carry metadata:
```json
{
  "text": "chunk content here...",
  "source": "product-docs/api-reference.pdf",
  "page": 12,
  "section": "Authentication",
  "document_id": "abc123"
}
```
Common Follow-ups
- "How do you choose the right chunk size?" Empirically: test different sizes on your eval dataset and measure retrieval precision@k. Common starting points: 256 tokens for precise retrieval, 512 for richer context.
- "What is late chunking and when would you use it?" In standard RAG, you split a document into chunks and embed each chunk independently — so each chunk's embedding has no awareness of the rest of the document. Late chunking (introduced by JinaAI, 2024) instead runs the full document through the embedding model in a single forward pass, producing contextually-aware token-level embeddings where each token has attended to the whole document. You then apply mean-pooling within your chunk boundaries to produce chunk-level vectors. Because the token representations were shaped by full-document context before pooling, the resulting chunk embeddings carry more global context than independently embedded chunks. Useful when document-wide context significantly affects the meaning of individual sections — such as scientific papers where the abstract and conclusions inform the methods section.
- "How would you handle documents with very short sections?" Combine adjacent short sections into larger chunks until you hit the target token count. For FAQ-style documents, embed each Q&A pair individually and attach the question as metadata on the answer chunk.