
How Do You Handle Chunking Strategies for Different Document Types?

Compare chunking strategies for different document types — PDFs, code, HTML, and tables — and learn when each approach works best.


Why This Is Asked

Chunking is deceptively easy to get wrong. Bad chunking breaks up coherent concepts, loses important context, or creates chunks too large to retrieve precisely. Interviewers use this question to see if you think carefully about the data, not just the algorithm.

Key Concepts to Cover

  • Fixed-size chunking — splitting by character or token count
  • Sentence/paragraph chunking — splitting on natural boundaries
  • Semantic chunking — splitting when topic changes
  • Document-structure-aware chunking — using headings, sections, tables
  • Chunk overlap — preserving context across chunk boundaries
  • Metadata enrichment — attaching source info to each chunk
  • Code-specific chunking — function/class boundaries

How to Approach This

1. Why Chunking Matters

The goal of chunking is to create units that are:

  • Small enough to be retrieved precisely
  • Large enough to contain meaningful context
  • Self-contained enough to be understood without surrounding text

There is no universal right chunk size — it depends on the document type and retrieval task.

2. Strategy by Document Type

Plain text / articles:

  • Recursive character splitting: try paragraphs first, then sentences, then words
  • Typical chunk size: 256-512 tokens with 50-token overlap
  • Include document title and section heading as metadata
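A minimal sketch of recursive splitting, using character counts for simplicity (a real pipeline would count tokens with the embedding model's tokenizer, and would add overlap as described below):

```python
def recursive_split(text, max_len=512, seps=("\n\n", ". ", " ")):
    """Split text recursively: paragraphs first, then sentences, then words."""
    if len(text) <= max_len or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    chunks, buf = [], ""
    for piece in text.split(sep):
        candidate = buf + sep + piece if buf else piece
        if len(candidate) <= max_len:
            buf = candidate
        else:
            if buf:
                chunks.append(buf)
            # a single piece may still be too large: fall to the next separator
            if len(piece) > max_len:
                chunks.extend(recursive_split(piece, max_len, rest))
                buf = ""
            else:
                buf = piece
    if buf:
        chunks.append(buf)
    return chunks
```

The key property is that paragraph boundaries are preferred, and finer separators are only used when a paragraph alone exceeds the budget.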

PDFs:

  • Use a PDF parser that preserves layout (PyMuPDF, pdfplumber)
  • Split on page boundaries first, then recursively
  • Handle tables separately — extract as structured data
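A sketch of page-aware PDF chunking. Extraction itself is left to a library (e.g. `[p.extract_text() for p in pdfplumber.open(path).pages]` with pdfplumber); the function below takes already-extracted per-page text and splits within page boundaries while keeping the page number as metadata:

```python
def chunk_pdf_pages(page_texts, max_len=800):
    """Chunk per-page text on paragraph breaks, tagging each chunk with its page."""
    chunks = []
    for page_num, text in enumerate(page_texts, start=1):
        buf = ""
        for para in text.split("\n\n"):
            candidate = buf + "\n\n" + para if buf else para
            if len(candidate) <= max_len:
                buf = candidate
            else:
                if buf:
                    chunks.append({"text": buf, "page": page_num})
                buf = para
        if buf:
            chunks.append({"text": buf, "page": page_num})
    return chunks
```

Because pages are processed independently, no chunk ever spans a page boundary, which keeps the `page` metadata exact for citations.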

Code:

  • Never split mid-function or mid-class
  • Use AST-aware chunking: split at function, class, or module boundaries
  • Include file path and language as metadata
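For Python source, the standard-library `ast` module is enough to sketch AST-aware chunking: parse the module and emit one chunk per top-level function or class, so no chunk is ever cut mid-definition:

```python
import ast

def chunk_python_source(source, file_path="example.py"):
    """Split Python source at top-level function/class boundaries."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "text": ast.get_source_segment(source, node),
                "source": file_path,      # file path as metadata
                "language": "python",
                "symbol": node.name,      # lets retrieval filter by symbol name
            })
    return chunks
```

For other languages, a parser like tree-sitter plays the same role; the principle — split at syntactic boundaries, never by character count — is identical.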

HTML / web pages:

  • Strip navigation, ads, boilerplate before chunking
  • Split on heading tags (h1, h2, h3) to preserve structure
  • Include breadcrumb/URL path as metadata
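The stdlib `html.parser` is enough to sketch both ideas at once — skip boilerplate tags and start a new section at each heading (a production system would more likely use a readability-style extractor plus BeautifulSoup, but the logic is the same):

```python
from html.parser import HTMLParser

class HeadingSplitter(HTMLParser):
    """Split HTML into sections at h1/h2/h3, skipping nav/script boilerplate."""
    SKIP = {"nav", "script", "style", "footer", "header"}
    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.sections = []
        self.current = {"heading": None, "text": ""}
        self.in_heading = False
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            if self.current["text"].strip() or self.current["heading"]:
                self.sections.append(self.current)
            self.current = {"heading": "", "text": ""}
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.in_heading:
            self.current["heading"] += data
        else:
            self.current["text"] += data

    def close(self):
        super().close()
        if self.current["text"].strip() or self.current["heading"]:
            self.sections.append(self.current)
```

Each resulting section carries its own heading, which doubles as retrieval metadata.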

Structured data (CSV, JSON):

  • Embed as natural language descriptions
  • "Row 42: Customer John Smith, order total $245.00, status: shipped"

3. Chunk Overlap

Always add overlap (10-20% of chunk size) to avoid losing information at boundaries.
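Overlap reduces the chance that a sentence straddling a boundary is unrecoverable from either chunk. A sketch using a sliding window over a token list (whitespace tokens stand in for a real tokenizer):

```python
def window_chunks(tokens, size=256, overlap=50):
    """Fixed-size windows where each chunk repeats the last `overlap` tokens
    of the previous one; the step between windows is size - overlap."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```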

4. Metadata Is Part of the Chunk

Every chunk should carry metadata:

{
  "text": "chunk content here...",
  "source": "product-docs/api-reference.pdf",
  "page": 12,
  "section": "Authentication",
  "document_id": "abc123"
}

Common Follow-ups

  1. "How do you choose the right chunk size?" Empirically: test different sizes on your eval dataset and measure retrieval precision@k. Common starting points: 256 tokens for precise retrieval, 512 for richer context.

  2. "What is late chunking and when would you use it?" In standard RAG, you split a document into chunks and embed each chunk independently — so each chunk's embedding has no awareness of the rest of the document. Late chunking (introduced by JinaAI, 2024) instead runs the full document through the embedding model in a single forward pass, producing contextually-aware token-level embeddings where each token has attended to the whole document. You then apply mean-pooling within your chunk boundaries to produce chunk-level vectors. Because the token representations were shaped by full-document context before pooling, the resulting chunk embeddings carry more global context than independently embedded chunks. Useful when document-wide context significantly affects the meaning of individual sections — such as scientific papers where the abstract and conclusions inform the methods section.

  3. "How would you handle documents with very short sections?" Combine short sections into larger chunks to hit target token count. Or embed each Q&A pair individually but add the question as metadata for the answer chunk.
