
How Do You Handle Tables, Charts, and Complex Documents in a RAG Pipeline?

Real-world documents contain tables, charts, and complex layouts that naive text extraction mangles. Walk through how to build a robust document processing pipeline for structured and visual content.

Prep for the full interview loop

Know the concepts. Now prove it. Practice GenAI, Coding, System Design, and AI/ML Design interviews with an AI that tells you exactly where you fell short.

Start a mock interview

Why This Is Asked

Most RAG demos use clean markdown or plain text. Production systems deal with annual reports, financial statements, technical specs, and slide decks — full of tables, charts, and complex layouts. Interviewers ask this to distinguish candidates with real implementation experience from those who've only run tutorials.

Key Concepts to Cover

  • Table extraction and representation — how to make tables retrievable
  • Chart and graph handling — when to use OCR vs. multimodal models
  • Complex PDF parsing — scanned vs. native PDFs, layout extraction
  • Multimodal embeddings — embedding images alongside text
  • Production indexing pipelines — handling heterogeneous document types at scale

How to Approach This

1. The Core Problem

Standard text extraction (PyPDF2, pdfplumber) destroys table structure — columns become one long string, merged cells split incorrectly, and multi-line rows get separated. A chunk containing a broken table fragment will embed poorly and confuse the LLM even if retrieved.

2. Handling Tables

Option A: Convert to Markdown

  • Tools like pdfplumber, camelot, or unstructured can detect table regions and output them as Markdown
  • Markdown tables are human-readable and embed reasonably well for simple lookup queries
  • Limitation: doesn't work well for merged cells, nested headers, or cross-page tables
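A minimal sketch of the Markdown conversion: the `table_to_markdown` helper below is illustrative (not part of any library), and takes rows in the list-of-lists shape that pdfplumber's `extract_tables` returns, where cells may be `None`.

```python
def table_to_markdown(rows):
    """Convert a table (list of rows, as returned by pdfplumber's
    extract_tables) into a Markdown table string."""
    clean = [[(cell or "").replace("\n", " ").strip() for cell in row]
             for row in rows]
    header, *body = clean
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

# Typical usage with pdfplumber (pip install pdfplumber):
# import pdfplumber
# with pdfplumber.open("report.pdf") as pdf:
#     for table in pdf.pages[0].extract_tables():
#         print(table_to_markdown(table))
```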

Option B: Convert to structured JSON

  • Represent each table as JSON (array of row objects with column keys)
  • Store the JSON alongside a natural-language description generated by an LLM: "This table shows quarterly revenue by region for 2023-2024"
  • Embed the description, store the JSON as metadata, retrieve description + inject full JSON into context
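The description-plus-JSON pattern can be sketched as follows. The chunk shape is illustrative, and the description string here is hardcoded where a real pipeline would generate it with an LLM.

```python
import json

def table_to_chunk(header, rows, description):
    """Package a table as a retrieval chunk: the natural-language
    description is what gets embedded; the full JSON rides along as
    metadata and is injected into the prompt when the chunk is retrieved."""
    records = [dict(zip(header, row)) for row in rows]
    return {
        "text": description,  # embed this
        "metadata": {"type": "table", "table_json": json.dumps(records)},
    }

chunk = table_to_chunk(
    ["Region", "Q1 2024"],
    [["EMEA", "1.2M"], ["APAC", "0.9M"]],
    "Quarterly revenue by region for Q1 2024.",  # LLM-generated in practice
)
```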

Option C: Embed per-row or per-cell

  • For very large tables (financial statements, pricing tables), embed each row as a separate chunk with column headers prepended
  • Allows precise retrieval of individual data points rather than the whole table
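Per-row chunking is straightforward to sketch; the serialization format below (header-prefixed `key: value` pairs) is one reasonable choice, not a standard.

```python
def table_to_row_chunks(header, rows):
    """Emit one chunk per table row, with column headers prepended so
    each row is self-describing when retrieved on its own."""
    chunks = []
    for i, row in enumerate(rows):
        text = "; ".join(f"{h}: {v}" for h, v in zip(header, row))
        chunks.append({"text": text, "metadata": {"row_index": i}})
    return chunks
```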

For large tables with many rows: use structured retrieval — query a SQL-style index rather than vector similarity, or use text-to-SQL with the table stored in a relational format alongside the vector index.
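The structured-retrieval route can be sketched with SQLite: load the extracted rows into a relational table and answer aggregate queries with SQL instead of vector similarity. Table and column names are illustrative; in a text-to-SQL setup, an LLM would generate the query from the user's question.

```python
import sqlite3

# In-memory table standing in for an extracted financial table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO revenue VALUES (?, ?, ?)",
    [("EMEA", "Q1", 1.2), ("APAC", "Q1", 0.9), ("EMEA", "Q2", 1.4)],
)

# An aggregate question vector search handles poorly, answered exactly:
total_emea = conn.execute(
    "SELECT SUM(amount) FROM revenue WHERE region = 'EMEA'"
).fetchone()[0]
```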

3. Handling Charts and Graphs

Charts contain information that text extraction completely misses.

Option A: Caption + surrounding text

  • If the chart has a title and caption, those can be extracted and embedded
  • Works for simple labeled charts; fails for complex visualizations with no surrounding text

Option B: Multimodal LLM description

  • Pass the chart image to a vision model (GPT-4o, Claude 3.5 Sonnet) and ask it to describe the key insights: "This bar chart shows model accuracy by dataset. GPT-4o leads on MMLU (87.4%) while Claude 3.5 Sonnet leads on HumanEval (92%)."
  • Store the generated description as a text chunk; embed and retrieve normally
  • Higher cost but much higher accuracy
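A sketch of the request side, using the OpenAI chat-completions image format. The model name and prompt wording are illustrative; executing the request (`client.chat.completions.create(**payload)`) requires the `openai` package and an API key, so only the payload construction is shown.

```python
import base64

def chart_description_request(image_bytes, model="gpt-4o"):
    """Build an OpenAI-style chat payload asking a vision model to
    describe a chart image, which is sent inline as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the key insights in this chart, "
                         "including axis labels and notable values."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```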

Option C: Multimodal embeddings

  • Models like CLIP or nomic-embed-vision can embed images directly into the same vector space as text
  • Allows retrieving charts in response to text queries
  • Still emerging; text description approach is more mature for production

4. Scanned vs. Native PDFs

  • Native PDFs: text is encoded in the PDF structure; extractable with pdfplumber or pypdf
  • Scanned PDFs: images of pages; require OCR (Tesseract, AWS Textract, Google Document AI)
  • Detect which type you have before choosing a parser — running OCR on a native PDF wastes cost and degrades quality
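A simple detection heuristic: extract text from every page and check how many pages yield a meaningful amount. The 50-character threshold and the majority rule are rough, tunable assumptions, not a standard.

```python
def classify_pdf(page_texts, min_chars=50):
    """Heuristic: if most pages yield almost no extractable text, the
    PDF is likely scanned and needs OCR. `page_texts` is the list of
    per-page strings from a text extractor."""
    textual = sum(1 for t in page_texts if len((t or "").strip()) >= min_chars)
    return "native" if textual >= len(page_texts) / 2 else "scanned"

# Typical usage with pypdf (pip install pypdf):
# from pypdf import PdfReader
# reader = PdfReader("report.pdf")
# kind = classify_pdf([page.extract_text() or "" for page in reader.pages])
```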

For high-accuracy production systems (annual reports, legal docs), AWS Textract or Google Document AI provide table-aware extraction with layout understanding that open-source tools can't match.

5. Production Document Processing Pipeline

Document ingested
    ↓
Detect type (PDF/DOCX/HTML/image)
    ↓
Parse with type-appropriate extractor
    ↓
Layout analysis (detect tables, figures, headers)
    ↓
Per-element processing:
    - Text blocks → chunking → embedding
    - Tables → structured extraction → JSON + description → embedding
    - Charts/images → vision LLM description → embedding
    ↓
Store chunks + metadata in vector DB

Use unstructured.io as an open-source starting point — it handles multi-element classification and extraction across document types. For scale, run as a distributed worker pool with document-level idempotency (skip already-processed docs on re-runs).
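The per-element routing step above can be sketched as a dispatcher. The element dicts loosely mirror the typed elements a parser like unstructured.io emits; the field names here are illustrative, and the handlers are placeholders for the table/chart processing described earlier.

```python
def process_element(element):
    """Route one parsed element to the right handler and return a
    chunk ready for embedding. Field names are illustrative."""
    kind = element["type"]
    if kind == "table":
        return {"text": element["description"],
                "metadata": {"type": "table", "table_json": element["rows"]}}
    if kind == "image":
        return {"text": element["vision_description"],
                "metadata": {"type": "chart"}}
    return {"text": element["text"], "metadata": {"type": "text"}}

def ingest(elements):
    # One chunk per element; in production this feeds the embedder and
    # vector DB, keyed by a document-level idempotency hash on re-runs.
    return [process_element(e) for e in elements]
```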

Common Follow-ups

  1. "How do you handle a table that spans multiple pages?" Most parsers split at page boundaries. Detection requires checking whether the last row of page N and the first row of page N+1 share the same header structure. Camelot's `pages` option lets you parse across a page range, but it won't stitch the fragments back together — post-process by merging fragments with identical headers.
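The header-matching merge can be sketched as a post-processing pass over per-page fragments. This assumes the continuation fragment repeats the header row; if it doesn't, matching on column count is a weaker fallback.

```python
def merge_spanning_tables(tables):
    """Merge per-page table fragments: if a fragment's first row matches
    the previous table's header, assume the table continued across the
    page break and append its body rows (dropping the repeated header)."""
    merged = []
    for table in tables:
        if merged and table and merged[-1] and table[0] == merged[-1][0]:
            merged[-1].extend(table[1:])
        else:
            merged.append([row[:] for row in table])
    return merged
```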

  2. "How would you index and retrieve from a 500-page annual report?" Treat it as a hierarchical document: section → subsection → chunk. Preserve section metadata (e.g., "Management Discussion and Analysis > Revenue") on every chunk. For the financial statements section, use structured table extraction. Use metadata filters to restrict retrieval to relevant sections before running vector search.
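Section-scoped retrieval can be sketched as a metadata filter applied before scoring. The term-overlap scorer below is a toy stand-in for vector similarity, and the chunk/metadata shapes are illustrative.

```python
def filtered_search(chunks, section_prefix, query_terms):
    """Restrict retrieval to one section via metadata, then rank the
    survivors. Real systems would score with embedding similarity."""
    candidates = [c for c in chunks
                  if c["metadata"]["section"].startswith(section_prefix)]

    def score(chunk):
        return sum(t.lower() in chunk["text"].lower() for t in query_terms)

    return sorted(candidates, key=score, reverse=True)

chunks = [
    {"text": "Revenue grew 12% year over year.",
     "metadata": {"section": "MD&A > Revenue"}},
    {"text": "The board met four times.",
     "metadata": {"section": "Governance"}},
]
top = filtered_search(chunks, "MD&A", ["revenue"])
```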

  3. "When would you use a multimodal RAG approach end-to-end?" When documents are primarily visual (architecture diagrams, slide decks, engineering schematics) and text extraction would lose most of the information. Multimodal embedding + multimodal LLM retrieval is still maturing — use it when text-based extraction genuinely fails, not as a default.

