Why This Is Asked
Most RAG demos use clean markdown or plain text. Production systems deal with annual reports, financial statements, technical specs, and slide decks — full of tables, charts, and complex layouts. Interviewers ask this to distinguish candidates with real implementation experience from those who've only run tutorials.
Key Concepts to Cover
- Table extraction and representation — how to make tables retrievable
- Chart and graph handling — when to use OCR vs. multimodal models
- Complex PDF parsing — scanned vs. native PDFs, layout extraction
- Multimodal embeddings — embedding images alongside text
- Production indexing pipelines — handling heterogeneous document types at scale
How to Approach This
1. The Core Problem
Standard text extraction (PyPDF2, pdfplumber) destroys table structure — columns become one long string, merged cells split incorrectly, and multi-line rows get separated. A chunk containing a broken table fragment will embed poorly and confuse the LLM even if retrieved.
2. Handling Tables
Option A: Convert to Markdown
- Tools like `pdfplumber`, `camelot`, or `unstructured` can detect table regions and output them as Markdown
- Markdown tables are human-readable and embed reasonably well for simple lookup queries
- Limitation: doesn't work well for merged cells, nested headers, or cross-page tables
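The rows-to-Markdown step is simple once extraction is done. A minimal sketch, assuming the extractor (e.g. pdfplumber's `extract_tables()`) has already produced each table as a list of rows with the first row as the header:

```python
def table_to_markdown(rows):
    """Render extracted rows (first row = header) as a Markdown table."""
    def fmt(row):
        return "| " + " | ".join("" if c is None else str(c).strip() for c in row) + " |"
    header, *body = rows
    return "\n".join([fmt(header), fmt(["---"] * len(header))] + [fmt(r) for r in body])

md = table_to_markdown([["Region", "Q1", "Q2"], ["EMEA", "1.2M", "1.4M"]])
```

The `None` check matters in practice: pdfplumber emits `None` for empty cells, which would otherwise render as the literal string "None".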
Option B: Convert to structured JSON
- Represent each table as JSON (array of row objects with column keys)
- Store the JSON alongside a natural-language description generated by an LLM: "This table shows quarterly revenue by region for 2023-2024"
- Embed the description, store the JSON as metadata, retrieve description + inject full JSON into context
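A minimal sketch of the description-plus-JSON pattern; the description string here is hand-written to stand in for the LLM-generated one:

```python
import json

def table_to_records(rows):
    """Convert extracted rows (first row = header) to JSON-serializable records."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

records = table_to_records([["Region", "2023", "2024"], ["EMEA", "1.2M", "1.5M"]])
chunk = {
    # This is what gets embedded and matched against queries...
    "text": "This table shows annual revenue by region for 2023-2024.",
    # ...and this is what gets injected into the LLM context after retrieval.
    "metadata": {"table_json": json.dumps(records)},
}
```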
Option C: Embed per-row or per-cell
- For very large tables (financial statements, pricing tables), embed each row as a separate chunk with column headers prepended
- Allows precise retrieval of individual data points rather than the whole table
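Per-row chunking reduces to header-prefixed serialization. A sketch, where the table-name prefix is an illustrative convention for keeping rows attributable after retrieval:

```python
def rows_to_chunks(rows, table_name="table"):
    """One chunk per data row, with column headers prepended for context."""
    header, *body = rows
    return [
        f"{table_name}: " + "; ".join(f"{h}: {v}" for h, v in zip(header, row))
        for row in body
    ]

chunks = rows_to_chunks([["Product", "Price"], ["Basic", "$10"], ["Pro", "$49"]],
                        table_name="Pricing")
```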
For large tables with many rows: use structured retrieval — query a SQL-style index rather than vector similarity, or use text-to-SQL with the table stored in a relational format alongside the vector index.
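The structured-retrieval path can be as light as loading the extracted rows into SQLite next to the vector index. A sketch with an illustrative schema; in a real system the final query would come from a text-to-SQL layer rather than being hard-coded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO revenue VALUES (?, ?, ?)",
    [("EMEA", "Q1", 1.2), ("EMEA", "Q2", 1.4), ("APAC", "Q1", 0.9)],
)

# A text-to-SQL step would generate this from "What was EMEA revenue in Q2?"
(amount,) = conn.execute(
    "SELECT amount FROM revenue WHERE region = 'EMEA' AND quarter = 'Q2'"
).fetchone()
```

This answers exact-value questions that vector similarity handles poorly, such as aggregations or comparisons across many rows.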
3. Handling Charts and Graphs
Charts contain information that text extraction completely misses.
Option A: Caption + surrounding text
- If the chart has a title and caption, those can be extracted and embedded
- Works for simple labeled charts; fails for complex visualizations with no surrounding text
Option B: Multimodal LLM description
- Pass the chart image to a vision model (GPT-4o, Claude 3.5 Sonnet) and ask it to describe the key insights: "This bar chart shows model accuracy by dataset. GPT-4o leads on MMLU (87.4%) while Claude 3.5 Sonnet leads on HumanEval (92%)."
- Store the generated description as a text chunk; embed and retrieve normally
- Higher cost but much higher accuracy
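With the OpenAI SDK this looks roughly like the following. The prompt wording and model choice are illustrative, and the network call itself is left commented out; the runnable part just builds the multimodal message payload:

```python
import base64

def chart_description_messages(image_bytes, media_type="image/png"):
    """Build a multimodal chat payload asking a vision model to describe a chart."""
    b64 = base64.b64encode(image_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe the key insights in this chart, including axis "
                     "labels and notable values, in 2-3 sentences."},
            {"type": "image_url",
             "image_url": {"url": f"data:{media_type};base64,{b64}"}},
        ],
    }]

messages = chart_description_messages(b"fake-png-bytes")
# from openai import OpenAI
# description = OpenAI().chat.completions.create(
#     model="gpt-4o", messages=messages
# ).choices[0].message.content
# ...then embed `description` as an ordinary text chunk.
```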
Option C: Multimodal embeddings
- Models like CLIP or `nomic-embed-vision` can embed images directly into the same vector space as text
- Allows retrieving charts in response to text queries
- Still emerging; text description approach is more mature for production
4. Scanned vs. Native PDFs
- Native PDFs: text is encoded in the PDF structure; extractable with `pdfplumber` or `pypdf`
- Scanned PDFs: images of pages; require OCR (Tesseract, AWS Textract, Google Document AI)
- Detect which type you have before choosing a parser — running OCR on a native PDF wastes cost and degrades quality
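Detection can be a cheap heuristic: attempt native extraction first, and fall back to OCR only when almost nothing comes out. A sketch; the per-page character threshold is an assumption to tune per corpus:

```python
def looks_scanned(page_texts, min_chars_per_page=25):
    """Heuristic: if native extraction yields almost no text, the PDF is
    probably scanned and needs OCR. `page_texts` holds one string (or None)
    per page, e.g. from pdfplumber's page.extract_text()."""
    if not page_texts:
        return True
    avg = sum(len((t or "").strip()) for t in page_texts) / len(page_texts)
    return avg < min_chars_per_page

# A scanned page yields None or an empty string from native extraction.
assert looks_scanned(["", None]) is True
```

Note the trap this avoids: some "native" PDFs still contain a few stray characters (page numbers, watermarks), which is why the check uses an average threshold rather than strict emptiness.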
For high-accuracy production systems (annual reports, legal docs), AWS Textract or Google Document AI provide table-aware extraction with layout understanding that open-source tools can't match.
5. Production Document Processing Pipeline
Document ingested
↓
Detect type (PDF/DOCX/HTML/image)
↓
Parse with type-appropriate extractor
↓
Layout analysis (detect tables, figures, headers)
↓
Per-element processing:
- Text blocks → chunking → embedding
- Tables → structured extraction → JSON + description → embedding
- Charts/images → vision LLM description → embedding
↓
Store chunks + metadata in vector DB
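The per-element fan-out in the diagram reduces to a dispatch on element type. A sketch with stubbed handlers; the field names are illustrative, but every branch ends in the same shape: an embeddable text string plus metadata:

```python
def process_text(el):
    return {"text": el["text"], "metadata": {"kind": "text"}}

def process_table(el):
    # Embed the natural-language description; keep structured data as metadata.
    return {"text": el["description"],
            "metadata": {"kind": "table", "table_json": el["json"]}}

def process_figure(el):
    # `vision_description` would come from the vision-LLM step above.
    return {"text": el["vision_description"], "metadata": {"kind": "figure"}}

HANDLERS = {"text": process_text, "table": process_table, "figure": process_figure}

def process_element(element):
    """Route a layout element to its type-specific processor."""
    return HANDLERS[element["type"]](element)

chunk = process_element({"type": "table",
                         "description": "Quarterly revenue by region.",
                         "json": "[]"})
```

Keeping the output shape uniform is what lets heterogeneous elements share one vector DB and one retrieval path downstream.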
Use unstructured.io as an open-source starting point — it handles multi-element classification and extraction across document types. For scale, run as a distributed worker pool with document-level idempotency (skip already-processed docs on re-runs).
Common Follow-ups
- "How do you handle a table that spans multiple pages?" Most parsers split at page boundaries. Detection requires checking whether the last row of page N and the first row of page N+1 share the same header structure. Tools like Camelot support page ranges via `--pages`; otherwise, post-process by merging fragments with identical headers.
- "How would you index and retrieve from a 500-page annual report?" Treat it as a hierarchical document: section → subsection → chunk. Preserve section metadata (e.g., "Management Discussion and Analysis > Revenue") on every chunk. For the financial statements section, use structured table extraction. Use metadata filters to restrict retrieval to relevant sections before running vector search.
- "When would you use a multimodal RAG approach end-to-end?" When documents are primarily visual (architecture diagrams, slide decks, engineering schematics) and text extraction would lose most of the information. Multimodal embedding + multimodal LLM retrieval is still maturing — use it when text-based extraction genuinely fails, not as a default.