Intermediate · 2 min read

Design a Document Q&A System for a Large Corpus

Design an AI system that answers natural language questions over a large collection of documents, with accurate citations and low hallucination rates.


Why This Is Asked

Document Q&A is one of the most practical and frequently built AI features. It tests your ability to apply RAG to a concrete product requirement, handle real-world complications, and think about UX.

Key Concepts to Cover

  • Ingestion pipeline — parsing, cleaning, chunking documents at scale
  • Multi-document retrieval — finding relevant content across many sources
  • Citation and sourcing — attributing answers to specific documents
  • Conflicting information — handling contradictions across documents
  • Query understanding — rewriting vague queries
  • Access control — users should only see answers drawn from documents they're authorized to access

How to Approach This

1. Clarify Requirements

  • What document types? (PDFs, Word, HTML, code?)
  • How large is the corpus? (100 docs vs. 10M?)
  • What latency is acceptable?
  • Do users need citations?
  • Any access control?

2. High-Level Architecture

Documents → Ingestion Pipeline → Vector Store
User Query → Query Processor → Retriever → Context Builder → LLM → Answer + Citations
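The query path above can be sketched as a few small composable functions. All names here are illustrative, and both the retriever and the LLM call are stubbed — a real system would query a vector store and send the built context to a model:

```python
# Minimal sketch of the query path: query -> retrieve -> build context -> answer.
# The retriever ranks chunks by naive term overlap instead of embeddings,
# and answer() returns the context instead of calling an LLM.

from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    page: int
    text: str

def retrieve(query: str, index: list[Chunk], k: int = 3) -> list[Chunk]:
    # Stub retriever: score each chunk by shared terms with the query.
    terms = set(query.lower().split())
    scored = sorted(index, key=lambda c: -len(terms & set(c.text.lower().split())))
    return scored[:k]

def build_context(chunks: list[Chunk]) -> str:
    # Number each chunk so the LLM can emit [1], [2] style citations.
    return "\n".join(f"[{i + 1}] ({c.doc_id}, p.{c.page}) {c.text}"
                     for i, c in enumerate(chunks))

def answer(query: str, index: list[Chunk]) -> tuple[str, list[Chunk]]:
    chunks = retrieve(query, index)
    context = build_context(chunks)
    # A real system would send `context` + `query` to an LLM here.
    return f"Answer based on:\n{context}", chunks
```

Returning the retrieved chunks alongside the answer is what makes citation rendering and validation possible downstream.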

3. Ingestion at Scale

For 10M+ documents:

  • Distributed ingestion workers
  • Document deduplication (hash-based)
  • Incremental updates (re-process only changed documents)
  • Metadata extraction: author, date, source, document type
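Hash-based dedup and incremental updates can share one mechanism: keep a content hash per document and re-process only when it changes. A minimal sketch (function names are hypothetical):

```python
# Skip unchanged documents on re-ingest by comparing content hashes.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_process(docs: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return doc ids that are new or whose content changed since last run.

    `docs` maps doc id -> raw text; `seen` maps doc id -> last-known hash
    (mutated in place to record the new state).
    """
    changed = []
    for doc_id, text in docs.items():
        h = content_hash(text)
        if seen.get(doc_id) != h:
            changed.append(doc_id)
            seen[doc_id] = h  # record for the next incremental run
    return changed
```

At 10M+ documents the `seen` map would live in a database rather than memory, but the comparison logic is the same.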

4. Citations

  • Chunk-level citation: tag each chunk with its source document and page
  • Post-hoc attribution: ask the LLM to annotate which sentence came from which source
  • Inline citation format: instruct the LLM to output [1], [2] references
  • Validate: verify that each citation actually supports the claim it's attached to
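The validation step can be sketched as a post-generation check: every `[n]` marker must point at a retrieved chunk, and the cited sentence should share enough terms with that chunk to plausibly be supported by it. This uses crude term overlap as a stand-in — a real validator would use an NLI model or an LLM judge:

```python
# Flag dangling or weakly supported citations in a generated answer.
import re

def validate_citations(answer: str, chunks: list[str], min_overlap: int = 2) -> list[str]:
    problems = []
    # Split the answer into sentences on terminal punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        for m in re.finditer(r"\[(\d+)\]", sentence):
            idx = int(m.group(1)) - 1
            if not 0 <= idx < len(chunks):
                problems.append(f"dangling citation [{idx + 1}]")
                continue
            claim = set(re.sub(r"\[\d+\]", "", sentence).lower().split())
            source = set(chunks[idx].lower().split())
            if len(claim & source) < min_overlap:
                problems.append(f"weak support for [{idx + 1}]")
    return problems
```

Anything this check flags can be dropped from the answer or routed to a stricter verification pass.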

5. Query Understanding

  • Query rewriting: expand vague queries with context
  • HyDE (Hypothetical Document Embeddings): instead of embedding the query directly, ask the LLM to generate a hypothetical answer, then embed that answer and use it for retrieval. Because the hypothetical answer mimics the vocabulary and style of relevant documents, its embedding lands closer to real answers than the question's does, which helps most when question and answer vocabulary differ significantly
  • Multi-query retrieval: generate multiple phrasings and union results
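Multi-query retrieval is the simplest of the three to sketch: run several phrasings against the retriever and union the results, deduplicating by chunk id. The `search` function below is a term-overlap stand-in for a real vector-store query, and in practice the phrasings would come from an LLM rewrite step:

```python
# Union results from multiple query phrasings, deduplicating by chunk id.

def search(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    # Stub retriever: rank chunk ids by term overlap with the query.
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda cid: -len(terms & set(corpus[cid].lower().split())))
    return ranked[:k]

def multi_query_retrieve(phrasings: list[str], corpus: dict[str, str], k: int = 2) -> list[str]:
    seen, merged = set(), []
    for q in phrasings:  # in practice, generated by an LLM rewrite step
        for cid in search(q, corpus, k):
            if cid not in seen:
                seen.add(cid)
                merged.append(cid)
    return merged
```

Preserving first-seen order is a simple merge policy; a production system would re-rank the union (e.g. with reciprocal rank fusion) before building context.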

Common Follow-ups

  1. "How would you handle a question that spans multiple documents?" Retrieving from multiple sources, map-reduce summarization, iterative retrieval.

  2. "How would you measure accuracy?" Golden Q&A dataset, exact match, semantic similarity, citation accuracy, human evaluation.

  3. "How do you handle documents that are updated or deleted?" Re-indexing on change events, chunk-level versioning, invalidating cached answers.
