Intermediate · 2 min read

Design a Document Q&A System for a Large Corpus

Design an AI system that answers natural language questions over a large collection of documents, with accurate citations and low hallucination rates.

Prep for the full interview loop

Know the concepts. Now prove it. Practice GenAI, Coding, System Design, and AI/ML Design interviews with an AI that tells you exactly where you fell short.

Start a mock interview

Why This Is Asked

Document Q&A is one of the most practical and frequently built AI features. It tests your ability to apply RAG to a concrete product requirement, handle real-world complications, and think about UX.

Key Concepts to Cover

  • Ingestion pipeline — parsing, cleaning, chunking documents at scale
  • Multi-document retrieval — finding relevant content across many sources
  • Citation and sourcing — attributing answers to specific documents
  • Conflicting information — handling contradictions across documents
  • Query understanding — rewriting vague queries
  • Access control — answers must draw only from documents the user is authorized to read
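Access control is simplest to enforce at retrieval time: filter candidate chunks by the user's permitted documents before any text reaches the LLM prompt. A minimal sketch, with illustrative data structures (the dict keys and function name are assumptions, not a real API):

```python
def authorized(chunks: list[dict], allowed_doc_ids: set[str]) -> list[dict]:
    """Keep only chunks whose source document the user may read.

    Filtering happens BEFORE context building, so unauthorized text
    can never leak into the prompt or the answer.
    """
    return [c for c in chunks if c["doc_id"] in allowed_doc_ids]


candidates = [
    {"doc_id": "hr-policy", "text": "Vacation accrues monthly."},
    {"doc_id": "exec-comp", "text": "Executive bonus schedule."},
]
visible = authorized(candidates, allowed_doc_ids={"hr-policy"})
```

In practice the same filter is usually pushed down into the vector store as a metadata filter, so unauthorized chunks are never even retrieved.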

How to Approach This

1. Clarify Requirements

  • What document types? (PDFs, Word, HTML, code?)
  • How large is the corpus? (100 docs vs. 10M?)
  • What latency is acceptable?
  • Do users need citations?
  • Any access control?

2. High-Level Architecture

Documents → Ingestion Pipeline → Vector Store
User Query → Query Processor → Retriever → Context Builder → LLM → Answer + Citations
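The query path above can be sketched end to end. This is a toy: the retriever uses keyword overlap where a real system would query a vector store, and the final LLM call is stubbed out. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    page: int
    text: str


def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Toy retriever: rank chunks by keyword overlap with the query."""
    q_terms = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: -len(q_terms & set(c.text.lower().split())))
    return scored[:k]


def build_context(chunks: list[Chunk]) -> str:
    """Number each chunk so the model can emit [n]-style citations."""
    return "\n".join(f"[{i + 1}] ({c.doc_id}, p.{c.page}) {c.text}"
                     for i, c in enumerate(chunks))


def answer(query: str, chunks: list[Chunk]) -> str:
    context = build_context(retrieve(query, chunks))
    prompt = (f"Answer using ONLY the sources below; cite as [n].\n\n"
              f"{context}\n\nQuestion: {query}")
    return prompt  # in production: send this prompt to the LLM
```

The key structural point: the context builder carries source metadata (doc ID, page) alongside the text, which is what makes citations possible downstream.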

3. Ingestion at Scale

For 10M+ documents:

  • Distributed ingestion workers
  • Document deduplication (hash-based)
  • Incremental updates (re-process only changed documents)
  • Metadata extraction: author, date, source, document type
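Hash-based dedup and incremental updates can share one mechanism: record a content hash per document at ingestion time, and on the next crawl re-process only documents whose hash changed. A sketch, where the `index` dict stands in for persistent metadata storage:

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingestion(docs: dict[str, str],
                   index: dict[str, str]) -> dict[str, list[str]]:
    """Decide, per document, whether to ingest fresh, re-process, or skip.

    docs:  doc_id -> raw text from the current crawl
    index: doc_id -> content hash recorded at last ingestion
    """
    plan: dict[str, list[str]] = {"new": [], "changed": [], "unchanged": []}
    for doc_id, text in docs.items():
        if doc_id not in index:
            plan["new"].append(doc_id)
        elif index[doc_id] != content_hash(text):
            plan["changed"].append(doc_id)    # re-chunk and re-embed only these
        else:
            plan["unchanged"].append(doc_id)  # skip entirely
    return plan
```

At 10M+ documents this planning step is what keeps re-ingestion cost proportional to churn rather than corpus size; distributed workers then consume only the `new` and `changed` lists.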

4. Citations

  • Chunk-level citation: tag each chunk with source document and page
  • Post-hoc attribution: ask LLM to annotate which sentence came from which source
  • Inline citation format: instruct LLM to output [1], [2] references
  • Validate: verify each citation actually supports the claim
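The validation step above can start cheap: check that every `[n]` in the answer points at a real source, and that the cited source shares some vocabulary with the sentence citing it. This word-overlap heuristic is a placeholder; a production validator would use an NLI/entailment model for the "supports the claim" check.

```python
import re


def validate_citations(answer: str, sources: list[str],
                       min_overlap: int = 2) -> list[str]:
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        for ref in re.findall(r"\[(\d+)\]", sentence):
            idx = int(ref) - 1
            if not 0 <= idx < len(sources):
                problems.append(f"[{ref}] cites a nonexistent source")
                continue
            # Crude support check: shared vocabulary between claim and source.
            claim_terms = set(re.findall(r"\w+", sentence.lower())) - {ref}
            src_terms = set(re.findall(r"\w+", sources[idx].lower()))
            if len(claim_terms & src_terms) < min_overlap:
                problems.append(f"[{ref}] may not support: {sentence!r}")
    return problems
```

Failing answers can be regenerated or have unsupported sentences dropped before they reach the user, which directly targets the low-hallucination requirement.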

5. Query Understanding

  • Query rewriting: expand vague queries with context
  • HyDE (Hypothetical Document Embeddings): instead of embedding the query directly, ask the LLM to generate a hypothetical answer to it, embed that answer, and retrieve with it. Because the hypothetical answer mimics the vocabulary and style of relevant documents, its embedding sits closer to real answers in embedding space than the question's does. This is especially useful when question and answer vocabulary differ significantly
  • Multi-query retrieval: generate multiple phrasings and union results
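Multi-query retrieval is the simplest of the three to sketch. Here the alternate phrasings are hard-coded (in practice an LLM generates them), and the retriever is a keyword-overlap stand-in for a vector-store search:

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by keyword overlap with the query."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))[:k]


def multi_query_retrieve(phrasings: list[str], corpus: list[str],
                         k: int = 2) -> list[str]:
    """Retrieve for each phrasing and union results, deduplicated,
    preserving first-seen order (dicts keep insertion order)."""
    seen: dict[str, None] = {}
    for q in phrasings:
        for doc in retrieve(q, corpus, k):
            seen.setdefault(doc, None)
    return list(seen)
```

The union matters because different phrasings surface different documents; deduplication keeps the context window from being wasted on repeats.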

Common Follow-ups

  1. "How would you handle a question that spans multiple documents?" Retrieving from multiple sources, map-reduce summarization, iterative retrieval.

  2. "How would you measure accuracy?" Golden Q&A dataset, exact match, semantic similarity, citation accuracy, human evaluation.

  3. "How do you handle documents that are updated or deleted?" Re-indexing on change events, chunk-level versioning, invalidating cached answers.
