
RAG: From Basics to Production Systems

Retrieval-Augmented Generation is the most common pattern in production AI. This guide covers the full pipeline — from document ingestion to generation — and teaches you how to explain it clearly in a technical interview.

What Is RAG, and Why Do Interviewers Ask About It?

RAG augments a language model's input with documents retrieved from an external knowledge base. Instead of relying on what the model memorized during training, you fetch relevant context at query time.

Nearly every company building production AI uses RAG. It solves the problem of LLMs not knowing your proprietary data — without the cost of fine-tuning.

Interview Tip

RAG questions reveal how you think about systems. The pipeline has 6+ stages, each with meaningful tradeoffs. Candidates who walk through the full pipeline and reason about where things break stand out immediately.

The Two Pipelines

Every RAG system has two distinct pipelines with different engineering challenges:

Ingestion Pipeline (offline):
Documents → Preprocessing → Chunking → Embedding → Vector DB

Query Pipeline (real-time):
User Query → Embedding → Vector Search → Re-ranking → Context Assembly → LLM → Response
Why this matters

Ingestion is a throughput problem (batch processing, deduplication, freshness). Query is a latency problem (search speed, re-ranking cost, generation time). Confusing the two leads to bad architecture decisions.

Core Concepts

Chunking
How you split documents into retrievable units. Fixed-size is simple but cuts mid-sentence. Semantic chunking preserves meaning but varies in length. Recursive splitting is the practical middle ground.
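To make the recursive idea concrete, here is a minimal sketch of a recursive splitter: it tries the coarsest separator first (paragraphs, then lines, then sentences, then words) and falls back to a hard character cut only when nothing fits. The function name, separators, and character-based budget are illustrative choices, not a reference to any particular library.

```python
def recursive_split(text, max_len=500, seps=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters,
    preferring the coarsest separator that still produces a split."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for sep in seps:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_len:
                    current = candidate
                elif len(part) > max_len:
                    # This piece alone is too big: recurse with finer separators.
                    if current:
                        chunks.append(current)
                    chunks.extend(recursive_split(part, max_len, seps))
                    current = ""
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator found anywhere: hard cut as a last resort.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Production splitters also track chunk overlap and token counts, but the fallback ordering above is the core of the technique.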
Embeddings
Dense vectors that capture semantic meaning. The same model must be used at ingestion and query time. Larger models = better quality but more cost and storage.
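A toy sketch of why the shared embedding space matters: retrieval is just nearest-neighbor search by cosine similarity, so the query vector is only comparable to document vectors produced by the same model. The three-dimensional vectors below are stand-ins for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy vectors standing in for embeddings from the *same* model.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {"doc_a": [0.8, 0.2, 0.1], "doc_b": [0.0, 0.1, 0.9]}

# Rank documents by similarity to the query.
ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
```

A vector database performs the same comparison, just with approximate nearest-neighbor indexes instead of a full scan.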
Hybrid Search
Combines vector similarity (semantic) with BM25 (keyword matching) using reciprocal rank fusion. Catches exact matches that pure vector search misses.
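Reciprocal rank fusion itself is only a few lines. A sketch, assuming each retriever returns a ranked list of document IDs and using the conventional smoothing constant k=60:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked highly by several retrievers rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d5"]   # semantic ranking
bm25_hits = ["d1", "d2", "d3"]     # keyword ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Note that fusion works on ranks, not raw scores, which sidesteps the problem of BM25 and cosine similarity living on incomparable scales.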
Re-ranking
A cross-encoder scores query + document pairs together. Much more accurate than embedding similarity, but O(n) — only applied to a shortlist of 20-50 results.
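The shortlist-then-rescore shape looks roughly like this. The `overlap_score` function is a deliberately crude stand-in for a real cross-encoder call (for example, a sentence-transformers CrossEncoder scoring query-document pairs); only the control flow is the point.

```python
def rerank(query, shortlist, score_fn, top_k=5):
    """Re-score a retrieval shortlist with an expensive pairwise scorer
    and keep the best top_k. score_fn stands in for a cross-encoder."""
    scored = [(score_fn(query, doc), doc) for doc in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def overlap_score(query, doc):
    """Toy scorer: shared-token count. Illustration only, not a model."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)
```

Because the scorer runs once per pair, cost grows linearly with the shortlist, which is why it is applied to 20-50 candidates rather than the whole corpus.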
Context Assembly
Ordering retrieved chunks, deduplicating overlaps, and fitting within the context window. The prompt structure grounds the model to answer only from provided context.
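A minimal assembly sketch, assuming chunks arrive already ranked and using a rough characters-divided-by-four token estimate in place of a real tokenizer:

```python
def assemble_context(chunks, token_budget=2500, est_tokens=lambda t: len(t) // 4):
    """Deduplicate chunks, keep retrieval order, stop at the token budget,
    and wrap the result in a grounding instruction."""
    seen, kept, used = set(), [], 0
    for chunk in chunks:
        key = chunk.strip()
        if key in seen:
            continue  # drop exact duplicates from overlapping retrievals
        cost = est_tokens(chunk)
        if used + cost > token_budget:
            break
        seen.add(key)
        kept.append(chunk)
        used += cost
    context = "\n\n---\n\n".join(kept)
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        "context, say you don't know.\n\nContext:\n" + context
    )
```

The grounding instruction in the final prompt is what pushes the model to refuse rather than hallucinate when retrieval misses.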
Evaluation
Two layers: retrieval metrics (precision@k, recall@k, MRR) and generation metrics (faithfulness, relevance). Automated frameworks like RAGAS help, but human eval remains critical.
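The three retrieval metrics named above are simple enough to implement directly, which is worth being able to do on a whiteboard:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs found in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank over (retrieved_list, relevant_set) pairs:
    averages 1/rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

Generation-side metrics like faithfulness cannot be computed this way; they need an LLM judge or human labels, which is where frameworks like RAGAS come in.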

Key Tradeoffs

Decisions that define your RAG system:

Decision             | Tradeoff
---------------------|------------------------------------------------------------------
Chunk size           | Smaller = more precise retrieval; larger = more context per chunk
Retrieved chunks (k) | More = better recall, but higher cost and latency
Embedding model size | Larger = better quality, but higher latency and storage cost
Re-ranking           | Better precision, but adds 100-200 ms latency
Chunk overlap        | Prevents info loss at boundaries, but increases storage by 10-20%

How to Explain This in an Interview

1. Open with the two-pipeline structure
"I would separate this into an offline ingestion pipeline and a real-time query pipeline, because the engineering requirements are fundamentally different."

2. Walk through each stage sequentially
Name the key tradeoff at each stage and state your default choice with a reason. Do not just list components; explain why you chose them.

3. Discuss failure modes
What happens when retrieval returns irrelevant chunks? When chunks contradict each other? When the corpus updates frequently?

4. Quantify when possible
"At 10K queries/day with 500-token chunks and k=5, we would retrieve about 2,500 tokens per query, keeping us well within the context window."
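The back-of-envelope numbers in step 4 can be written out explicitly; the figures below simply restate the example scenario, not measurements from any real system:

```python
# Example scenario from step 4: 10K queries/day, 500-token chunks, k=5.
chunks_per_query = 5
tokens_per_chunk = 500
queries_per_day = 10_000

context_tokens_per_query = chunks_per_query * tokens_per_chunk   # 2,500
daily_context_tokens = context_tokens_per_query * queries_per_day  # 25 million
```

Being able to produce the daily total matters too: context tokens dominate RAG inference cost, so k and chunk size are direct cost levers.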
Common Mistake

The most common mistake: describing RAG as "just vector search + LLM." Interviewers want to hear about chunking decisions, hybrid retrieval, re-ranking, and evaluation. The second mistake is never discussing failure modes.

What to Practice Next

Browse all RAG & Retrieval interview questions for hands-on practice.

Next module: LLM Evaluation: Metrics, Evals, and Production Monitoring
