Why This Is Asked
This is a practical debugging question that tests whether you can move beyond theory. Most candidates can describe how RAG works; fewer can diagnose a broken one. Interviewers want to see a systematic, hypothesis-driven approach — not a list of random techniques.
Key Concepts to Cover
- Separating retrieval failures from generation failures
- Chunking quality issues — size, overlap, structure
- Embedding model mismatch — domain, asymmetry, model quality
- Query-side problems — short/ambiguous queries, vocabulary gap
- Retrieval strategy gaps — pure dense search limitations
- Evaluation metrics — how to measure retrieval quality objectively
How to Approach This
1. Isolate: Is It Retrieval or Generation?
First, establish what's actually broken. Run retrieval in isolation:
- For a set of test queries, log the top-k retrieved chunks
- Manually evaluate: is the relevant information present in the retrieved chunks?
- If yes: the problem is in generation (prompting, context assembly, LLM)
- If no: the problem is in retrieval — proceed below
Most candidates skip this step; performing it will immediately impress an interviewer.
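The isolation step can be sketched as a small harness. This is a minimal sketch: `retrieve` is a hypothetical stand-in for your real vector-store query (here a naive keyword-overlap ranker, for illustration only), and the corpus and phrases are made up.

```python
# Run retrieval in isolation and check whether the known-relevant
# text appears anywhere in the top-k chunks.

def retrieve(query, k=5):
    # Placeholder: substitute your actual vector-store query here.
    corpus = [
        "To cancel a subscription, open Billing and choose Terminate.",
        "Our refund policy covers purchases within 30 days.",
        "Two-factor authentication can be enabled under Security.",
    ]
    # Naive keyword-overlap ranking, for illustration only.
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def retrieval_hit(query, expected_phrase, k=5):
    """True if the relevant information is present in the retrieved chunks."""
    chunks = retrieve(query, k)
    return any(expected_phrase.lower() in c.lower() for c in chunks)

# If this is True but the final answer is still wrong,
# the bug is in generation, not retrieval.
print(retrieval_hit("how to cancel subscription", "Terminate"))  # → True
```

Logging the same boolean across a set of test queries gives you the yes/no split described above.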
2. Measure Before You Fix
Build or use an evaluation set:
- 50-100 queries with known relevant documents (golden set)
- Measure recall@k (is the relevant chunk in the top k results?) and MRR (mean reciprocal rank)
- Establish a baseline before making any changes — otherwise you won't know if your fix worked
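Both metrics are a few lines of code. A sketch, assuming the golden set maps each query to the id of its one known-relevant chunk and `results` holds the retriever's ranked chunk ids per query (the ids and queries below are made up):

```python
def recall_at_k(golden, results, k):
    """Fraction of queries whose relevant chunk appears in the top k."""
    hits = sum(1 for q, rel in golden.items() if rel in results[q][:k])
    return hits / len(golden)

def mrr(golden, results):
    """Mean reciprocal rank of the relevant chunk (rank is 1-based)."""
    total = 0.0
    for q, rel in golden.items():
        ranked = results[q]
        if rel in ranked:
            total += 1.0 / (ranked.index(rel) + 1)
    return total / len(golden)

golden = {"q1": "c7", "q2": "c3"}
results = {"q1": ["c1", "c7", "c9"], "q2": ["c3", "c2", "c8"]}
print(recall_at_k(golden, results, k=2))  # → 1.0
print(mrr(golden, results))               # → 0.75 = (1/2 + 1/1) / 2
```

Run these before and after every change; a fix that doesn't move recall@k or MRR on the golden set is noise.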
3. Diagnose: The Four Root Causes
A. Chunking problems
- Chunks are too large → relevant sentence is diluted by irrelevant context → embedding doesn't match query
- Chunks are too small → answer spans chunk boundaries → never fully retrieved
- No overlap → content at chunk edges is hard to retrieve
- Fix: experiment with smaller chunks + overlap; use semantic chunking to split on topic boundaries
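The overlap fix above can be sketched as a fixed-size chunker. This counts words for simplicity; production chunkers typically count tokens, and the sizes here are illustrative:

```python
def chunk(text, size=100, overlap=20):
    """Split text into chunks of `size` words, each sharing `overlap`
    words with the next chunk."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(250))
pieces = chunk(doc, size=100, overlap=20)
# Each chunk shares its last 20 words with the next chunk's first 20,
# so content near a boundary still appears whole in at least one chunk.
print(len(pieces))  # → 3
```

Semantic chunking replaces the fixed `size` with topic-boundary detection, but the overlap idea carries over.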
B. Embedding model problems
- Generic model doesn't understand domain vocabulary (legal, medical, financial, code)
- Query-document length asymmetry — model trained on symmetric pairs but your queries are short and docs are long
- Fix: evaluate domain-specific models; try msmarco-family models for asymmetric retrieval; consider fine-tuning on domain pairs
C. Query-side problems
- Queries are short and ambiguous — poor embedding quality
- Vocabulary mismatch — user says "how to cancel" but docs say "subscription termination"
- Fix: query expansion (add synonyms/context), HyDE (embed a hypothetical answer instead of the query), multi-query retrieval (generate 3-5 query variants and union results)
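Multi-query retrieval can be sketched as follows. In practice an LLM generates the variants; here they are hardcoded, and `retrieve` is a hypothetical stub returning chunk ids. Results are unioned, keeping each chunk's best rank across variants:

```python
def retrieve(query, k=3):
    # Stand-in for a real vector search over an index; returns chunk ids.
    fake_index = {
        "how to cancel": ["c2", "c5", "c9"],
        "subscription termination": ["c5", "c1", "c4"],
        "end my plan": ["c5", "c2", "c7"],
    }
    return fake_index.get(query, [])

def multi_query(variants, k=3):
    """Union results across query variants, ranking each chunk
    by the best rank it achieved in any variant."""
    best_rank = {}
    for q in variants:
        for rank, cid in enumerate(retrieve(q, k)):
            best_rank[cid] = min(best_rank.get(cid, rank), rank)
    return sorted(best_rank, key=best_rank.get)[:k]

variants = ["how to cancel", "subscription termination", "end my plan"]
print(multi_query(variants))  # → ['c2', 'c5', 'c1']
```

Note how "subscription termination" surfaces chunks the original phrasing would miss — exactly the vocabulary-gap case above.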
D. Retrieval strategy limitations
- Pure dense search misses exact keyword matches (product names, codes, IDs)
- Fix: add BM25 sparse retrieval and combine with reciprocal rank fusion (hybrid search)
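Reciprocal rank fusion is simple enough to sketch directly. The standard formula scores each document as the sum over ranked lists of 1/(K + rank), with K conventionally set to 60 and rank 1-based; the chunk ids below are made up:

```python
def rrf(rankings, K=60):
    """Fuse multiple ranked lists of doc ids via reciprocal rank fusion."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (K + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["c3", "c1", "c8"]   # embedding-similarity order
sparse = ["c3", "c8", "c2"]   # BM25 order
print(rrf([dense, sparse]))   # → ['c3', 'c8', 'c1', 'c2']
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem of calibrating cosine similarities against BM25 scores.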
4. Systematic Fix Order
- Add evaluation metrics (if not present)
- Fix chunking — usually highest ROI, easiest to change
- Try hybrid search — adds sparse retrieval without changing embedding model
- Add query rewriting/expansion
- Benchmark and switch embedding models if domain mismatch is confirmed
- Add re-ranking (cross-encoder) on top-k results — largest accuracy boost, adds latency
5. Re-ranking as Final Layer
Cross-encoders (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) take the query + each candidate chunk as a joint input and produce a relevance score. They're much more accurate than embedding similarity because they can model fine-grained query-document interaction — but they're O(k) at inference time. Apply only to the top 20-50 ANN results, not the full corpus.
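The re-ranking pattern in sketch form. A real system would score pairs with a cross-encoder model (e.g. cross-encoder/ms-marco-MiniLM-L-6-v2); here `score_pair` is a hypothetical keyword-overlap stand-in so the control flow is visible without loading a model:

```python
def score_pair(query, chunk):
    # Placeholder relevance score. A real cross-encoder jointly encodes
    # (query, chunk) and outputs a learned relevance score instead.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def rerank(query, candidates, top_n=3):
    """Apply the expensive scorer only to the ANN shortlist (top 20-50),
    never the full corpus: cost is O(k) model forward passes."""
    scored = [(score_pair(query, c), c) for c in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:top_n]]

shortlist = ["refund policy details", "cancel your subscription plan",
             "enable two-factor auth"]
print(rerank("how do I cancel my subscription", shortlist, top_n=1))
```

The two-stage design is the point: cheap bi-encoder recall over millions of chunks, then expensive cross-encoder precision over dozens.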
Common Follow-ups
- "How do you build a golden evaluation set when you have no labeled data?" Use LLM-assisted labeling: for each chunk, ask the LLM to generate synthetic questions it would answer, then use those as positive pairs. Tools like Ragas and DeepEval automate this.
- "What if the retrieval is good but the final answer is still wrong?" The problem is in context assembly or generation: too many retrieved chunks confusing the LLM (context stuffing), contradicting chunks, or a prompt that doesn't instruct the LLM to ground its answer. Fix: reduce k, add explicit grounding instructions, use structured prompts that separate context from instructions.
- "How do you handle queries that require information from multiple chunks?" See multi-hop query handling — iterative retrieval, query decomposition, or a map-reduce approach where sub-queries are answered independently and then merged.