
A Client's RAG System Has Poor Retrieval Accuracy — How Do You Fix It?

A RAG-based system isn't returning accurate results. Walk through a systematic process to diagnose the root cause and improve retrieval quality.


Why This Is Asked

This is a practical debugging question that tests whether you can move beyond theory. Most candidates can describe how RAG works; fewer can diagnose a broken one. Interviewers want to see a systematic, hypothesis-driven approach — not a list of random techniques.

Key Concepts to Cover

  • Separating retrieval failures from generation failures
  • Chunking quality issues — size, overlap, structure
  • Embedding model mismatch — domain, asymmetry, model quality
  • Query-side problems — short/ambiguous queries, vocabulary gap
  • Retrieval strategy gaps — pure dense search limitations
  • Evaluation metrics — how to measure retrieval quality objectively

How to Approach This

1. Isolate: Is It Retrieval or Generation?

First, establish what's actually broken. Run retrieval in isolation:

  • For a set of test queries, log the top-k retrieved chunks
  • Manually evaluate: is the relevant information present in the retrieved chunks?
  • If yes: the problem is in generation (prompting, context assembly, LLM)
  • If no: the problem is in retrieval — proceed below
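The isolation step above can be sketched as a small harness. This is a minimal illustration, not a definitive implementation: `retrieve` is a toy keyword-overlap stand-in for your real retriever (vector-store query, API call, etc.), and the corpus and gold snippet are made up for the example.

```python
# Step 1 sketch: run retrieval in isolation and check whether the
# known-relevant text appears anywhere in the top-k chunks.

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Toy keyword-overlap retriever; replace with your real one."""
    def score(chunk: str) -> int:
        return len(set(query.lower().split()) & set(chunk.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def retrieval_contains_answer(query: str, gold_snippet: str,
                              corpus: list[str], k: int = 3) -> bool:
    """True if any retrieved chunk contains the known-relevant snippet."""
    return any(gold_snippet.lower() in chunk.lower()
               for chunk in retrieve(query, corpus, k))

corpus = [
    "To cancel your plan, open Settings and choose Cancel subscription.",
    "Our refund policy covers purchases made in the last 30 days.",
    "Contact support via the in-app chat for billing questions.",
]
print(retrieval_contains_answer(
    "how do I cancel my plan", "cancel subscription", corpus))  # True
```

If this returns True for most test queries but final answers are still wrong, the fault is downstream of retrieval.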

Most candidates skip this isolation step; doing it explicitly signals a systematic approach and will immediately impress an interviewer.

2. Measure Before You Fix

Build or use an evaluation set:

  • 50-100 queries with known relevant documents (golden set)
  • Measure recall@k (is the relevant chunk in the top k results?) and MRR (mean reciprocal rank)
  • Establish a baseline before making any changes — otherwise you won't know if your fix worked
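Both metrics are straightforward to compute by hand. The sketch below assumes each golden-set query has a single known-relevant chunk id and that `results` holds the ranked ids your retriever returned; the ids and rankings here are invented for illustration.

```python
# Step 2 sketch: recall@k and MRR over a golden evaluation set.

def recall_at_k(golden: dict[str, str],
                results: dict[str, list[str]], k: int) -> float:
    """Fraction of queries whose relevant chunk appears in the top k."""
    hits = sum(1 for q, gold_id in golden.items()
               if gold_id in results[q][:k])
    return hits / len(golden)

def mrr(golden: dict[str, str], results: dict[str, list[str]]) -> float:
    """Mean reciprocal rank of the relevant chunk (0 if not retrieved)."""
    total = 0.0
    for q, gold_id in golden.items():
        ranked = results[q]
        if gold_id in ranked:
            total += 1.0 / (ranked.index(gold_id) + 1)
    return total / len(golden)

golden = {"q1": "c7", "q2": "c2", "q3": "c9"}
results = {"q1": ["c7", "c1"], "q2": ["c4", "c2"], "q3": ["c1", "c3"]}
print(recall_at_k(golden, results, k=2))  # 2/3 of queries hit
print(mrr(golden, results))               # (1 + 0.5 + 0) / 3 = 0.5
```

Run these before and after every change; a fix that doesn't move recall@k or MRR on the golden set didn't work.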

3. Diagnose: The Four Root Causes

A. Chunking problems

  • Chunks are too large → relevant sentence is diluted by irrelevant context → embedding doesn't match query
  • Chunks are too small → answer spans chunk boundaries → never fully retrieved
  • No overlap → content at chunk edges is hard to retrieve
  • Fix: experiment with smaller chunks + overlap; use semantic chunking to split on topic boundaries
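A fixed-size chunker with overlap can be sketched in a few lines. This version splits on words purely for readability; production code typically splits on tokens or sentence boundaries, and the sizes here are tiny so the overlap is visible.

```python
# Chunking sketch: fixed-size windows with overlap, so content near
# chunk edges appears in two chunks instead of being split across one.

def chunk_text(text: str, chunk_size: int = 5, overlap: int = 2) -> list[str]:
    words = text.split()
    step = chunk_size - overlap          # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                        # final window reached the end
    return chunks

text = "one two three four five six seven eight nine"
for c in chunk_text(text):
    print(c)
```

Each chunk shares its last two words with the next chunk's first two, which is exactly what protects boundary-spanning answers.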

B. Embedding model problems

  • Generic model doesn't understand domain vocabulary (legal, medical, financial, code)
  • Query-document length asymmetry — model trained on symmetric pairs but your queries are short and docs are long
  • Fix: evaluate domain-specific models; try msmarco-family for asymmetric retrieval; consider fine-tuning on domain pairs

C. Query-side problems

  • Queries are short and ambiguous — poor embedding quality
  • Vocabulary mismatch — user says "how to cancel" but docs say "subscription termination"
  • Fix: query expansion (add synonyms/context), HyDE (embed a hypothetical answer instead of the query), multi-query retrieval (generate 3-5 query variants and union results)
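Multi-query retrieval is simple to sketch. Here the query variants are hard-coded; in practice an LLM generates them. `retrieve` is again a toy overlap scorer standing in for your vector search, and the corpus is invented to show how a rephrased variant bridges the vocabulary gap.

```python
# Multi-query sketch: retrieve per variant, then union the results
# while preserving first-seen order and deduplicating.

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Toy overlap retriever standing in for real vector search."""
    def score(chunk: str) -> int:
        return len(set(query.lower().split()) & set(chunk.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def multi_query_retrieve(variants: list[str], corpus: list[str],
                         k: int = 1) -> list[str]:
    seen, merged = set(), []
    for v in variants:
        for chunk in retrieve(v, corpus, k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

corpus = [
    "subscription termination steps",
    "how to reset a password",
    "billing cycle overview",
]
variants = ["how to cancel", "subscription termination"]
print(multi_query_retrieve(variants, corpus))
```

The original phrasing alone misses the "subscription termination" document; the LLM-generated variant recovers it, which is the whole point of the technique.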

D. Retrieval strategy limitations

  • Pure dense search misses exact keyword matches (product names, codes, IDs)
  • Fix: add BM25 sparse retrieval and combine with reciprocal rank fusion (hybrid search)
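Reciprocal rank fusion itself is only a few lines: each document scores the sum of 1/(K + rank) across the ranked lists it appears in, with K = 60 as the conventional constant. The rankings below are invented to show how a document ranked well in both lists beats one ranked first in only one.

```python
# RRF sketch: fuse a dense (semantic) ranking and a sparse (BM25)
# ranking into a single hybrid ranking.

def rrf(rankings: list[list[str]], K: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (K + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d2"]   # semantic similarity order
sparse = ["d1", "d4", "d3"]   # BM25 keyword order
print(rrf([dense, sparse]))   # d1 wins: strong in both lists
```

Because RRF works on ranks rather than raw scores, you never have to normalize BM25 scores against cosine similarities.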

4. Systematic Fix Order

  1. Add evaluation metrics (if not present)
  2. Fix chunking — usually highest ROI, easiest to change
  3. Try hybrid search — adds sparse retrieval without changing embedding model
  4. Add query rewriting/expansion
  5. Benchmark and switch embedding models if domain mismatch is confirmed
  6. Add re-ranking (cross-encoder) on top-k results — largest accuracy boost, adds latency
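The ordering above is really a measurement loop: apply one change at a time and keep it only if the golden-set metric improves. The sketch below stubs `evaluate` with made-up recall gains purely to make the loop runnable; in practice it would re-run your golden set through the pipeline.

```python
# Fix-loop sketch: try candidate fixes one at a time against a measured
# baseline; keep only the ones that improve recall on the golden set.

def evaluate(config: dict) -> float:
    """Stub with invented gains; really: run golden set, return recall@k."""
    gains = {"smaller_chunks": 0.10, "hybrid_search": 0.07, "hyde": -0.02}
    return 0.55 + sum(gains[f] for f in config["fixes"])

config = {"fixes": []}
baseline = evaluate(config)
for fix in ["smaller_chunks", "hybrid_search", "hyde"]:
    candidate = {"fixes": config["fixes"] + [fix]}
    score = evaluate(candidate)
    if score > baseline:              # keep only measurable improvements
        config, baseline = candidate, score
print(config["fixes"], round(baseline, 2))
```

In this toy run, HyDE is rejected because it regresses the metric, which is exactly the discipline the one-change-at-a-time process enforces.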

5. Re-ranking as Final Layer

Cross-encoders (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) take the query + each candidate chunk as a joint input and produce a relevance score. They're much more accurate than embedding similarity because they can model fine-grained query-document interaction — but they're O(k) at inference time. Apply only to the top 20-50 ANN results, not the full corpus.
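The re-ranking pipeline shape can be sketched as below. `pair_score` here is a trivial lexical stand-in so the example runs without downloading a model; in production you would swap it for a real cross-encoder (e.g. `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` from sentence-transformers, scoring query-chunk pairs jointly).

```python
# Re-rank sketch: take a wide candidate set from ANN search, score each
# (query, chunk) pair jointly, and return the top few by that score.

def pair_score(query: str, chunk: str) -> float:
    """Lexical stand-in for a cross-encoder's joint relevance score."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def rerank(query: str, candidates: list[str], top_n: int = 2) -> list[str]:
    return sorted(candidates,
                  key=lambda ch: pair_score(query, ch),
                  reverse=True)[:top_n]

candidates = [          # in practice: the top 20-50 ANN results
    "reset your password in settings",
    "cancel your subscription in settings",
    "contact billing support",
]
print(rerank("cancel my subscription", candidates, top_n=1))
```

The O(k) cost lives entirely inside `rerank`: one joint scoring pass per candidate, which is why you feed it dozens of chunks, never the whole corpus.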

Common Follow-ups

  1. "How do you build a golden evaluation set when you have no labeled data?" Use LLM-assisted labeling: for each chunk, ask the LLM to generate synthetic questions it would answer, then use those as positive pairs. Tools like Ragas and DeepEval automate this.

  2. "What if the retrieval is good but the final answer is still wrong?" The problem is in context assembly or generation: too many retrieved chunks confusing the LLM (context stuffing), contradicting chunks, or a prompt that doesn't instruct the LLM to ground its answer. Fix: reduce k, add explicit grounding instructions, use structured prompts that separate context from instructions.

  3. "How do you handle queries that require information from multiple chunks?" See multi-hop query handling — iterative retrieval, query decomposition, or a map-reduce approach where sub-queries are answered independently and then merged.
