Intermediate · 5 min read

How Do You Evaluate a RAG System End-to-End?

RAG evaluation is distinct from general LLM evaluation — it requires measuring both retrieval quality and generation quality, independently and together. This article walks through the key metrics and frameworks.


Why This Is Asked

Most candidates can describe how to build a RAG system. Fewer know how to tell if it's working. RAG evaluation is more complex than general LLM eval because you have two failure modes — bad retrieval and bad generation — that produce similar symptoms (wrong answers) but require different fixes.

Key Concepts to Cover

  • The two-stage evaluation model — retrieval eval vs. generation eval
  • Retrieval metrics — precision@k, recall@k, MRR, NDCG
  • Generation metrics — faithfulness, answer relevancy, context utilization
  • RAGAS framework — the standard open-source RAG eval library
  • Building an evaluation set — with and without labeled data
  • Online vs. offline eval — production monitoring vs. pre-deployment testing

How to Approach This

1. The Two-Stage Model

RAG failures have two root causes that must be evaluated independently:

Retrieval failure: the relevant information is not in the retrieved chunks → the LLM can't answer correctly regardless of how good it is

Generation failure: the relevant chunks were retrieved but the LLM ignored them, misinterpreted them, or hallucinated instead

Separate evaluation allows targeted fixes. If retrieval metrics are good but the final answer is wrong, the problem is in prompting or generation. If retrieval metrics are poor, fix retrieval first.

2. Retrieval Metrics

Require a golden dataset: a set of (query, relevant_document_ids) pairs.

Precision@k: of the top k retrieved chunks, what fraction are relevant?

  • High precision = few irrelevant chunks in context (less noise for LLM)
  • Precision@5 = relevant_in_top_5 / 5

Recall@k: of all relevant chunks for this query, what fraction appear in the top k?

  • High recall = important information is present in context
  • Recall@5 = relevant_in_top_5 / total_relevant

MRR (Mean Reciprocal Rank): average of 1/rank of the first relevant result

  • Measures how high the first relevant chunk ranks
  • Better for use cases where one right answer is enough (Q&A, not research)

NDCG (Normalized Discounted Cumulative Gain): weighted by position and relevance grade

  • Better for multi-graded relevance (very relevant vs. somewhat relevant)

Practical targets:

  • Recall@5 > 0.85 for production RAG (you need the relevant chunk to be in context)
  • Precision@5 > 0.5 is reasonable (some noise is acceptable)
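Given a golden dataset, these metrics reduce to a few lines each. A minimal sketch in pure Python (the doc IDs and relevance grades below are illustrative, not from a real corpus):

```python
from math import log2

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunk IDs that appear in the top k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(results, relevant_sets):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for retrieved, relevant in zip(results, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(results)

def ndcg_at_k(retrieved, grades, k):
    """NDCG@k for graded relevance (grades: dict of doc_id -> grade)."""
    dcg = sum(grades.get(doc, 0) / log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# One query: the system retrieved five chunks; the golden set marks d1, d2 relevant.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 5))  # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2/2 = 1.0
print(mrr([retrieved], [relevant]))            # first relevant at rank 2 -> 0.5
```

Note how the same retrieval run can score 1.0 on recall but 0.4 on precision — the two metrics fail independently, which is why production targets specify both.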

3. Generation Metrics (RAGAS Framework)

RAGAS (Retrieval-Augmented Generation Assessment) is the standard open-source library for evaluating RAG pipelines. It uses an LLM judge to score generated answers.

Faithfulness: does the answer contain only claims that are supported by the retrieved context?

  • Detects hallucination and unsupported claims
  • RAGAS: extracts claims from the answer, verifies each against the context using an LLM judge
  • Target: > 0.9 for factual applications
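The claim-verification loop can be sketched without the LLM machinery. In the sketch below, `judge_supported` is a hypothetical stand-in for the LLM judge (here a toy string match, purely for illustration — a real judge verifies entailment):

```python
def faithfulness_score(claims, context, judge_supported):
    """Faithfulness = supported claims / total claims.

    claims: statements extracted from the generated answer (in RAGAS an
    LLM does this extraction; here they are passed in directly).
    judge_supported(claim, context) -> bool stands in for the LLM judge.
    """
    if not claims:
        return 1.0  # an answer making no claims cannot contradict the context
    supported = sum(1 for c in claims if judge_supported(c, context))
    return supported / len(claims)

# Toy judge: a claim counts as supported if it appears verbatim in the context.
toy_judge = lambda claim, context: claim.lower() in context.lower()

context = "The Eiffel Tower is 330 metres tall and was completed in 1889."
claims = [
    "the eiffel tower is 330 metres tall",  # supported by the context
    "it was designed by Gustave Eiffel",    # true, but not in this context
]
print(faithfulness_score(claims, context, toy_judge))  # 1 of 2 supported -> 0.5
```

The second claim illustrates the key property: faithfulness penalizes claims absent from the retrieved context even when they happen to be true.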

Answer Relevancy: how well does the generated answer address the question?

  • RAGAS: generates hypothetical questions from the answer, measures cosine similarity to the original question
  • Detects answers that are factually correct but don't address what was asked
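A sketch of the reverse-question trick, with hypothetical stand-ins for the embedding model (`embed`) and the question generator (`gen_questions`) — both would be real model calls in practice:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def answer_relevancy(question, answer, embed, gen_questions):
    """Mean cosine similarity between the original question and the
    hypothetical questions an LLM generates from the answer.

    embed(text) -> list[float] and gen_questions(answer) -> list[str]
    stand in for a real embedding model and a real LLM call.
    """
    q_vec = embed(question)
    sims = [cosine(q_vec, embed(h)) for h in gen_questions(answer)]
    return sum(sims) / len(sims)

# Toy stand-ins: a 2-d "embedding" and a fixed question generator.
embed = lambda t: [1.0, 0.0] if "capital" in t.lower() else [0.0, 1.0]
gen_questions = lambda a: ["What is the capital of France?",
                           "How tall is the tower?"]

score = answer_relevancy("What is the capital of France?",
                         "Paris is the capital of France.",
                         embed, gen_questions)
print(score)  # one match (sim 1.0), one miss (sim 0.0) -> 0.5
```

If the answer only supports off-topic questions, every similarity is low — which is exactly the "correct but unresponsive answer" failure this metric catches.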

Context Precision: are the retrieved chunks actually useful, in ranked order?

  • Are the most relevant chunks ranked first, or buried at position 8?
  • Important because LLMs attend better to content at the beginning of context

Context Recall: was all the information needed to answer the question present in the retrieved chunks?

  • Requires a reference answer to compare against

4. Building an Evaluation Dataset

With labeled data:

  • Use existing Q&A pairs from support tickets, FAQs, user queries with known answers
  • Manually curate a set of 100-500 representative queries with relevant document labels and reference answers

Without labeled data (LLM-assisted):

  1. Sample chunks from your corpus
  2. Ask an LLM to generate 2-3 questions that each chunk would answer
  3. Use these as synthetic (query, ground_truth_context) pairs
  4. Validate a sample manually before using for evaluation

Synthetic data is sufficient for relative comparisons (did my change improve things?) but less reliable for absolute quality benchmarks.
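The LLM-assisted pipeline above can be sketched as follows; `generate` is a hypothetical LLM client (stubbed here) that you would replace with a real API call, and the prompt wording is illustrative:

```python
import json

def build_synthetic_evalset(chunks, generate, questions_per_chunk=3):
    """Turn corpus chunks into synthetic (query, ground_truth_context) pairs.

    chunks: dict of chunk_id -> chunk text, sampled from the corpus.
    generate(prompt) -> str is a stand-in LLM call expected to return
    a JSON list of question strings.
    """
    pairs = []
    for chunk_id, text in chunks.items():
        prompt = (
            f"Write {questions_per_chunk} questions that the passage below "
            f"answers. Return a JSON list of strings.\n\nPassage:\n{text}"
        )
        for question in json.loads(generate(prompt)):
            # The chunk the question was generated from is, by construction,
            # the ground-truth context retrieval should return.
            pairs.append({"query": question,
                          "ground_truth_context": [chunk_id]})
    return pairs

# Stubbed LLM for illustration; validate a manual sample before trusting it.
stub = lambda prompt: json.dumps(["Q1?", "Q2?", "Q3?"])
pairs = build_synthetic_evalset({"chunk-1": "some passage"}, stub)
print(len(pairs))  # 3
```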

5. Online Evaluation (Production Monitoring)

Offline eval tests the system before deployment. Online eval monitors it in production.

Implicit feedback signals:

  • Session length (did the user continue engaging or abandon?)
  • Follow-up questions (a clarifying follow-up often signals the first answer was wrong)
  • Copy/paste events (users copy good answers)
  • Thumbs up/down if exposed in UI

Active monitoring:

  • Run a faithfulness checker over a sampled subset of production responses (scoring every response through the full pipeline is usually too expensive)
  • Alert when faithfulness score drops below threshold (signals data drift or model degradation)
  • Track retrieval success rate: queries where the retrieved context contained the answer vs. those where it didn't
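The threshold alert can be as simple as a rolling mean over sampled scores. A minimal sketch (window size and threshold are illustrative, not recommendations):

```python
from collections import deque

class FaithfulnessMonitor:
    """Rolling-window alert on sampled production faithfulness scores."""

    def __init__(self, window=100, threshold=0.85):
        self.scores = deque(maxlen=window)  # old scores fall off automatically
        self.threshold = threshold

    def record(self, score):
        """Record one sampled response's faithfulness score.

        Returns True if the rolling mean has dropped below the threshold
        (i.e., an alert should fire).
        """
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        return mean < self.threshold

monitor = FaithfulnessMonitor(window=3, threshold=0.85)
print(monitor.record(0.95))  # False
print(monitor.record(0.90))  # False
print(monitor.record(0.60))  # rolling mean ~0.82 -> True, alert
```

Averaging over a window rather than alerting on single responses keeps one bad answer from paging anyone, while a sustained drop (data drift, model degradation) still fires quickly.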

Shadow evaluation:

  • Run a small percentage of live queries through an updated model/retrieval configuration
  • Compare RAGAS scores before promoting to full production

Common Follow-ups

  1. "How do you set up continuous eval so you catch regressions automatically?" Add a post-deployment eval job that runs your golden eval set against the live system weekly. Alert on any metric that drops more than a threshold (e.g., faithfulness < 0.85). Tie this to a CI gate so model/prompt changes must pass eval before deploying.

  2. "What's the difference between RAGAS faithfulness and hallucination rate?" Faithfulness is a per-answer score (0-1) measuring how many claims are supported. Hallucination rate is a binary per-answer metric (hallucinated or not). Faithfulness is more granular and actionable; hallucination rate is easier to explain to stakeholders.

  3. "How do you evaluate a RAG system when you have no ground truth at all?" Use self-consistency as a proxy: ask the same question multiple times with different retrieved contexts and check if answers are consistent. High inconsistency signals unreliable retrieval or generation. Also run user studies or expert review for the first 500 questions before scaling.

