Intermediate · 5 min read

How Do You Evaluate a RAG System End-to-End?

RAG evaluation is distinct from general LLM evaluation — it requires measuring both retrieval quality and generation quality, independently and together. This article walks through the key metrics and frameworks.


Why This Is Asked

Most candidates can describe how to build a RAG system. Fewer know how to tell if it's working. RAG evaluation is more complex than general LLM eval because you have two failure modes — bad retrieval and bad generation — that produce similar symptoms (wrong answers) but require different fixes.

Key Concepts to Cover

  • The two-stage evaluation model — retrieval eval vs. generation eval
  • Retrieval metrics — precision@k, recall@k, MRR, NDCG
  • Generation metrics — faithfulness, answer relevancy, context utilization
  • RAGAS framework — the standard open-source RAG eval library
  • Building an evaluation set — with and without labeled data
  • Online vs. offline eval — production monitoring vs. pre-deployment testing

How to Approach This

1. The Two-Stage Model

RAG failures have two root causes that must be evaluated independently:

Retrieval failure: the relevant information is not in the retrieved chunks → the LLM can't answer correctly regardless of how good it is

Generation failure: the relevant chunks were retrieved but the LLM ignored them, misinterpreted them, or hallucinated instead

Separate evaluation allows targeted fixes. If retrieval metrics are good but the final answer is wrong, the problem is in prompting or generation. If retrieval metrics are poor, fix retrieval first.

2. Retrieval Metrics

Require a golden dataset: a set of (query, relevant_document_ids) pairs.

Precision@k: of the top k retrieved chunks, what fraction are relevant?

  • High precision = few irrelevant chunks in context (less noise for LLM)
  • Precision@5 = relevant_in_top_5 / 5

Recall@k: of all relevant chunks for this query, what fraction appear in the top k?

  • High recall = important information is present in context
  • Recall@5 = relevant_in_top_5 / total_relevant

MRR (Mean Reciprocal Rank): average of 1/rank of the first relevant result

  • Measures how high the first relevant chunk ranks
  • Better for use cases where one right answer is enough (Q&A, not research)

NDCG (Normalized Discounted Cumulative Gain): weighted by position and relevance grade

  • Better for multi-graded relevance (very relevant vs. somewhat relevant)

Practical targets:

  • Recall@5 > 0.85 for production RAG (you need the relevant chunk to be in context)
  • Precision@5 > 0.5 is reasonable (some noise is acceptable)
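Given a golden dataset, these metrics reduce to a few lines each. A minimal sketch in pure Python (the doc IDs and relevance grades below are illustrative, not from a real corpus):

```python
from math import log2

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunk IDs that appear in the top k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(results, relevant_sets):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for retrieved, relevant in zip(results, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(results)

def ndcg_at_k(retrieved, grades, k):
    """NDCG@k for graded relevance (grades: dict of doc_id -> grade)."""
    dcg = sum(grades.get(doc, 0) / log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# One query: the system retrieved five chunks; the golden set marks d1, d2 relevant.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 5))  # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2/2 = 1.0
print(mrr([retrieved], [relevant]))            # first relevant at rank 2 -> 0.5
```

Note how the same retrieval run can score 1.0 on recall but 0.4 on precision — the two metrics fail independently, which is why production targets specify both.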

3. Generation Metrics (RAGAS Framework)

RAGAS (Retrieval-Augmented Generation Assessment) is the standard open-source library for evaluating RAG pipelines. It uses an LLM judge to score generated answers.

Faithfulness: does the answer contain only claims that are supported by the retrieved context?

  • Detects hallucination and unsupported claims
  • RAGAS: extracts claims from the answer, verifies each against the context using an LLM judge
  • Target: > 0.9 for factual applications
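The claim-verification loop can be sketched without the LLM machinery. In the sketch below, `judge_supported` is a hypothetical stand-in for the LLM judge (here a toy string match, purely for illustration — a real judge verifies entailment):

```python
def faithfulness_score(claims, context, judge_supported):
    """Faithfulness = supported claims / total claims.

    claims: statements extracted from the generated answer (in RAGAS an
    LLM does this extraction; here they are passed in directly).
    judge_supported(claim, context) -> bool stands in for the LLM judge.
    """
    if not claims:
        return 1.0  # an answer making no claims cannot contradict the context
    supported = sum(1 for c in claims if judge_supported(c, context))
    return supported / len(claims)

# Toy judge: a claim counts as supported if it appears verbatim in the context.
toy_judge = lambda claim, context: claim.lower() in context.lower()

context = "The Eiffel Tower is 330 metres tall and was completed in 1889."
claims = [
    "the eiffel tower is 330 metres tall",  # supported by the context
    "it was designed by Gustave Eiffel",    # true, but not in this context
]
print(faithfulness_score(claims, context, toy_judge))  # 1 of 2 supported -> 0.5
```

The second claim illustrates the key property: faithfulness penalizes claims absent from the retrieved context even when they happen to be true.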

Answer Relevancy: how well does the generated answer address the question?

  • RAGAS: generates hypothetical questions from the answer, measures cosine similarity to the original question
  • Detects answers that are factually correct but don't address what was asked
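A sketch of the reverse-question trick, with hypothetical stand-ins for the embedding model (`embed`) and the question generator (`gen_questions`) — both would be real model calls in practice:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def answer_relevancy(question, answer, embed, gen_questions):
    """Mean cosine similarity between the original question and the
    hypothetical questions an LLM generates from the answer.

    embed(text) -> list[float] and gen_questions(answer) -> list[str]
    stand in for a real embedding model and a real LLM call.
    """
    q_vec = embed(question)
    sims = [cosine(q_vec, embed(h)) for h in gen_questions(answer)]
    return sum(sims) / len(sims)

# Toy stand-ins: a 2-d "embedding" and a fixed question generator.
embed = lambda t: [1.0, 0.0] if "capital" in t.lower() else [0.0, 1.0]
gen_questions = lambda a: ["What is the capital of France?",
                           "How tall is the tower?"]

score = answer_relevancy("What is the capital of France?",
                         "Paris is the capital of France.",
                         embed, gen_questions)
print(score)  # one match (sim 1.0), one miss (sim 0.0) -> 0.5
```

If the answer only supports off-topic questions, every similarity is low — which is exactly the "correct but unresponsive answer" failure this metric catches.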

Context Precision: are the retrieved chunks actually useful, in ranked order?

  • Are the most relevant chunks ranked first, or buried at position 8?
  • Important because LLMs attend better to content at the beginning of context

Context Recall: was all the information needed to answer the question present in the retrieved chunks?

  • Requires a reference answer to compare against

4. Building an Evaluation Dataset

With labeled data:

  • Use existing Q&A pairs from support tickets, FAQs, user queries with known answers
  • Manually curate a set of 100-500 representative queries with relevant document labels and reference answers

Without labeled data (LLM-assisted):

  1. Sample chunks from your corpus
  2. Ask an LLM to generate 2-3 questions that each chunk would answer
  3. Use these as synthetic (query, ground_truth_context) pairs
  4. Validate a sample manually before using for evaluation

Synthetic data is sufficient for relative comparisons (did my change improve things?) but less reliable for absolute quality benchmarks.
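The LLM-assisted pipeline above can be sketched as follows; `generate` is a hypothetical LLM client (stubbed here) that you would replace with a real API call, and the prompt wording is illustrative:

```python
import json

def build_synthetic_evalset(chunks, generate, questions_per_chunk=3):
    """Turn corpus chunks into synthetic (query, ground_truth_context) pairs.

    chunks: dict of chunk_id -> chunk text, sampled from the corpus.
    generate(prompt) -> str is a stand-in LLM call expected to return
    a JSON list of question strings.
    """
    pairs = []
    for chunk_id, text in chunks.items():
        prompt = (
            f"Write {questions_per_chunk} questions that the passage below "
            f"answers. Return a JSON list of strings.\n\nPassage:\n{text}"
        )
        for question in json.loads(generate(prompt)):
            # The chunk the question was generated from is, by construction,
            # the ground-truth context retrieval should return.
            pairs.append({"query": question,
                          "ground_truth_context": [chunk_id]})
    return pairs

# Stubbed LLM for illustration; validate a manual sample before trusting it.
stub = lambda prompt: json.dumps(["Q1?", "Q2?", "Q3?"])
pairs = build_synthetic_evalset({"chunk-1": "some passage"}, stub)
print(len(pairs))  # 3
```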

5. Online Evaluation (Production Monitoring)

Offline eval tests the system before deployment. Online eval monitors it in production.

Implicit feedback signals:

  • Session length (did the user continue engaging or abandon?)
  • Follow-up questions (a clarifying follow-up often signals the first answer was wrong)
  • Copy/paste events (users copy good answers)
  • Thumbs up/down if exposed in UI

Active monitoring:

  • Run a faithfulness checker over a sampled subset of production responses (scoring every response through the full pipeline is usually too expensive)
  • Alert when faithfulness score drops below threshold (signals data drift or model degradation)
  • Track retrieval success rate: queries where the retrieved context contained the answer vs. those where it didn't
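The threshold alert can be as simple as a rolling mean over sampled scores. A minimal sketch (window size and threshold are illustrative, not recommendations):

```python
from collections import deque

class FaithfulnessMonitor:
    """Rolling-window alert on sampled production faithfulness scores."""

    def __init__(self, window=100, threshold=0.85):
        self.scores = deque(maxlen=window)  # old scores fall off automatically
        self.threshold = threshold

    def record(self, score):
        """Record one sampled response's faithfulness score.

        Returns True if the rolling mean has dropped below the threshold
        (i.e., an alert should fire).
        """
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        return mean < self.threshold

monitor = FaithfulnessMonitor(window=3, threshold=0.85)
print(monitor.record(0.95))  # False
print(monitor.record(0.90))  # False
print(monitor.record(0.60))  # rolling mean ~0.82 -> True, alert
```

Averaging over a window rather than alerting on single responses keeps one bad answer from paging anyone, while a sustained drop (data drift, model degradation) still fires quickly.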

Shadow evaluation:

  • Run a small percentage of live queries through an updated model/retrieval configuration
  • Compare RAGAS scores before promoting to full production

Common Follow-ups

  1. "How do you set up continuous eval so you catch regressions automatically?" Add a post-deployment eval job that runs your golden eval set against the live system weekly. Alert on any metric that drops more than a threshold (e.g., faithfulness < 0.85). Tie this to a CI gate so model/prompt changes must pass eval before deploying.

  2. "What's the difference between RAGAS faithfulness and hallucination rate?" Faithfulness is a per-answer score (0-1) measuring how many claims are supported. Hallucination rate is a binary per-answer metric (hallucinated or not). Faithfulness is more granular and actionable; hallucination rate is easier to explain to stakeholders.

  3. "How do you evaluate a RAG system when you have no ground truth at all?" Use self-consistency as a proxy: ask the same question multiple times with different retrieved contexts and check if answers are consistent. High inconsistency signals unreliable retrieval or generation. Also run user studies or expert review for the first 500 questions before scaling.

