
LLM Evaluation: Metrics, Evals, and Production Monitoring

Evaluation is how you know your AI system actually works. This guide covers the metrics, frameworks, and monitoring strategies interviewers expect you to know — and how to talk about them under pressure.

What Is LLM Evaluation and Why Do Interviewers Ask About It

LLM outputs are probabilistic — the same input can produce different outputs, and "correctness" is often subjective. Evaluation is the discipline of measuring whether your AI system produces correct, useful, and safe outputs despite this uncertainty.

Interview Tip

Evaluation questions separate engineers who have shipped AI in production from those who built demos. Without eval, you are flying blind — and interviewers know it.

Three Layers of Evaluation

1. Offline evaluation
Run against a fixed dataset before deployment. Compare model outputs to ground truth or use a judge model. This is your pre-deployment safety net.
2. Online monitoring
Track live production traffic: latency, error rates, user feedback signals (thumbs up/down, regeneration rates), and automated quality scores on sampled requests.
3. Human evaluation
The gold standard, but it does not scale. Use it to calibrate automated metrics — run periodic rounds and check whether automated scores correlate with human judgment.
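The offline layer can be as simple as a loop over a fixed dataset. A minimal sketch in Python, where `run_model`, the canned answers, and the exact-match scorer are all illustrative stand-ins for your real pipeline and grader:

```python
def run_model(question: str) -> str:
    # Placeholder for your actual system (retrieval + LLM call).
    canned = {"What is 2+2?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "")

# A tiny "golden dataset": inputs with known-correct answers.
GOLDEN = [
    {"question": "What is 2+2?", "expected": "4"},
    {"question": "Capital of France?", "expected": "Paris"},
]

def offline_eval(dataset):
    results = []
    for case in dataset:
        output = run_model(case["question"])
        results.append({"question": case["question"],
                        "passed": output.strip() == case["expected"].strip()})
    accuracy = sum(r["passed"] for r in results) / len(results)
    return accuracy, results

accuracy, results = offline_eval(GOLDEN)
print(f"golden-set accuracy: {accuracy:.2f}")  # 1.00 on this toy set
```

In practice the exact-match scorer is the weakest link; for open-ended answers you would swap in a judge model, but the harness shape stays the same.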

Key Metrics

Faithfulness
Is the answer supported by the provided context? An unfaithful answer is a hallucination. Critical for RAG systems.
Answer Relevance
Does the response actually address the question? A faithful but irrelevant answer is technically grounded but useless.
Latency (TTFT)
Time to first token matters more than total generation time for streaming. Users perceive responsiveness based on TTFT.
Cost per Query
Embedding + retrieval + LLM inference + post-processing. Monitoring cost alongside quality prevents runaway spend.
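Cost per query is just a sum over those components. A toy calculation in Python, where every price and token count is a made-up illustrative number, not any vendor's actual rate:

```python
def cost_per_query(prompt_tokens, output_tokens, embed_tokens,
                   embed_price=0.0001, in_price=0.003, out_price=0.015):
    """Prices are per 1K tokens (assumed rates for illustration only)."""
    embedding = embed_tokens / 1000 * embed_price
    generation = (prompt_tokens / 1000 * in_price
                  + output_tokens / 1000 * out_price)
    return embedding + generation

# A 2K-token prompt (query + retrieved chunks) and a 300-token answer:
c = cost_per_query(prompt_tokens=2000, output_tokens=300, embed_tokens=50)
print(f"${c:.4f} per query")
```

Note how the retrieved context dominates: adding more chunks grows `prompt_tokens`, which is exactly the cost/latency tradeoff discussed below.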

The Latency-Cost-Quality Triangle

Every design decision involves trading between these three:

Change → Impact
Larger model → better quality, higher latency and cost
More retrieved chunks → better recall; more tokens mean higher cost and latency
Re-ranking step → better precision, adds 100-200 ms of latency
Smaller/distilled model → lower cost and latency, may reduce quality
Caching frequent queries → lower latency and cost, with a stale-results risk
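The caching row deserves a note: a TTL (time-to-live) bounds the stale-results risk. A minimal sketch, with all names illustrative — a shorter TTL means fresher answers but fewer cache hits:

```python
import time

class TTLCache:
    """Cache query -> answer pairs, expiring entries after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # query -> (answer, insertion timestamp)

    def get(self, query):
        entry = self.store.get(query)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]  # fresh hit: skip retrieval and generation
        return None          # miss or stale: caller re-generates

    def put(self, query, answer):
        self.store[query] = (answer, time.monotonic())

cache = TTLCache(ttl_seconds=3600)
cache.put("refund policy?", "30 days, full refund.")
print(cache.get("refund policy?"))
```

Real deployments usually add normalization or semantic matching so near-duplicate queries share an entry, at the cost of occasionally serving a subtly wrong cached answer.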
Interview Tip

Being able to articulate where you would spend your "budget" on each axis — and why — demonstrates production engineering maturity. Do not just name the tradeoffs; say which side you would pick for the given use case.

Building an Eval Suite

Golden dataset
Curated examples where you know the correct answer. Start here: 50-100 high-quality cases beat 1,000 noisy ones.
Adversarial cases
Inputs designed to trigger known failures: ambiguous queries, contradictory context, out-of-scope questions.
Regression cases
Real production failures you have fixed. Add them to the suite so they stay fixed permanently.
LLM-as-judge
A separate model evaluates outputs against criteria. Imperfect but scales. Calibrate against human ratings regularly.
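LLM-as-judge can be sketched as a rubric prompt plus a call to a second model. Here `call_judge_model` is a stub so the example runs; in practice it would be a real API call, and the rubric would be calibrated against human ratings:

```python
JUDGE_PROMPT = """Rate the answer's faithfulness to the context from 1-5.
Context: {context}
Answer: {answer}
Reply with only the number."""

def call_judge_model(prompt: str) -> str:
    # Stub standing in for a real judge-LLM API call.
    return "5" if "Paris" in prompt else "1"

def judge_faithfulness(context: str, answer: str) -> int:
    reply = call_judge_model(JUDGE_PROMPT.format(context=context, answer=answer))
    return int(reply.strip())

score = judge_faithfulness("France's capital is Paris.", "The capital is Paris.")
print(score)  # 5 with this stub
```

A production version also needs to handle the judge returning something other than a bare number, which is a common failure mode worth mentioning in an interview.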
Regression Detection

When changing models, prompts, or retrieval parameters: run your eval suite before and after. Use paired comparisons (same inputs, different configs) with sufficient sample sizes. Set minimum thresholds — e.g., faithfulness must stay above 0.85 on your golden dataset to ship.
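A ship gate built on such a threshold can be a few lines. This sketch assumes per-case faithfulness scores already exist (e.g., from a judge model); the 0.02 regression allowance is an illustrative choice:

```python
FAITHFULNESS_FLOOR = 0.85

def regression_gate(old_scores, new_scores, floor=FAITHFULNESS_FLOOR):
    """Paired comparison: the same inputs scored under old and new configs."""
    assert len(old_scores) == len(new_scores), "pairing requires identical inputs"
    old_mean = sum(old_scores) / len(old_scores)
    new_mean = sum(new_scores) / len(new_scores)
    # Count individual cases that got worse, not just the mean shift.
    regressed = sum(n < o for o, n in zip(old_scores, new_scores))
    ship = new_mean >= floor and new_mean >= old_mean - 0.02
    return {"old_mean": old_mean, "new_mean": new_mean,
            "regressed_cases": regressed, "ship": ship}

old = [0.90, 0.88, 0.92, 0.85]  # scores under the current config
new = [0.91, 0.89, 0.90, 0.87]  # same inputs under the candidate config
print(regression_gate(old, new))
```

With real data you would use far more than four cases and a significance test on the paired differences, but the gate shape is the same.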

Production Monitoring Checklist

Request metrics
Volume, error rates, latency percentiles (p50, p95, p99)
Cost tracking
Per-request cost trending, broken down by component (embedding, retrieval, generation)
Quality scores
Automated scores on sampled traffic — separate by feature area to catch localized regressions
User signals
Thumbs up/down, regeneration rates, escalation to human
Common Mistake

Aggregate quality scores hide localized regressions — your summarization feature might degrade while Q&A stays fine. Always segment metrics by use case.
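Segmenting is cheap to implement. A sketch with illustrative data, showing how a tolerable-looking aggregate hides a degraded feature:

```python
from collections import defaultdict

# Sampled quality scores tagged by feature area (illustrative numbers).
samples = [
    {"feature": "qa",            "score": 0.92},
    {"feature": "qa",            "score": 0.90},
    {"feature": "summarization", "score": 0.70},
    {"feature": "summarization", "score": 0.68},
]

def segment_scores(samples):
    by_feature = defaultdict(list)
    for s in samples:
        by_feature[s["feature"]].append(s["score"])
    return {f: sum(v) / len(v) for f, v in by_feature.items()}

aggregate = sum(s["score"] for s in samples) / len(samples)
print(f"aggregate: {aggregate:.2f}")  # looks tolerable on its own
print(segment_scores(samples))        # summarization is clearly degraded
```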

How to Explain This in an Interview

1. Frame as multi-layered
"I would approach evaluation in three layers — offline eval before deployment, automated monitoring on production traffic, and periodic human evaluation to calibrate."
2. Select metrics for the use case
Customer-facing RAG? Emphasize faithfulness and latency. Internal tool? Emphasize relevance and cost efficiency. Show you pick metrics based on context.
3. Emphasize continuous evaluation
Evaluation is not one-time. Model behavior drifts as user patterns change. Mention regression testing on every change.


What to Practice Next

Browse all LLM Evaluation & Ops interview questions for detailed walkthroughs.

Next module: AI System Design: How to Approach Any AI Architecture Question
