
LLM Evaluation: Metrics, Evals, and Production Monitoring

Evaluation is how you know your AI system actually works. This guide covers the metrics, frameworks, and monitoring strategies interviewers expect you to know — and how to talk about them under pressure.

What Is LLM Evaluation and Why Do Interviewers Ask About It

LLM outputs are probabilistic — the same input can produce different outputs, and "correctness" is often subjective. Evaluation is the discipline of measuring whether your AI system produces correct, useful, and safe outputs despite this uncertainty.

Interview Tip

Evaluation questions separate engineers who have shipped AI in production from those who built demos. Without eval, you are flying blind — and interviewers know it.

Three Layers of Evaluation

1. Offline evaluation
Run against a fixed dataset before deployment. Compare model outputs to ground truth or use a judge model. This is your pre-deployment safety net.
2. Online monitoring
Track live production traffic: latency, error rates, user feedback signals (thumbs up/down, regeneration rates), and automated quality scores on sampled requests.
3. Human evaluation
The gold standard, but it does not scale. Use it to calibrate automated metrics — run periodic rounds and check whether automated scores correlate with human judgment.
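The offline layer can be as simple as a loop over a fixed dataset. A minimal sketch in Python, where `run_model`, the canned answers, and the exact-match scorer are all illustrative stand-ins for your real pipeline and grader:

```python
def run_model(question: str) -> str:
    # Placeholder for your actual system (retrieval + LLM call).
    canned = {"What is 2+2?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "")

# A tiny "golden dataset": inputs with known-correct answers.
GOLDEN = [
    {"question": "What is 2+2?", "expected": "4"},
    {"question": "Capital of France?", "expected": "Paris"},
]

def offline_eval(dataset):
    results = []
    for case in dataset:
        output = run_model(case["question"])
        results.append({"question": case["question"],
                        "passed": output.strip() == case["expected"].strip()})
    accuracy = sum(r["passed"] for r in results) / len(results)
    return accuracy, results

accuracy, results = offline_eval(GOLDEN)
print(f"golden-set accuracy: {accuracy:.2f}")  # 1.00 on this toy set
```

In practice the exact-match scorer is the weakest link; for open-ended answers you would swap in a judge model, but the harness shape stays the same.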

Key Metrics

Faithfulness
Is the answer supported by the provided context? An unfaithful answer is a hallucination. Critical for RAG systems.
Answer Relevance
Does the response actually address the question? A faithful but irrelevant answer is technically grounded but useless.
Latency (TTFT)
Time to first token matters more than total generation time for streaming. Users perceive responsiveness based on TTFT.
Cost per Query
Embedding + retrieval + LLM inference + post-processing. Monitoring cost alongside quality prevents runaway spend.
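Cost per query is just a sum over those components. A toy calculation in Python, where every price and token count is a made-up illustrative number, not any vendor's actual rate:

```python
def cost_per_query(prompt_tokens, output_tokens, embed_tokens,
                   embed_price=0.0001, in_price=0.003, out_price=0.015):
    """Prices are per 1K tokens (assumed rates for illustration only)."""
    embedding = embed_tokens / 1000 * embed_price
    generation = (prompt_tokens / 1000 * in_price
                  + output_tokens / 1000 * out_price)
    return embedding + generation

# A 2K-token prompt (query + retrieved chunks) and a 300-token answer:
c = cost_per_query(prompt_tokens=2000, output_tokens=300, embed_tokens=50)
print(f"${c:.4f} per query")
```

Note how the retrieved context dominates: adding more chunks grows `prompt_tokens`, which is exactly the cost/latency tradeoff discussed below.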

The Latency-Cost-Quality Triangle

Every design decision involves trading between these three:

Change → Impact
Larger model → better quality, higher latency and cost
More retrieved chunks → better recall; more tokens mean higher cost and latency
Re-ranking step → better precision, adds 100-200 ms of latency
Smaller/distilled model → lower cost and latency, may reduce quality
Caching frequent queries → lower latency and cost, with a stale-results risk
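The caching row deserves a note: a TTL (time-to-live) bounds the stale-results risk. A minimal sketch, with all names illustrative — a shorter TTL means fresher answers but fewer cache hits:

```python
import time

class TTLCache:
    """Cache query -> answer pairs, expiring entries after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # query -> (answer, insertion timestamp)

    def get(self, query):
        entry = self.store.get(query)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]  # fresh hit: skip retrieval and generation
        return None          # miss or stale: caller re-generates

    def put(self, query, answer):
        self.store[query] = (answer, time.monotonic())

cache = TTLCache(ttl_seconds=3600)
cache.put("refund policy?", "30 days, full refund.")
print(cache.get("refund policy?"))
```

Real deployments usually add normalization or semantic matching so near-duplicate queries share an entry, at the cost of occasionally serving a subtly wrong cached answer.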
Interview Tip

Being able to articulate where you would spend your "budget" on each axis — and why — demonstrates production engineering maturity. Do not just name the tradeoffs; say which side you would pick for the given use case.

Building an Eval Suite

Golden dataset
Curated examples where you know the correct answer. Start here: 50-100 high-quality cases beat 1,000 noisy ones.
Adversarial cases
Inputs designed to trigger known failures: ambiguous queries, contradictory context, out-of-scope questions.
Regression cases
Real production failures you have fixed. Add them to the suite so they stay fixed permanently.
LLM-as-judge
A separate model evaluates outputs against criteria. Imperfect but scales. Calibrate against human ratings regularly.
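LLM-as-judge can be sketched as a rubric prompt plus a call to a second model. Here `call_judge_model` is a stub so the example runs; in practice it would be a real API call, and the rubric would be calibrated against human ratings:

```python
JUDGE_PROMPT = """Rate the answer's faithfulness to the context from 1-5.
Context: {context}
Answer: {answer}
Reply with only the number."""

def call_judge_model(prompt: str) -> str:
    # Stub standing in for a real judge-LLM API call.
    return "5" if "Paris" in prompt else "1"

def judge_faithfulness(context: str, answer: str) -> int:
    reply = call_judge_model(JUDGE_PROMPT.format(context=context, answer=answer))
    return int(reply.strip())

score = judge_faithfulness("France's capital is Paris.", "The capital is Paris.")
print(score)  # 5 with this stub
```

A production version also needs to handle the judge returning something other than a bare number, which is a common failure mode worth mentioning in an interview.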
Regression Detection

When changing models, prompts, or retrieval parameters: run your eval suite before and after. Use paired comparisons (same inputs, different configs) with sufficient sample sizes. Set minimum thresholds — e.g., faithfulness must stay above 0.85 on your golden dataset to ship.
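A ship gate built on such a threshold can be a few lines. This sketch assumes per-case faithfulness scores already exist (e.g., from a judge model); the 0.02 regression allowance is an illustrative choice:

```python
FAITHFULNESS_FLOOR = 0.85

def regression_gate(old_scores, new_scores, floor=FAITHFULNESS_FLOOR):
    """Paired comparison: the same inputs scored under old and new configs."""
    assert len(old_scores) == len(new_scores), "pairing requires identical inputs"
    old_mean = sum(old_scores) / len(old_scores)
    new_mean = sum(new_scores) / len(new_scores)
    # Count individual cases that got worse, not just the mean shift.
    regressed = sum(n < o for o, n in zip(old_scores, new_scores))
    ship = new_mean >= floor and new_mean >= old_mean - 0.02
    return {"old_mean": old_mean, "new_mean": new_mean,
            "regressed_cases": regressed, "ship": ship}

old = [0.90, 0.88, 0.92, 0.85]  # scores under the current config
new = [0.91, 0.89, 0.90, 0.87]  # same inputs under the candidate config
print(regression_gate(old, new))
```

With real data you would use far more than four cases and a significance test on the paired differences, but the gate shape is the same.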

Production Monitoring Checklist

Request metrics
Volume, error rates, latency percentiles (p50, p95, p99)
Cost tracking
Per-request cost trending, broken down by component (embedding, retrieval, generation)
Quality scores
Automated scores on sampled traffic — separate by feature area to catch localized regressions
User signals
Thumbs up/down, regeneration rates, escalation to human
Common Mistake

Aggregate quality scores hide localized regressions — your summarization feature might degrade while Q&A stays fine. Always segment metrics by use case.
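Segmenting is cheap to implement. A sketch with illustrative data, showing how a tolerable-looking aggregate hides a degraded feature:

```python
from collections import defaultdict

# Sampled quality scores tagged by feature area (illustrative numbers).
samples = [
    {"feature": "qa",            "score": 0.92},
    {"feature": "qa",            "score": 0.90},
    {"feature": "summarization", "score": 0.70},
    {"feature": "summarization", "score": 0.68},
]

def segment_scores(samples):
    by_feature = defaultdict(list)
    for s in samples:
        by_feature[s["feature"]].append(s["score"])
    return {f: sum(v) / len(v) for f, v in by_feature.items()}

aggregate = sum(s["score"] for s in samples) / len(samples)
print(f"aggregate: {aggregate:.2f}")  # looks tolerable on its own
print(segment_scores(samples))        # summarization is clearly degraded
```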

How to Explain This in an Interview

1. Frame as multi-layered
"I would approach evaluation in three layers — offline eval before deployment, automated monitoring on production traffic, and periodic human evaluation to calibrate."
2. Select metrics for the use case
Customer-facing RAG? Emphasize faithfulness and latency. Internal tool? Emphasize relevance and cost efficiency. Show you pick metrics based on context.
3. Emphasize continuous evaluation
Evaluation is not one-time. Model behavior drifts as user patterns change. Mention regression testing on every change.


What to Practice Next

Browse all LLM Evaluation & Ops interview questions for detailed walkthroughs.

Next module: AI System Design: How to Approach Any AI Architecture Question
