What Is LLM Evaluation and Why Do Interviewers Ask About It
LLM outputs are probabilistic — the same input can produce different outputs, and "correctness" is often subjective. Evaluation is the discipline of measuring whether your AI system produces correct, useful, and safe outputs despite this uncertainty.
Evaluation questions separate engineers who have shipped AI in production from those who have only built demos. Without evaluation, you are flying blind — and interviewers know it.
Three Layers of Evaluation
Key Metrics
The Latency-Cost-Quality Triangle
Every design decision involves trading between these three:
| Change | Impact |
|---|---|
| Larger model | Better quality, higher latency and cost |
| More retrieved chunks | Better recall, more tokens = higher cost and latency |
| Re-ranking step | Better precision, adds 100-200ms latency |
| Smaller/distilled model | Lower cost and latency, may reduce quality |
| Caching frequent queries | Lower latency and cost, risk of stale results |
Being able to articulate where you would spend your "budget" on each axis — and why — demonstrates production engineering maturity. Do not just name the tradeoffs; say which side you would pick for the given use case.
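To make the caching row concrete: a minimal sketch of a TTL cache for frequent queries, where the TTL is the knob that trades cost and latency savings against the risk of stale results. The class and function names here are hypothetical, not from any particular library.

```python
import hashlib
import time

class QueryCache:
    """Hypothetical TTL cache for frequent LLM queries.

    A longer TTL saves more model calls (lower cost and latency)
    but serves older answers (higher staleness risk).
    """

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, response)

    def _key(self, prompt: str) -> str:
        # Exact-match key; a semantic cache would embed the prompt instead.
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get(self, prompt: str):
        entry = self._store.get(self._key(prompt))
        if entry is None:
            return None
        ts, response = entry
        if time.time() - ts > self.ttl:
            return None  # expired: treat as a miss and re-query the model
        return response

    def put(self, prompt: str, response: str):
        self._store[self._key(prompt)] = (time.time(), response)

def answer(prompt, cache, call_model):
    """Check the cache before paying for a model call."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_model(prompt)
    cache.put(prompt, response)
    return response
```

For a chatbot over slowly changing docs you might pick a long TTL and accept staleness; for anything time-sensitive, a short TTL or no cache — that is exactly the "which side would you pick" judgment the table is asking for.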
Building an Eval Suite
When changing models, prompts, or retrieval parameters: run your eval suite before and after. Use paired comparisons (same inputs, different configs) with sufficient sample sizes. Set minimum thresholds — e.g., faithfulness must stay above 0.85 on your golden dataset to ship.
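The before/after comparison above can be sketched as a ship gate. This is an illustrative example, not a standard API: it takes paired per-example faithfulness scores (same golden inputs, old and new config) and enforces both an absolute floor and a maximum regression versus baseline.

```python
def regression_gate(baseline_scores, candidate_scores,
                    min_faithfulness=0.85, max_drop=0.02):
    """Hypothetical ship/no-ship gate for a config change.

    Scores are per-example faithfulness values in [0, 1],
    paired by golden-dataset input (same inputs, different configs).
    """
    assert len(baseline_scores) == len(candidate_scores), \
        "paired comparison requires the same inputs for both configs"
    base_mean = sum(baseline_scores) / len(baseline_scores)
    cand_mean = sum(candidate_scores) / len(candidate_scores)
    # Gate 1: absolute floor, e.g. mean faithfulness must stay above 0.85.
    if cand_mean < min_faithfulness:
        return False, f"mean faithfulness {cand_mean:.3f} below floor {min_faithfulness}"
    # Gate 2: relative regression against the baseline run.
    if base_mean - cand_mean > max_drop:
        return False, f"dropped {base_mean - cand_mean:.3f} vs baseline"
    return True, "ok to ship"
```

In practice the sample size matters: a handful of examples cannot distinguish a real regression from noise, so run the gate over a golden set large enough that the mean is stable.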
Production Monitoring Checklist
Aggregate quality scores hide localized regressions — your summarization feature might degrade while Q&A stays fine. Always segment metrics by use case.
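A minimal sketch of that segmentation, assuming each logged request carries a use-case tag and a quality score (the record shape here is an assumption for illustration):

```python
from collections import defaultdict

def segment_means(records):
    """Mean quality score per use case.

    Hypothetical record shape: {"use_case": "summarization", "score": 0.91}.
    """
    buckets = defaultdict(list)
    for r in records:
        buckets[r["use_case"]].append(r["score"])
    return {uc: sum(scores) / len(scores) for uc, scores in buckets.items()}
```

With records like Q&A at ~0.94 and summarization at ~0.71, the blended aggregate sits in the low 0.8s and looks acceptable — only the per-segment view exposes that summarization has degraded.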
How to Explain This in an Interview
Common Interview Questions
- Build an LLM Eval Suite — Designing a comprehensive evaluation framework
- Detect Output Regressions — Catching quality degradation on model/prompt changes
- Latency-Cost-Quality Tradeoffs — Navigating the three-way tension
- Production LLM Metrics — What to monitor in a live AI system
- RAG Evaluation Metrics — Measuring retrieval and generation quality
What to Practice Next
Browse all LLM Evaluation & Ops interview questions for detailed walkthroughs.
Next module: AI System Design: How to Approach Any AI Architecture Question