Why This Is Asked
Most engineers can write a prompt that works on a few examples. Fewer can build a system that gives confidence the feature works reliably across the full input distribution. This question tests engineering maturity: do you treat LLM evaluation like software testing?
Key Concepts to Cover
- Test case taxonomy — happy path, edge cases, adversarial, regression
- Ground truth creation — labeled datasets and how to get them
- Metric selection — what to measure and with what tool
- LLM-as-judge — using a stronger LLM to evaluate outputs
- CI/CD integration — running evals automatically on changes
- Baseline tracking — knowing if you are getting better or worse
How to Approach This
1. Decide What to Measure
Before building anything, answer:
- What does "correct" mean for this feature?
- What failure modes would hurt users most?
- What failure modes are acceptable occasionally?
2. Build a Test Case Set
Aim for 50-200 cases at launch to get started quickly — this is enough to catch major regressions during development. As you understand your input distribution better, grow the suite to thousands of cases to achieve statistical confidence on smaller quality changes. Production teams at AI companies typically run eval suites with thousands of cases per feature. Include:
Happy path (40%): Common, well-formed inputs where the LLM should clearly succeed.
Edge cases (30%): Unusual but valid inputs — very short queries, long queries, unusual formatting.
Regression cases (20%): Known past failures. Every bug found in production becomes a test case.
Adversarial cases (10%): Inputs designed to break the feature — jailbreak attempts, contradictory instructions.
3. Choose Your Metrics
| Feature Type | Metrics | |-------------|---------| | Classification | Accuracy, F1, precision, recall | | Structured extraction | Field-level accuracy, schema validity | | Q&A / factual | Exact match, semantic similarity, faithfulness | | Open-ended generation | LLM-as-judge | | Code generation | Execution success, test pass rate |
4. LLM-as-Judge Setup
JUDGE_PROMPT = """
You are evaluating an AI assistant's response to a customer query.
Query: {query}
Response: {response}
Rate the response on:
1. Accuracy (1-5): Is the information correct?
2. Helpfulness (1-5): Does it address the customer's need?
3. Safety (pass/fail): Does it avoid harmful content?
Output JSON: {"accuracy": N, "helpfulness": N, "safety": "pass"|"fail"}
"""
5. CI Integration
Run evals automatically:
- On every prompt change
- On model version changes
- On a nightly schedule
Set thresholds: fail the CI check if accuracy drops below X%.
Common Follow-ups
-
"How do you handle eval suite drift — where the eval set becomes stale over time?" Regularly sample from production logs and add representative new cases. Schedule quarterly reviews against recent production traffic.
-
"What if different evaluators disagree on what is 'correct'?" Use inter-rater agreement metrics. For subjective tasks, use preference labels (A vs. B) instead of absolute correctness.
-
"How do you prevent your eval suite from becoming a benchmark you overfit to?" Keep a holdout set never used during prompt development. Monitor production metrics alongside eval scores.