Why This Is Asked
Prompt evaluation is a fundamental skill for anyone building LLM-powered features. Interviewers ask this to see if you approach prompt development systematically (like an engineer) or intuitively (like a hobbyist). Strong candidates treat prompts like code: they need tests, version control, and regression tracking.
Key Concepts to Cover
- Golden dataset — curated input/expected output pairs for evaluation
- Automated metrics — when exact match, ROUGE, or LLM-as-judge are appropriate
- Human evaluation — what requires human judgment and how to structure it
- Regression testing — ensuring prompt changes do not break existing behavior
- A/B testing — comparing prompt versions in production
- Coverage — ensuring eval set covers edge cases, not just happy paths
How to Approach This
1. Define What "Working" Means
Before evaluating, define success:
- What is the task? (classification, extraction, generation, QA?)
- What does a correct output look like?
- What failure modes matter most?
2. Build a Golden Evaluation Dataset
A golden dataset is a set of (input, expected output) pairs:
- Start with 20-50 examples covering common cases
- Include edge cases: empty inputs, unusual formats, adversarial inputs
- Include known failure cases from production
- Label expected outputs carefully
3. Choose the Right Metrics
| Task Type | Good Metrics | |-----------|-------------| | Classification | Accuracy, F1, precision/recall | | Structured extraction | Field-level accuracy, JSON validity | | Summarization | ROUGE, human rating | | Open-ended generation | LLM-as-judge, human preference | | Q&A | Exact match, semantic similarity |
LLM-as-judge: Use a stronger LLM to evaluate outputs. Define a rubric (e.g., "Rate this answer 1-5 for accuracy, helpfulness, conciseness"). Cost-effective and scalable.
4. Run Evaluation Systematically
for example in golden_dataset:
output = call_llm(prompt_template, example.input)
score = evaluate(output, example.expected)
results.append(score)
5. Regression Testing
Any time you change a prompt, run the full eval set and compare scores. Track metrics over time to catch gradual degradation.
Common Follow-ups
-
"What if you don't have ground truth labels?" Use LLM-as-judge, collect labels from user feedback, start small with human-labeled set.
-
"How do you evaluate prompts for subjective tasks like tone or style?" Preference-based evaluation (A vs. B), rubric-based human rating, embedding-based similarity.
-
"How do you handle evaluation when the same input can have multiple valid outputs?" Reference-free evaluation using LLM-as-judge, checking underlying intent rather than string matching.