Advanced · 3 min read

How Would You Detect and Handle LLM Output Regressions?

Build a system to detect when LLM output quality degrades — covering statistical monitoring, automated quality checks, and incident response.

Daily tips, confessions & AI news. Unsubscribe anytime. Questions? [email protected]

Why This Is Asked

LLM regressions are subtle and dangerous — they do not throw exceptions; they just produce slightly worse outputs that users silently suffer through. Interviewers want to see whether you have thought about how to detect quality drops before they become disasters.

Key Concepts to Cover

  • What causes regressions — model updates, prompt changes, input drift, infrastructure changes
  • Statistical process control — detecting when metrics go outside normal bounds
  • LLM-as-judge monitoring — running quality checks on production samples
  • User signal monitoring — thumbs down rates, re-query rates as proxy metrics
  • Alerting thresholds — what warrants a page vs. an investigation
  • Incident response — steps to take when a regression is confirmed
  • Root cause analysis — isolating whether it is model, prompt, or data

How to Approach This

1. Understand the Sources of Regression

Regressions can come from:

  • Model updates: Provider silently updates the model
  • Prompt changes: A well-intentioned edit causes unexpected behavior on edge cases
  • Input distribution shift: Users start sending different types of queries
  • Infrastructure changes: Different temperature settings, max token limits, system prompt changes
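Silent provider-side changes are the hardest to spot. One low-cost defense is to log a fingerprint of the full generation config with every request, so any change, yours or the provider's, shows up as a new hash in your logs. A minimal sketch (the field names are illustrative):

```python
import hashlib
import json

def config_fingerprint(model, prompt_template, temperature, max_tokens):
    """Hash the generation config so silent changes show up as a new fingerprint."""
    payload = json.dumps(
        {"model": model, "prompt": prompt_template,
         "temperature": temperature, "max_tokens": max_tokens},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```

Group your quality metrics by fingerprint: if scores drop exactly when a new fingerprint appears, you have already isolated the cause.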

2. Statistical Monitoring

Track key metrics with control charts. For each metric, calculate rolling mean and standard deviation over 7 days. A 2-sigma threshold fires on ~5% of measurements by chance, which is too noisy for continuously monitored metrics. Use 3-sigma for single-metric alerts, or combine a lower threshold with a persistence gate (e.g., alert only if the metric stays outside bounds for 2+ consecutive hours):

import statistics

def check_regression(metric_name, current_value, historical_values):
    mean = statistics.mean(historical_values)
    std = statistics.stdev(historical_values)
    if std == 0:  # flat history: skip rather than divide by zero
        return
    z_score = (current_value - mean) / std
    if abs(z_score) > 3:  # 3-sigma: ~0.3% false positive rate
        alert(f"{metric_name} regression detected: z={z_score:.2f}")
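The persistence gate mentioned above can be sketched as a small helper that only fires after N consecutive out-of-bounds measurements (a hypothetical utility, not part of the snippet above):

```python
from collections import deque

def make_persistence_gate(required_breaches=2):
    """Return a checker that alerts only after N consecutive breaches."""
    recent = deque(maxlen=required_breaches)

    def should_alert(out_of_bounds):
        recent.append(out_of_bounds)
        # Fire only once the window is full and every entry is a breach.
        return len(recent) == required_breaches and all(recent)

    return should_alert
```

A single noisy hour resets the gate, which is exactly the behavior you want for a lower, more sensitive threshold.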

3. Automated Quality Sampling

Every hour, sample 50-100 production requests and evaluate with LLM-as-judge:

  • Run sampled inputs through a quality rubric
  • Compare average scores to the baseline from the previous week
  • Alert if score drops by more than a configured threshold
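A sketch of the hourly sampling check, assuming a `judge` callable that returns a 1-5 rubric score and a baseline average taken from the previous week (both hypothetical):

```python
import random
import statistics

def sample_and_score(production_requests, judge, sample_size=50,
                     baseline=4.2, drop_threshold=0.3):
    """Score a random sample with an LLM judge and flag a drop vs. baseline."""
    sample = random.sample(production_requests,
                           min(sample_size, len(production_requests)))
    avg = statistics.mean(judge(r) for r in sample)
    return {"avg_score": avg, "regressed": baseline - avg > drop_threshold}
```

In practice the baseline and threshold would be configured per metric, and the judge calls batched to control cost.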

4. User Signal Monitoring

Correlate user behavior changes with potential regressions:

  • Thumbs down rate: A 20% increase in downvotes signals a quality problem
  • Re-query rate: Users re-asking the same question means the first answer failed
  • Session abandonment: Users leaving after an AI response suggests disappointment
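These proxy signals can share one comparison helper. A sketch that flags any rate rising more than 20% above its baseline (the metric names are illustrative):

```python
def user_signal_alerts(window, baseline, relative_threshold=0.2):
    """Compare proxy metric rates in the current window to their baselines.

    `window` and `baseline` map metric name -> rate; a metric is flagged
    when it rises more than `relative_threshold` (20%) above baseline.
    """
    alerts = []
    for name, base in baseline.items():
        current = window.get(name, base)
        if base > 0 and (current - base) / base > relative_threshold:
            alerts.append(name)
    return alerts
```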

5. Incident Response Playbook

When a regression is detected:

  1. Confirm it is real: Check if the signal persists over 1-2 hours
  2. Assess impact: What % of users are affected? How severe?
  3. Isolate the cause: Did anything change in the last 24 hours?
  4. Mitigate fast: Roll back the most recent change if possible
  5. Fix properly: Understand root cause, fix, test in staging, canary deploy
  6. Post-mortem: Document cause, detection time, impact, and improvements

Common Follow-ups

  1. "How do you handle a regression that only affects a specific user segment?" Segment your metrics by user type and query category, build dashboards that slice metrics by those dimensions, and alert on per-segment degradation.

  2. "What if the regression is gradual — quality slowly degrading over weeks?" Compare current week's average to 4 weeks ago. Set up a weekly automated report of metric trends so gradual decay is visible even if no threshold is crossed.
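That week-over-week comparison can be sketched as follows (the 4-week lookback and threshold are illustrative defaults):

```python
def detect_gradual_drift(weekly_averages, lookback_weeks=4, threshold=0.15):
    """Flag slow decay by comparing this week's average to N weeks ago."""
    if len(weekly_averages) <= lookback_weeks:
        return False  # not enough history yet
    current = weekly_averages[-1]
    reference = weekly_averages[-1 - lookback_weeks]
    return reference - current > threshold
```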

  3. "How do you separate signal from noise in quality monitoring?" Use multiple independent signals: if LLM-as-judge score drops AND thumbs-down rate increases, that is a true regression. Build confidence intervals and only alert when the interval clearly crosses a threshold.
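The confidence-interval idea can be sketched with a normal approximation: alert only when the entire interval for the sampled metric sits below the baseline (a simplification; a real system would also account for sample size and multiple comparisons):

```python
import statistics

def mean_confidence_interval(values, z=1.96):
    """Approximate 95% CI for a metric's mean (normal approximation)."""
    mean = statistics.mean(values)
    half = z * statistics.stdev(values) / len(values) ** 0.5
    return mean - half, mean + half

def alert_on_clear_drop(values, baseline):
    """Alert only when the whole interval is below the baseline."""
    low, high = mean_confidence_interval(values)
    return high < baseline
```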

Prep the coding round too

AI knowledge is only half the picture. Rubduck helps you nail DSA and coding interviews with an AI interviewer that gives real-time feedback.