Why This Is Asked
Deploying an LLM without proper monitoring is flying blind. Interviewers ask this to see if you think about AI systems operationally — not just "does it work" but "how do you know when it stops working?"
Key Concepts to Cover
- Latency metrics — p50, p90, p99 response times, time-to-first-token
- Cost metrics — token usage, cost per request, cost per user
- Error rates — API failures, timeout rates, refusal rates
- Quality metrics — output length, format validity, LLM-as-judge scores
- User signals — thumbs up/down, follow-up query rate, session abandonment
- Drift detection — monitoring for changes in input distribution
How to Approach This
1. Operational Metrics (Infrastructure Health)
Latency:
- p50, p90, p99 end-to-end response time
- Time-to-first-token (for streaming) — users perceive this as responsiveness
- LLM API call latency separate from total request latency
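These percentiles can be computed with a nearest-rank calculation over a window of recorded request timings. A minimal sketch in Python (function name and approach are illustrative, not a specific library API):

```python
import math

def latency_percentiles(samples_ms):
    """Compute p50/p90/p99 from a list of end-to-end latencies in ms.

    Uses the nearest-rank method on the sorted sample; production systems
    typically use a streaming sketch (e.g. t-digest) instead of sorting.
    """
    ordered = sorted(samples_ms)
    def pct(p):
        # nearest-rank: the smallest value with at least p% of samples at or below it
        idx = max(0, math.ceil(p * len(ordered) / 100) - 1)
        return ordered[idx]
    return {"p50": pct(50), "p90": pct(90), "p99": pct(99)}
```

Tracking the LLM API call latency separately from this end-to-end number makes it easy to tell whether a slowdown is the provider's or your own stack's.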
Availability:
- API error rate (5xx from provider)
- Timeout rate
- Rate limit hit frequency
Cost:
- Prompt tokens per request (input cost)
- Completion tokens per request (output cost)
- Cost per request, cost per day, cost per user
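Per-call cost tracking reduces to a small calculation over token counts. A hedged sketch, assuming illustrative per-1K-token prices (real prices vary by model and provider):

```python
# Illustrative prices per 1K tokens; substitute your provider's actual rates.
PRICE_PER_1K = {"prompt": 0.0030, "completion": 0.0060}

def request_cost(prompt_tokens, completion_tokens):
    """Dollar cost of one LLM call from its token counts."""
    return (prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
            + completion_tokens / 1000 * PRICE_PER_1K["completion"])
```

Summing this per-call figure over a day or per user gives the aggregate metrics above.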
2. Output Quality Metrics
Format validity:
- If you expect JSON, what % of responses are valid JSON?
Length distribution:
- Very short responses may indicate refusal or failure
- Very long responses may indicate runaway generation
Refusal rate:
- How often does the model refuse to answer?
- Sudden spikes may indicate a model update or prompt issue
LLM-as-judge:
- Run a sample of production outputs through an evaluator daily
- Track quality scores over time with alerts on degradation
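The format-validity metric above is straightforward to compute for JSON outputs. A minimal sketch (function name is illustrative):

```python
import json

def format_validity_rate(responses):
    """Fraction of response strings that parse as valid JSON."""
    if not responses:
        return 0.0
    valid = 0
    for text in responses:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(responses)
```

A sudden drop in this rate is often the first visible symptom of a silent model update or a broken prompt template.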
3. User Satisfaction Signals
Explicit signals: Thumbs up/down ratings, "Was this helpful?" prompts.
Implicit signals:
- Follow-up query rate (users re-asking suggests the first answer failed)
- Session abandonment after AI response
- Copy rate (users copy the output — signals it was useful)
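The follow-up query rate can be estimated from session logs. A sketch, assuming query timestamps grouped by session and an illustrative 60-second re-ask window:

```python
def follow_up_rate(sessions, window_s=60):
    """Fraction of queries followed by another query within window_s seconds.

    `sessions` maps session id -> sorted list of query timestamps (seconds).
    A quick re-ask suggests the first answer did not satisfy the user.
    """
    followed, total = 0, 0
    for timestamps in sessions.values():
        for i, t in enumerate(timestamps[:-1]):
            total += 1
            if timestamps[i + 1] - t <= window_s:
                followed += 1
        if timestamps:
            total += 1  # the last query in a session has no follow-up
    return followed / total if total else 0.0
```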
4. Drift Detection
Model behavior can change without any code change:
- LLM providers silently update models
- Input distribution shifts as users adopt the feature
Monitor rolling averages of quality metrics, the response-length distribution, and the topic distribution of incoming queries, and compare them against a baseline window so that gradual degradation is caught rather than only sudden failures.
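One simple way to implement this is a rolling-mean check against a fixed baseline. A minimal sketch (class name, window size, and tolerance are illustrative):

```python
from collections import deque

class DriftMonitor:
    """Alert when the rolling mean of a quality metric drops more than
    `tolerance` below a fixed baseline established at launch."""

    def __init__(self, baseline, window=100, tolerance=0.10):
        self.baseline = baseline
        self.tolerance = tolerance
        self.window = deque(maxlen=window)

    def observe(self, score):
        """Record one quality score; return True if the drift alert fires."""
        self.window.append(score)
        mean = sum(self.window) / len(self.window)
        return mean < self.baseline * (1 - self.tolerance)
```

The same pattern applies to response length or any other scalar metric; richer setups compare full distributions (e.g. with a KS test) rather than means.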
5. Dashboard and Alerting
- Page immediately: error rate > 5%, p99 latency > 10s, cost spike > 300%
- Investigate same day: quality score drop > 10%, refusal rate spike
- Weekly review: cost trends, satisfaction trends, new failure patterns
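These tiers can be encoded as a simple threshold check. A sketch using the thresholds above (the metric keys and the function name are illustrative):

```python
def alert_tier(metrics):
    """Map a snapshot of current metrics to an alert tier.

    Thresholds mirror the tiers above: page on hard operational breaches,
    investigate on quality regressions, otherwise ok.
    """
    if (metrics.get("error_rate", 0) > 0.05
            or metrics.get("p99_latency_s", 0) > 10
            or metrics.get("cost_vs_baseline", 1.0) > 3.0):
        return "page"
    if (metrics.get("quality_drop", 0) > 0.10
            or metrics.get("refusal_spike", False)):
        return "investigate"
    return "ok"
```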
Common Follow-ups
- "How do you monitor for prompt injection attacks?" Log and scan inputs for common injection patterns, monitor for unusual output patterns, rate-limit aggressive users, and run anomaly detection on output content.
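A pattern-scanning input filter along these lines might look like the following sketch. The patterns shown are illustrative examples only; real deployments maintain a curated, evolving list and combine it with anomaly detection:

```python
import re

# Illustrative injection signatures; not an exhaustive or production list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]

def flag_injection(user_input):
    """Return True if the input matches a known injection pattern."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```

Flagged inputs should be logged with full context so new attack variants can be added to the pattern list over time.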
- "How do you attribute costs to specific features or teams?" Tag every LLM API call with metadata (feature name, team, user segment), calculate cost by tag, and show each team its own cost dashboard.
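Cost attribution by tag is a simple aggregation over a tagged call log. A sketch with an assumed log schema (the `tags` and `cost_usd` field names are illustrative):

```python
from collections import defaultdict

def cost_by_tag(call_log, tag="feature"):
    """Aggregate per-call dollar cost by a metadata tag (feature, team, etc.).

    Each log entry is assumed to carry a `tags` dict and a precomputed
    `cost_usd`; untagged calls are bucketed separately so nothing is lost.
    """
    totals = defaultdict(float)
    for call in call_log:
        totals[call["tags"].get(tag, "untagged")] += call["cost_usd"]
    return dict(totals)
```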
- "What is the most important metric to watch if you could only pick one?" A user satisfaction signal (thumbs-down rate or follow-up query rate), since it is the closest proxy for real-world quality.