Why This Is Asked
As companies mature their AI usage, they need platform-level infrastructure to decide which models to use for which tasks, handle provider outages, control costs, and maintain observability.
Key Concepts to Cover
- Model routing — selecting the right model for each request
- Fallback logic — handling provider outages or rate limit errors
- Rate limiting — per-team, per-model quotas
- Cost tracking — token usage attribution per team/feature
- Caching — semantic and exact caching to reduce costs
- Observability — logging prompts, responses, latency, and costs
- Unified API — abstracting provider-specific APIs behind a common interface
How to Approach This
1. Clarify Requirements
- How many teams/applications?
- What providers need to be supported?
- Is cost or reliability the primary concern?
- Centrally managed API keys or bring-your-own?
2. High-Level Architecture
Client Application
↓
AI Gateway API (unified interface)
├── Auth & Rate Limiter
├── Router (model selection logic)
├── Cache (exact + semantic)
├── Provider Adapters (OpenAI / Anthropic / Gemini / local)
├── Fallback Handler
└── Observability (logging, metrics, tracing)
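The unified API and provider adapters can be sketched as a common interface that every provider implements. All names below (`ProviderAdapter`, `ChatResult`, `FakeProvider`) are hypothetical; a real adapter would wrap the provider's SDK and normalize its response:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ChatResult:
    """Normalized response shape returned by every adapter."""
    text: str
    input_tokens: int
    output_tokens: int
    provider: str


class ProviderAdapter(Protocol):
    """Common interface the gateway codes against, regardless of provider."""
    name: str

    def chat(self, prompt: str, model: str) -> ChatResult: ...


class FakeProvider:
    """Stand-in adapter; a real one would call the provider's SDK
    and map its response into ChatResult."""

    def __init__(self, name: str):
        self.name = name

    def chat(self, prompt: str, model: str) -> ChatResult:
        return ChatResult(
            text=f"stubbed reply from {self.name}/{model}",
            input_tokens=len(prompt.split()),
            output_tokens=4,
            provider=self.name,
        )
```

Because the router, cache, and fallback handler all see only `ChatResult`, swapping or adding providers never touches the rest of the gateway.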
3. Model Routing
Route requests based on:
- Task type: classification → small fast model; reasoning → large model
- Cost budget: standard vs. premium tier
- Latency SLA: streaming required?
- Content policy: sensitive content → stricter safety filters
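The routing rules above can be expressed as a small decision function. The model names here are placeholders, and real routing would likely be config-driven rather than hard-coded:

```python
def route(task_type: str, tier: str = "standard", needs_streaming: bool = False) -> str:
    """Map request attributes to a model choice (model names are placeholders).

    Order matters: cheap task types short-circuit before tier is considered.
    """
    if task_type == "classification":
        return "small-fast-model"
    if task_type == "reasoning":
        return "large-model" if tier == "premium" else "mid-model"
    # Default path: streaming-capable model for interactive requests,
    # otherwise the cheapest general-purpose model.
    return "mid-model-streaming" if needs_streaming else "small-general-model"
```

Keeping routing a pure function of request attributes makes it easy to unit-test and to replay logged traffic against a proposed rule change.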
4. Caching
- Exact cache: hash the full prompt → cache response
- Semantic cache: embed the prompt, find nearest cached response above similarity threshold. Important caveat: semantic caches can return stale or incorrect responses if the cached response was generated by an older model version, with different context, or before the underlying data changed. Tag cached entries with model version and apply TTL policies — semantic caches need more aggressive invalidation than exact caches.
- Cache invalidation: TTL-based plus explicit invalidation on prompt template changes or model version upgrades
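A minimal exact-cache sketch, keying on model version plus full prompt so a model upgrade naturally invalidates old entries, with TTL bounding staleness (a semantic cache would add an embedding lookup and similarity threshold on top of this; class and method names are illustrative):

```python
import hashlib
import time


class ExactCache:
    """Exact-match cache: key on (model version, full prompt) so a model
    upgrade never serves responses generated by an older model."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[str, float]] = {}

    def _key(self, prompt: str, model_version: str) -> str:
        return hashlib.sha256(f"{model_version}\x00{prompt}".encode()).hexdigest()

    def get(self, prompt: str, model_version: str):
        entry = self._store.get(self._key(prompt, model_version))
        if entry is None:
            return None
        response, stored_at = entry
        if time.time() - stored_at > self.ttl:
            return None  # expired; caller falls through to the provider
        return response

    def put(self, prompt: str, model_version: str, response: str) -> None:
        self._store[self._key(prompt, model_version)] = (response, time.time())
```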
5. Observability
Every request logs: model used, provider, latency, token counts, cost, team attribution, cache hit status, error type.
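Those fields map naturally to one structured log line per request. A sketch, with illustrative field names, emitting JSON so the log pipeline can aggregate cost per team:

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RequestLog:
    """One structured record per gateway request (field names illustrative)."""
    model: str
    provider: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    team: str
    cache_hit: bool
    error_type: Optional[str] = None
    timestamp: float = 0.0


entry = RequestLog(model="small-fast-model", provider="openai",
                   latency_ms=120.5, input_tokens=42, output_tokens=180,
                   cost_usd=0.0009, team="search", cache_hit=False,
                   timestamp=time.time())
line = json.dumps(asdict(entry))  # one JSON line per request
```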
Common Follow-ups
- "How would you handle a provider outage mid-request?" Retry with exponential backoff, then fail over to the next provider; use a circuit breaker to skip providers that keep failing.
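A minimal sketch of retry-with-backoff plus provider failover (function names are hypothetical; a production version would catch provider-specific errors and track consecutive failures per provider to open a circuit breaker):

```python
import time


def call_with_fallback(providers, prompt, max_attempts=3, base_delay=0.5):
    """Try providers in priority order; retry each with exponential
    backoff before failing over to the next one."""
    last_err = None
    for call in providers:
        for attempt in range(max_attempts):
            try:
                return call(prompt)
            except Exception as err:  # real code: catch provider-specific errors
                last_err = err
                time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("all providers exhausted") from last_err
```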
- "How do you prevent one team from consuming all quota?" Per-team token budgets with hard and soft limits, priority queuing, real-time quota dashboards.
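Per-team budgets are often implemented as a token bucket. A sketch of the hard-limit case (a soft limit would alert before denying; class and field names are illustrative):

```python
import time


class TeamQuota:
    """Per-team token bucket: each team may burst up to `capacity` tokens,
    refilled continuously at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self._state: dict[str, tuple[float, float]] = {}  # team -> (tokens, last_seen)

    def allow(self, team: str, tokens_requested: float) -> bool:
        now = time.monotonic()
        tokens, last = self._state.get(team, (self.capacity, now))
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens_requested <= tokens:
            self._state[team] = (tokens - tokens_requested, now)
            return True
        self._state[team] = (tokens, now)
        return False
```

Because state is keyed per team, one team draining its bucket never affects another team's budget.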
- "How would you support streaming responses through the gateway?" SSE or WebSocket pass-through, buffering strategies, token-count attribution at completion.