Advanced · 3 min read

How Would You Architect a Multi-Model AI Gateway?

Design a unified gateway that routes requests across multiple LLM providers, handles fallbacks, enforces rate limits, and tracks costs per team.

Daily tips, confessions & AI news. Unsubscribe anytime. Questions? [email protected]

Why This Is Asked

As companies mature their AI usage, they need platform-level infrastructure to decide which models to use for which tasks, handle provider outages, control costs, and maintain observability.

Key Concepts to Cover

  • Model routing — selecting the right model for each request
  • Fallback logic — handling provider outages or rate limit errors
  • Rate limiting — per-team, per-model quotas
  • Cost tracking — token usage attribution per team/feature
  • Caching — semantic and exact caching to reduce costs
  • Observability — logging prompts, responses, latency, and costs
  • Unified API — abstracting provider-specific APIs behind a common interface

How to Approach This

1. Clarify Requirements

  • How many teams/applications?
  • What providers need to be supported?
  • Is cost or reliability the primary concern?
  • Centrally managed API keys or bring-your-own?

2. High-Level Architecture

Client Application
      ↓
AI Gateway API (unified interface)
      ├── Auth & Rate Limiter
      ├── Router (model selection logic)
      ├── Cache (exact + semantic)
      ├── Provider Adapters (OpenAI / Anthropic / Gemini / local)
      ├── Fallback Handler
      └── Observability (logging, metrics, tracing)
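The unified-API layer in the diagram above can be sketched as a common request/response shape plus one adapter per provider. This is a minimal illustration, not a real SDK integration: the class names, field names, and the stub adapter are all assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ChatRequest:
    prompt: str
    model: str
    team: str


@dataclass
class ChatResponse:
    text: str
    provider: str
    input_tokens: int
    output_tokens: int


class ProviderAdapter(ABC):
    """One adapter per provider, normalized behind a common signature."""

    @abstractmethod
    def complete(self, req: ChatRequest) -> ChatResponse: ...


class StubAdapter(ProviderAdapter):
    """Stands in for a real SDK call; a production adapter would translate
    ChatRequest into the provider's native request format here."""

    def __init__(self, provider_name: str):
        self.provider_name = provider_name

    def complete(self, req: ChatRequest) -> ChatResponse:
        return ChatResponse(
            text=f"[{self.provider_name}] reply",
            provider=self.provider_name,
            input_tokens=len(req.prompt.split()),
            output_tokens=3,
        )


adapters = {"openai": StubAdapter("openai"), "anthropic": StubAdapter("anthropic")}
resp = adapters["openai"].complete(ChatRequest("hello world", "gpt-4o", "team-a"))
```

Because every adapter returns the same `ChatResponse`, the router, cache, and observability layers never see provider-specific payloads.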

3. Model Routing

Route requests based on:

  • Task type: classification → small fast model; reasoning → large model
  • Cost budget: standard vs. premium tier
  • Latency SLA: streaming required?
  • Content policy: sensitive content → stricter safety filters
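The routing rules above reduce to a decision function. A minimal sketch, where the model names, tier labels, and rule order are illustrative assumptions:

```python
def route(task_type: str, tier: str = "standard", needs_streaming: bool = False) -> str:
    """Pick a model ID from task type, cost tier, and latency needs."""
    if task_type == "classification":
        return "small-fast-model"  # cheap, low-latency default for simple tasks
    if task_type == "reasoning":
        # Cost budget decides between the premium and standard reasoning model.
        return "large-model" if tier == "premium" else "mid-model"
    # Fallthrough: honor the latency SLA with a streaming-capable model if needed.
    return "mid-model-streaming" if needs_streaming else "mid-model"


choice = route("reasoning", tier="premium")
```

In practice the rule table would live in config so teams can adjust routing without redeploying the gateway.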

4. Caching

  • Exact cache: hash the full prompt → cache response
  • Semantic cache: embed the prompt, find nearest cached response above similarity threshold. Important caveat: semantic caches can return stale or incorrect responses if the cached response was generated by an older model version, with different context, or before the underlying data changed. Tag cached entries with model version and apply TTL policies — semantic caches need more aggressive invalidation than exact caches.
  • Cache invalidation: TTL-based plus explicit invalidation on prompt template changes or model version upgrades
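The semantic-cache caveats above can be sketched in a few lines. A real system would compare embedding vectors; here word-overlap (Jaccard) similarity stands in for cosine similarity, and the threshold and TTL values are illustrative assumptions. Note how entries tagged with a different model version are skipped, per the invalidation rules above.

```python
import time


class SemanticCache:
    def __init__(self, threshold: float = 0.8, ttl_s: float = 3600.0):
        self.threshold = threshold
        self.ttl_s = ttl_s
        # Each entry: (prompt word set, response, model version, stored-at time)
        self.entries = []

    @staticmethod
    def _similarity(a: set, b: set) -> float:
        # Jaccard overlap as a stand-in for embedding cosine similarity.
        return len(a & b) / len(a | b) if a | b else 0.0

    def get(self, prompt: str, model_version: str):
        words = set(prompt.lower().split())
        now = time.time()
        for entry_words, response, version, stored_at in self.entries:
            if version != model_version or now - stored_at > self.ttl_s:
                continue  # stale: wrong model version or past TTL
            if self._similarity(words, entry_words) >= self.threshold:
                return response
        return None

    def put(self, prompt: str, response: str, model_version: str):
        self.entries.append(
            (set(prompt.lower().split()), response, model_version, time.time())
        )


cache = SemanticCache(threshold=0.5)
cache.put("what is the capital of France", "Paris", "v1")
hit = cache.get("what is the capital of France today", "v1")   # near-duplicate
miss = cache.get("what is the capital of France", "v2")        # model upgraded
```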

5. Observability

Every request logs: model used, provider, latency, token counts, cost, team attribution, cache hit status, error type.
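That per-request record maps naturally onto a single structured log entry. The field names below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RequestLog:
    """One structured entry per gateway request, emitted after completion."""
    model: str
    provider: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    team: str            # attribution for per-team cost reporting
    cache_hit: bool
    error_type: Optional[str] = None  # None on success


log = RequestLog("gpt-4o", "openai", 850.0, 1200, 340, 0.0121, "search-team", False)
record = asdict(log)  # ready to ship as JSON to the logging pipeline
```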

Common Follow-ups

  1. "How would you handle a provider outage mid-request?" Retry with exponential backoff, then fail over to the next provider; apply the circuit breaker pattern so a provider that keeps failing is skipped entirely.

  2. "How do you prevent one team from consuming all quota?" Per-team token budgets with hard and soft limits, priority queuing, real-time quota dashboards.

  3. "How would you support streaming responses through the gateway?" SSE or WebSocket pass-through, buffering strategies, token count attribution at completion.
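The retry-then-failover answer in follow-up 1 can be sketched as follows. The backoff base, failure threshold, and provider list are illustrative assumptions:

```python
import time


class CircuitBreaker:
    """Opens a provider's circuit after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = {}

    def is_open(self, provider: str) -> bool:
        return self.failures.get(provider, 0) >= self.threshold

    def record(self, provider: str, ok: bool):
        self.failures[provider] = 0 if ok else self.failures.get(provider, 0) + 1


def call_with_fallback(providers, breaker, retries=2, base_delay=0.01):
    """Try each provider in priority order with exponential-backoff retries."""
    for name, fn in providers:
        if breaker.is_open(name):
            continue  # skip providers that keep failing
        for attempt in range(retries + 1):
            try:
                result = fn()
                breaker.record(name, True)
                return name, result
            except Exception:
                breaker.record(name, False)
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all providers failed")


def always_down():
    raise RuntimeError("provider outage")


breaker = CircuitBreaker(threshold=3)
providers = [("openai", always_down), ("anthropic", lambda: "ok")]
used, result = call_with_fallback(providers, breaker)
```

After the primary exhausts its retries, the breaker is open and subsequent requests skip it without paying the retry latency.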
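For follow-up 2, the hard/soft limit distinction can be sketched as a small quota manager; the limit values and return labels are illustrative:

```python
class QuotaManager:
    """Per-team token budgets: warn past the soft limit, reject past the hard one."""

    def __init__(self, hard_limit: int, soft_limit: int):
        self.hard = hard_limit
        self.soft = soft_limit
        self.used = {}

    def charge(self, team: str, tokens: int) -> str:
        total = self.used.get(team, 0) + tokens
        if total > self.hard:
            return "rejected"  # hard limit: block before spending tokens
        self.used[team] = total
        return "warn" if total > self.soft else "ok"


quota = QuotaManager(hard_limit=1000, soft_limit=800)
first = quota.charge("team-a", 700)   # under both limits
second = quota.charge("team-a", 200)  # 900 tokens: past soft limit
third = quota.charge("team-a", 200)   # would reach 1100: past hard limit
```

The soft limit drives alerts and dashboards before anyone is blocked; only the hard limit rejects traffic.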
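For follow-up 3, the key point is that chunks are forwarded immediately while token attribution waits until the stream completes. A minimal sketch, using word count as a crude stand-in for real tokenization:

```python
def stream_through(upstream_chunks, usage_log, team):
    """Forward chunks to the client as they arrive; attribute usage at the end."""
    count = 0
    for chunk in upstream_chunks:
        count += len(chunk.split())  # crude token proxy, for illustration only
        yield chunk                  # pass through immediately, no buffering
    # Attribution happens once, after the full response is known.
    usage_log[team] = usage_log.get(team, 0) + count


usage = {}
chunks = list(stream_through(["Hello", "world from", "the gateway"], usage, "team-a"))
```

In a real gateway the generator would sit behind an SSE or WebSocket endpoint, and the final usage numbers would come from the provider's completion event rather than a local word count.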
