Advanced · 3 min read

How Would You Architect a Multi-Model AI Gateway?

Design a unified gateway that routes requests across multiple LLM providers, handles fallbacks, enforces rate limits, and tracks costs per team.

Daily tips, confessions & AI news. Unsubscribe anytime. Questions? [email protected]

Why This Is Asked

As companies mature their AI usage, they need platform-level infrastructure to decide which models to use for which tasks, handle provider outages, control costs, and maintain observability.

Key Concepts to Cover

  • Model routing — selecting the right model for each request
  • Fallback logic — handling provider outages or rate limit errors
  • Rate limiting — per-team, per-model quotas
  • Cost tracking — token usage attribution per team/feature
  • Caching — semantic and exact caching to reduce costs
  • Observability — logging prompts, responses, latency, and costs
  • Unified API — abstracting provider-specific APIs behind a common interface

How to Approach This

1. Clarify Requirements

  • How many teams/applications?
  • What providers need to be supported?
  • Is cost or reliability the primary concern?
  • Centrally managed API keys or bring-your-own?

2. High-Level Architecture

Client Application
      ↓
AI Gateway API (unified interface)
      ├── Auth & Rate Limiter
      ├── Router (model selection logic)
      ├── Cache (exact + semantic)
      ├── Provider Adapters (OpenAI / Anthropic / Gemini / local)
      ├── Fallback Handler
      └── Observability (logging, metrics, tracing)
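The unified-API layer in the diagram above can be sketched as a common request/response shape plus one adapter per provider. This is a minimal illustration, not a real SDK integration: the class names, field names, and the stub adapter are all assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ChatRequest:
    prompt: str
    model: str
    team: str


@dataclass
class ChatResponse:
    text: str
    provider: str
    input_tokens: int
    output_tokens: int


class ProviderAdapter(ABC):
    """One adapter per provider, normalized behind a common signature."""

    @abstractmethod
    def complete(self, req: ChatRequest) -> ChatResponse: ...


class StubAdapter(ProviderAdapter):
    """Stands in for a real SDK call; a production adapter would translate
    ChatRequest into the provider's native request format here."""

    def __init__(self, provider_name: str):
        self.provider_name = provider_name

    def complete(self, req: ChatRequest) -> ChatResponse:
        return ChatResponse(
            text=f"[{self.provider_name}] reply",
            provider=self.provider_name,
            input_tokens=len(req.prompt.split()),
            output_tokens=3,
        )


adapters = {"openai": StubAdapter("openai"), "anthropic": StubAdapter("anthropic")}
resp = adapters["openai"].complete(ChatRequest("hello world", "gpt-4o", "team-a"))
```

Because every adapter returns the same `ChatResponse`, the router, cache, and observability layers never see provider-specific payloads.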

3. Model Routing

Route requests based on:

  • Task type: classification → small fast model; reasoning → large model
  • Cost budget: standard vs. premium tier
  • Latency SLA: streaming required?
  • Content policy: sensitive content → stricter safety filters
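The routing rules above reduce to a decision function. A minimal sketch, where the model names, tier labels, and rule order are illustrative assumptions:

```python
def route(task_type: str, tier: str = "standard", needs_streaming: bool = False) -> str:
    """Pick a model ID from task type, cost tier, and latency needs."""
    if task_type == "classification":
        return "small-fast-model"  # cheap, low-latency default for simple tasks
    if task_type == "reasoning":
        # Cost budget decides between the premium and standard reasoning model.
        return "large-model" if tier == "premium" else "mid-model"
    # Fallthrough: honor the latency SLA with a streaming-capable model if needed.
    return "mid-model-streaming" if needs_streaming else "mid-model"


choice = route("reasoning", tier="premium")
```

In practice the rule table would live in config so teams can adjust routing without redeploying the gateway.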

4. Caching

  • Exact cache: hash the full prompt → cache response
  • Semantic cache: embed the prompt, find nearest cached response above similarity threshold. Important caveat: semantic caches can return stale or incorrect responses if the cached response was generated by an older model version, with different context, or before the underlying data changed. Tag cached entries with model version and apply TTL policies — semantic caches need more aggressive invalidation than exact caches.
  • Cache invalidation: TTL-based plus explicit invalidation on prompt template changes or model version upgrades
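The semantic-cache caveats above can be sketched in a few lines. A real system would compare embedding vectors; here word-overlap (Jaccard) similarity stands in for cosine similarity, and the threshold and TTL values are illustrative assumptions. Note how entries tagged with a different model version are skipped, per the invalidation rules above.

```python
import time


class SemanticCache:
    def __init__(self, threshold: float = 0.8, ttl_s: float = 3600.0):
        self.threshold = threshold
        self.ttl_s = ttl_s
        # Each entry: (prompt word set, response, model version, stored-at time)
        self.entries = []

    @staticmethod
    def _similarity(a: set, b: set) -> float:
        # Jaccard overlap as a stand-in for embedding cosine similarity.
        return len(a & b) / len(a | b) if a | b else 0.0

    def get(self, prompt: str, model_version: str):
        words = set(prompt.lower().split())
        now = time.time()
        for entry_words, response, version, stored_at in self.entries:
            if version != model_version or now - stored_at > self.ttl_s:
                continue  # stale: wrong model version or past TTL
            if self._similarity(words, entry_words) >= self.threshold:
                return response
        return None

    def put(self, prompt: str, response: str, model_version: str):
        self.entries.append(
            (set(prompt.lower().split()), response, model_version, time.time())
        )


cache = SemanticCache(threshold=0.5)
cache.put("what is the capital of France", "Paris", "v1")
hit = cache.get("what is the capital of France today", "v1")   # near-duplicate
miss = cache.get("what is the capital of France", "v2")        # model upgraded
```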

5. Observability

Every request logs: model used, provider, latency, token counts, cost, team attribution, cache hit status, error type.
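That per-request record maps naturally onto a single structured log entry. The field names below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RequestLog:
    """One structured entry per gateway request, emitted after completion."""
    model: str
    provider: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    team: str            # attribution for per-team cost reporting
    cache_hit: bool
    error_type: Optional[str] = None  # None on success


log = RequestLog("gpt-4o", "openai", 850.0, 1200, 340, 0.0121, "search-team", False)
record = asdict(log)  # ready to ship as JSON to the logging pipeline
```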

Common Follow-ups

  1. "How would you handle a provider outage mid-request?" Retry with exponential backoff, then fail over to the next provider; apply the circuit breaker pattern so a provider that keeps failing is skipped entirely.

  2. "How do you prevent one team from consuming all quota?" Per-team token budgets with hard and soft limits, priority queuing, real-time quota dashboards.

  3. "How would you support streaming responses through the gateway?" SSE or WebSocket pass-through, buffering strategies, token count attribution at completion.
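The retry-then-failover answer in follow-up 1 can be sketched as follows. The backoff base, failure threshold, and provider list are illustrative assumptions:

```python
import time


class CircuitBreaker:
    """Opens a provider's circuit after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = {}

    def is_open(self, provider: str) -> bool:
        return self.failures.get(provider, 0) >= self.threshold

    def record(self, provider: str, ok: bool):
        self.failures[provider] = 0 if ok else self.failures.get(provider, 0) + 1


def call_with_fallback(providers, breaker, retries=2, base_delay=0.01):
    """Try each provider in priority order with exponential-backoff retries."""
    for name, fn in providers:
        if breaker.is_open(name):
            continue  # skip providers that keep failing
        for attempt in range(retries + 1):
            try:
                result = fn()
                breaker.record(name, True)
                return name, result
            except Exception:
                breaker.record(name, False)
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all providers failed")


def always_down():
    raise RuntimeError("provider outage")


breaker = CircuitBreaker(threshold=3)
providers = [("openai", always_down), ("anthropic", lambda: "ok")]
used, result = call_with_fallback(providers, breaker)
```

After the primary exhausts its retries, the breaker is open and subsequent requests skip it without paying the retry latency.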
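For follow-up 2, the hard/soft limit distinction can be sketched as a small quota manager; the limit values and return labels are illustrative:

```python
class QuotaManager:
    """Per-team token budgets: warn past the soft limit, reject past the hard one."""

    def __init__(self, hard_limit: int, soft_limit: int):
        self.hard = hard_limit
        self.soft = soft_limit
        self.used = {}

    def charge(self, team: str, tokens: int) -> str:
        total = self.used.get(team, 0) + tokens
        if total > self.hard:
            return "rejected"  # hard limit: block before spending tokens
        self.used[team] = total
        return "warn" if total > self.soft else "ok"


quota = QuotaManager(hard_limit=1000, soft_limit=800)
first = quota.charge("team-a", 700)   # under both limits
second = quota.charge("team-a", 200)  # 900 tokens: past soft limit
third = quota.charge("team-a", 200)   # would reach 1100: past hard limit
```

The soft limit drives alerts and dashboards before anyone is blocked; only the hard limit rejects traffic.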
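For follow-up 3, the key point is that chunks are forwarded immediately while token attribution waits until the stream completes. A minimal sketch, using word count as a crude stand-in for real tokenization:

```python
def stream_through(upstream_chunks, usage_log, team):
    """Forward chunks to the client as they arrive; attribute usage at the end."""
    count = 0
    for chunk in upstream_chunks:
        count += len(chunk.split())  # crude token proxy, for illustration only
        yield chunk                  # pass through immediately, no buffering
    # Attribution happens once, after the full response is known.
    usage_log[team] = usage_log.get(team, 0) + count


usage = {}
chunks = list(stream_through(["Hello", "world from", "the gateway"], usage, "team-a"))
```

In a real gateway the generator would sit behind an SSE or WebSocket endpoint, and the final usage numbers would come from the provider's completion event rather than a local word count.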
