
Design a Production LLM Chat System (Design ChatGPT)

Walk through the architecture of a production LLM-powered chat system — covering streaming responses, conversation history management, context window limits, multi-user scaling, and safety.

Prep for the full interview loop

Know the concepts. Now prove it. Practice GenAI, Coding, System Design, and AI/ML Design interviews with an AI that tells you exactly where you fell short.

Start a mock interview

Why This Is Asked

"Design ChatGPT" (or a variant: "design an LLM chatbot at scale") is one of the most common senior AI system design questions. It tests depth across streaming protocols, stateful session management, context window engineering, and large-scale serving — all in one question.

Key Concepts to Cover

  • Streaming — why and how responses are streamed to users
  • Session and conversation management — storing and retrieving history
  • Context window management — handling long conversations that exceed token limits
  • Multi-user scaling — connection management, load balancing, isolation
  • Safety and guardrails — input/output filtering layers
  • Cost and latency — the key production tradeoffs

How to Approach This

1. Clarify Requirements First

Ask before designing:

  • What scale? (1K concurrent users vs. 10M DAU)
  • What modalities? (text only, or also images/files/voice?)
  • Is memory across sessions required? (should the assistant remember the user's past conversations?)
  • What latency SLA? (first token < 500ms?)
  • Any compliance requirements? (PII handling, data residency)

2. High-Level Architecture

Client
  ↓ HTTPS / WebSocket
API Gateway (auth, rate limiting)
  ↓
Chat Service
  ├─ Session Store (Redis) — active conversation state
  ├─ History DB (Postgres/DynamoDB) — persisted messages
  └─ LLM Gateway (load balancing across LLM endpoints)
        ↓
    LLM API / Self-hosted inference cluster
        ↓ (streaming tokens)
    Chat Service → SSE/WebSocket → Client

3. Streaming: Why It Matters

Users perceive streaming as dramatically faster even at the same total latency. A response that takes 5 seconds to generate but streams word-by-word often feels more responsive than a 2-second response that appears all at once, because the time to first visible token drops to a few hundred milliseconds.

Implementation:

  • LLM APIs support streaming via Server-Sent Events (SSE)
  • Backend reads the stream and forwards tokens to the client over WebSocket or SSE
  • Frontend renders tokens as they arrive

Tradeoff: streaming complicates error handling — if the model fails mid-stream, you must handle a partial response gracefully. Buffer the full response server-side if you need to run output filters (safety, PII) before sending to the client.
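The forwarding-plus-buffering pattern above can be sketched as follows. This is a minimal illustration, not a production server: `forward_stream`, the `send` callback, and the simulated `flaky` stream are all hypothetical names, and a real backend would read SSE frames from the LLM API rather than a plain iterator.

```python
from typing import Callable, Iterator, List, Tuple

def forward_stream(
    token_iter: Iterator[str],
    send: Callable[[str], None],
) -> Tuple[List[str], bool]:
    """Forward tokens to the client as they arrive while keeping a
    server-side buffer (for output filters and history persistence).
    Returns (buffered_tokens, completed) so the caller can store a
    partial response gracefully if the model fails mid-stream."""
    buffered: List[str] = []
    try:
        for token in token_iter:
            buffered.append(token)   # server-side copy
            send(token)              # e.g. write an SSE "data:" frame to the client
        return buffered, True
    except ConnectionError:
        # Upstream dropped mid-stream: surface the partial response, flagged as incomplete
        return buffered, False

def flaky() -> Iterator[str]:
    """Simulated model stream that fails after two tokens."""
    yield "RAG "
    yield "is "
    raise ConnectionError("upstream dropped")
```

The key design point is that the client sees tokens immediately while the server retains the full (possibly partial) text for persistence and post-hoc filtering.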

4. Conversation History Management

Each request to the LLM must include prior turns as context. This is state.

Storage layer:

  • Active session: Redis (fast reads, TTL for cleanup)
  • Persisted history: Postgres or DynamoDB (long-term storage per user)

Message format (OpenAI-compatible):

[
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is RAG?"},
  {"role": "assistant", "content": "RAG is..."},
  {"role": "user", "content": "How do I implement it?"}
]

Each request reconstructs this array from stored history and sends it to the LLM.
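That reconstruction step can be sketched in a few lines. `build_messages` is a hypothetical helper; in practice `history` would be loaded from Redis (active session) or the history DB.

```python
from typing import Dict, List

def build_messages(
    system_prompt: str,
    history: List[Dict[str, str]],
    new_user_msg: str,
) -> List[Dict[str, str]]:
    """Rebuild the OpenAI-style messages array for one LLM request:
    system prompt first, then stored prior turns, then the new message."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)  # prior turns loaded from the session store
    messages.append({"role": "user", "content": new_user_msg})
    return messages
```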

5. Context Window Management

Conversations grow indefinitely; context windows are finite (128K tokens is large but not unlimited). Strategies:

Rolling window: keep only the last N turns. Simple, but loses early context (user's name, goals set at the start).

Summarization: when history exceeds a threshold, use the LLM to summarize older turns into a compact memory block. Inject the summary at the start of context. Preserves key information.

RAG over history: for very long sessions, embed past messages in a vector store and retrieve relevant ones per query. Expensive to build, powerful for long-running agents.

Tiered approach (recommended):

  1. Always include system prompt + last 5 turns (recency bias)
  2. Inject a compressed summary of older turns
  3. Retrieve up to 3 semantically relevant older turns if the query seems to reference past context
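Steps 1 and 2 of the tiered approach can be sketched as below (step 3, semantic retrieval, would need a vector store and is omitted). `assemble_context` and its summary-injection format are illustrative assumptions, not a fixed API.

```python
from typing import Dict, List

def assemble_context(
    system_prompt: str,
    summary: str,
    turns: List[Dict[str, str]],
    recent_n: int = 5,
) -> List[Dict[str, str]]:
    """Tiered context assembly: system prompt, then a compressed summary
    of older turns (only when some turns fall outside the window), then
    the last `recent_n` turns verbatim."""
    messages = [{"role": "system", "content": system_prompt}]
    if len(turns) > recent_n and summary:
        # Older turns were dropped; inject their LLM-generated summary instead
        messages.append(
            {"role": "system", "content": f"Conversation so far (summary): {summary}"}
        )
    messages.extend(turns[-recent_n:])  # recency bias: keep the newest turns verbatim
    return messages
```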

6. Multi-User Scaling

  • Stateless API tier: each chat request includes session ID; service loads history from Redis/DB
  • WebSocket management: maintain connection pools; route user connections to the same backend instance to avoid re-establishing WebSocket sessions (sticky sessions or a connection registry)
  • LLM request queue: under load, queue LLM requests rather than dropping them; prioritize paying/premium users
  • Rate limiting: per-user token budget (daily/monthly), per-IP request limiting
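The per-user token budget can be sketched with an in-memory counter. This `TokenBudget` class is a hypothetical illustration; a production deployment would hold these counters in Redis (e.g. an atomic increment on a key that expires at end of day) so the stateless API tier shares state.

```python
import time
from collections import defaultdict
from typing import Dict, Optional, Tuple

class TokenBudget:
    """Per-user daily token budget (in-memory sketch)."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        # (user_id, day_number) -> tokens consumed that day
        self.used: Dict[Tuple[str, int], int] = defaultdict(int)

    def try_consume(self, user_id: str, tokens: int, now: Optional[float] = None) -> bool:
        """Atomically check-and-consume against today's budget.
        Returns False (request should be rejected) when over budget."""
        day = int((now if now is not None else time.time()) // 86400)
        key = (user_id, day)
        if self.used[key] + tokens > self.daily_limit:
            return False
        self.used[key] += tokens
        return True
```

Keying by day means budgets reset naturally at the day boundary without a cleanup job.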

7. Safety and Guardrails

Two layers:

Input guardrails (before sending to LLM):

  • Content classifier: detect prompt injection, jailbreaks, policy violations
  • PII detection: optionally redact before sending to external LLM APIs
  • Fast, low-latency — run with a small classifier model or rule-based filter

Output guardrails (before sending to user):

  • Toxicity filter on model output
  • PII detection on responses (in case model echoes sensitive input)
  • Safety threshold varies by use case (consumer product vs. developer tool)

Tradeoff: output filters require buffering the full response, adding latency and breaking streaming UX. Consider async post-processing + flagging rather than blocking for most cases.
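The "fast, rule-based pre-filter" on the input side can be sketched as below. The patterns shown are illustrative examples only; in production a small classifier model would sit behind this cheap first pass, since regexes alone are easy to evade.

```python
import re

# Illustrative patterns for common injection/extraction attempts (not exhaustive)
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore (all |your )?previous instructions",
        r"reveal (the |your )?system prompt",
    )
]

def screen_input(text: str) -> bool:
    """Cheap rule-based input guardrail. Returns True if the input may
    proceed to the LLM; False routes it to refusal or deeper review."""
    return not any(p.search(text) for p in INJECTION_PATTERNS)
```

Because this runs before the LLM call, it adds microseconds rather than the hundreds of milliseconds a classifier model would, which is why layering (rules first, model second) is common.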

Common Follow-ups

  1. "How would you handle a conversation that references an image the user uploaded?" Store the image in object storage (S3); include a reference ID in the conversation history. At query time, fetch the image, encode it as base64, and include it in the multimodal LLM request alongside the text history.

  2. "How do you prevent users from extracting the system prompt?" You can't fully prevent it — the LLM has access to the system prompt and can be instructed to reveal it. Mitigations: instruct the model not to repeat the system prompt, detect extraction attempts server-side, and avoid placing secrets in the prompt in the first place — but accept leakage as a known limitation.

  3. "How would you design user memory across sessions?" Extract facts from conversations (name, preferences, past topics) using a structured extraction prompt. Store as a user profile in Postgres. Inject relevant profile facts as part of the system prompt for future sessions.
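The injection half of that memory design can be sketched as follows. `system_prompt_with_memory` and the flat key/value profile schema are hypothetical; the profile itself would come from the structured-extraction step and live in Postgres.

```python
from typing import Dict

def system_prompt_with_memory(base_prompt: str, profile: Dict[str, str]) -> str:
    """Append stored user-profile facts to the system prompt for a new
    session. Returns the base prompt unchanged when no facts are stored."""
    if not profile:
        return base_prompt
    facts = "; ".join(f"{k}: {v}" for k, v in profile.items())
    return f"{base_prompt}\n\nKnown about this user: {facts}"
```

Injecting only a handful of relevant facts (rather than the full profile) keeps the memory's token cost bounded as the profile grows.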

