A few years ago, the classic system design interview had a pretty predictable format. Design Twitter's feed. Design a URL shortener. Show you understand load balancing, caching, database sharding.
That content isn't going away, but at any company seriously building with AI — and honestly, that's most companies now — a new type of question has crept in. "Design a document Q&A system." "Build an AI support agent that can hand off to humans." "How would you serve an LLM to 50,000 concurrent users while keeping costs under control?"
These questions look similar on the surface, but they require a genuinely different knowledge base. And most candidates are walking in with the classical playbook and getting caught flat-footed.
Here's what you actually need to know.
What makes AI system design different (and harder)
The main thing is: the systems you're designing are probabilistic. That breaks a lot of assumptions you're used to making — the same input no longer produces the same output, correctness becomes a matter of degree rather than pass/fail, and testing shifts from deterministic unit tests to statistical evaluation over many samples.
If an interviewer asks you to "design a document Q&A system" and you just draw boxes and arrows without mentioning any of this, they'll know you haven't built one of these in production.
The five areas you need to know
The RAG pipeline in detail
Since RAG questions come up constantly, it's worth understanding the full data flow. On the indexing side (done offline): ingest documents, split them into chunks, embed each chunk, and store the embeddings in a vector database alongside metadata. On the query side (done at request time): embed the incoming query, retrieve the top-k most similar chunks, assemble them into a prompt, and generate the answer with the LLM.
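The two sides of that pipeline can be sketched end to end. This is a toy illustration, not a production implementation: the hashed bag-of-words `embed` function and in-memory `index` list are stand-ins for a real embedding model and vector database, chosen only so the flow is visible and runnable.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Hashed bag-of-words 'embedding' -- a placeholder for a real model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(doc: str, size: int = 40) -> list[str]:
    """Fixed-size word chunks; real systems often split on semantic boundaries."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# --- Indexing side (offline): chunk documents, embed, store ---
documents = ["Refunds are processed within 5 business days of approval. Contact support to start a refund."]
index = []  # (embedding, chunk_text) pairs -- the vector DB stand-in
for doc in documents:
    for c in chunk(doc):
        index.append((embed(c), c))

# --- Query side (request time): embed query, retrieve top-k, build prompt ---
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    scored = sorted(index, key=lambda e: -sum(a * b for a, b in zip(q, e[0])))
    return [text for _, text in scored[:k]]

question = "How long do refunds take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` would then go to the LLM for generation.
```

In an interview, being able to point at each stage of this flow — and say where chunking, embedding drift, or retrieval can go wrong — is exactly what the follow-up questions probe.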
The failure modes interviewers love to dig into: what happens when the relevant answer is split across two chunks? What if the embedding model's distribution drifts from your documents? What if the top-k retrieved chunks are all subtly wrong? If you can talk through these confidently, you stand out.
Hallucination in RAG systems often happens not when retrieval fails entirely, but when it returns something plausible-but-wrong. The LLM synthesizes confidently from bad context. Detecting this in production is a whole engineering problem — and a great thing to bring up proactively.
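One concrete shape this detection can take is a faithfulness check: score how well the answer is supported by the retrieved context, and flag low scores. The lexical-overlap heuristic below is deliberately crude and purely illustrative — production systems typically use an NLI model or an LLM judge — but the structure of the guardrail is the same.

```python
def support_score(answer: str, context: str) -> float:
    """Fraction of answer content words that appear in the retrieved context.
    A crude lexical proxy for faithfulness; real systems use NLI models
    or an LLM judge, but the shape of the check is identical."""
    stop = {"the", "a", "an", "is", "are", "of", "to", "in", "and"}
    answer_words = {w.strip(".,").lower() for w in answer.split()} - stop
    context_words = {w.strip(".,").lower() for w in context.split()} - stop
    if not answer_words:
        return 1.0
    return len(answer_words & context_words) / len(answer_words)

context = "Refunds are processed within 5 business days of approval."
grounded = "Refunds are processed within 5 business days."
hallucinated = "Refunds are instant and include a 10% bonus credit."

# The grounded answer scores high; the confident fabrication scores low.
# Answers below a threshold get flagged for review or regenerated.
```

Bringing up a check like this unprompted — "here's how I'd catch plausible-but-wrong answers before users see them" — is one of the clearest production-experience signals you can send.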
How to structure your answer
Same basic shape as classical system design, with additions. Here's the arc:
Clarify requirements first (5 minutes). Don't assume. Ask: How many users? What's the query volume? What does "correct" mean here — is hallucination ever acceptable? What's the latency target? What's the data freshness requirement? Is cost a hard constraint?
A real-time customer-facing chat system is a completely different design from an internal nightly batch report generator, even if both use an LLM.
High-level architecture (5–8 minutes). Draw the major components and data flows. For most AI systems: client/API layer, orchestration layer, LLM inference, retrieval layer (if RAG), storage (vector DB + document store + metadata DB + cache), and observability/eval infrastructure.
Name your choices with a sentence of reasoning. "I'd use a managed API rather than self-hosting because at this scale the engineering overhead isn't justified — though I'd flag the data privacy consideration."
Deep dive on what matters (10–15 minutes). The interviewer usually steers you toward what they care about. If they don't, go deep on the riskiest components — typically retrieval quality, latency, or evaluation.
Tradeoffs and bottlenecks (5 minutes). Be honest about the weak points in your design. "The biggest fragility here is retrieval quality — if chunking boundaries are wrong, we'll miss relevant content. I'd mitigate that with an eval harness tracking retrieval recall on a golden set."
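The "eval harness tracking retrieval recall on a golden set" can be sketched in a few lines. The golden set and stub retriever here are hypothetical; the only real requirement is a retrieval function that returns a ranked list of chunk ids.

```python
def recall_at_k(golden_set, retrieve, k=5):
    """Mean recall@k over a golden set of (query, relevant_ids) pairs."""
    scores = []
    for query, relevant_ids in golden_set:
        retrieved = set(retrieve(query)[:k])
        scores.append(len(retrieved & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores)

# Hypothetical golden set and a stub retriever, for illustration only.
golden = [
    ("how long do refunds take", {"chunk-12"}),
    ("cancel my subscription", {"chunk-3", "chunk-7"}),
]
stub = {"how long do refunds take": ["chunk-12", "chunk-9"],
        "cancel my subscription": ["chunk-7", "chunk-1"]}

score = recall_at_k(golden, lambda q: stub[q], k=2)
# Query 1 recovers 1/1 relevant chunks, query 2 recovers 1/2 -> mean 0.75.
```

Running this on every chunking or embedding change turns "retrieval quality" from a hand-wave into a tracked regression metric, which is exactly the mitigation the design answer above promises.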
Most interviewers care more about how you reason about tradeoffs than which specific choices you make. A confident "I would choose X over Y because of Z, even though it trades away W" is a much stronger signal than picking the optimal option without explanation.
What separates the candidates who built this from those who read about it
| Candidate who read about it | Candidate who built it |
|---|---|
| "Use RAG for grounding" | Explains why retrieval helps, names specific failure modes, describes how they would detect them |
| Names components without tradeoffs | "Managed API gives us simplicity but creates vendor lock-in and data residency issues — here is how I would think about that" |
| Generic eval plan: "we would test it" | "We would track faithfulness and context recall via RAGAS offline, and use thumbs-up/down plus LLM-as-judge sampling in production" |
| Does not mention cost | Brings up prompt caching, context window management, and model tiering as first-class engineering concerns |
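The "model tiering" entry in that last row is worth making concrete. A minimal router — with made-up model names and prices, since the specifics vary by provider — sends cheap, simple queries to a small model and escalates the rest:

```python
# Illustrative model-tiering router. Tier names and per-token costs are
# invented; the point is the shape: cheap-first routing with escalation.
TIERS = {
    "small": {"cost_per_1k_tokens": 0.0002},
    "large": {"cost_per_1k_tokens": 0.0060},
}

def pick_tier(query: str) -> str:
    """Heuristic routing: short factual queries go to the cheap model.
    Real routers often use a trained classifier or the small model's own
    confidence signal, but the cost structure is the same."""
    hard_signals = ("why", "compare", "explain", "step by step")
    if len(query.split()) > 30 or any(s in query.lower() for s in hard_signals):
        return "large"
    return "small"
```

At a 30x cost gap between tiers, routing even half of traffic to the small model dominates most other cost optimizations — which is why interviewers treat it as a first-class design concern rather than an afterthought.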
The good news: you can close a lot of this gap with good preparation. Go deep enough that you can reason about the why behind each component, not just its name. If you cannot explain in plain language why KV caching reduces inference costs, or why hybrid search outperforms pure vector search for certain queries, the follow-up questions will expose you.
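The hybrid search point is a good example of "knowing the why." A common way to combine keyword and vector results is reciprocal rank fusion (RRF); the sketch below uses hypothetical document ids and pre-computed rankings to show the mechanism:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge ranked result lists (e.g. BM25 keyword
    results and vector-similarity results) by summing 1/(k + rank) for each
    document across lists. Exact-match terms like product codes or error
    strings rank well lexically even when their embeddings carry little
    signal -- the core reason hybrid beats pure vector search on such queries."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc-A", "doc-C", "doc-B"]   # e.g. from BM25
vector_results  = ["doc-B", "doc-A", "doc-D"]   # e.g. from cosine similarity
fused = rrf([keyword_results, vector_results])
# doc-A ranks high in both lists, so it tops the fused ranking.
```

Being able to name the fusion method and explain the failure mode it fixes is exactly the depth that survives follow-up questions.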
A 4-week plan if you're starting from scratch
One last thing
AI system design is still a relatively new interview category. Most candidates haven't done dedicated preparation for it — they're hoping general system design knowledge transfers. Some of it does. But the candidates who've internalized the AI-specific pieces — the eval infrastructure, the retrieval quality concerns, the inference cost tradeoffs — stand out pretty dramatically.
The gap is closeable. The knowledge is learnable. And the category is only growing.
The Rubduck AI system design learning guide covers RAG, LLM evaluation, prompt engineering, and AI agents with interview framing throughout. The question bank has 25+ AI interview questions with detailed answers — good practice before you're doing it live.