Why This Is Asked
Multi-agent systems represent the frontier of applied AI engineering. They're also notoriously difficult to make reliable. Interviewers use this question to see if you can reason about emergent complexity — how individual agents' behaviors compose (or fail to compose) into a coherent system.
Key Concepts to Cover
- Orchestrator-subagent pattern — one agent coordinates many specialized agents
- Agent specialization — why specialized agents outperform generalist agents
- Communication protocols — how agents pass information and tasks
- Parallelization — running independent subtasks in parallel
- Error handling — what happens when a subagent fails
- Context sharing — how agents share state without redundancy
- Avoiding agent loops — preventing circular dependencies
- Observability — tracing execution across multiple agents
How to Approach This
1. Why Multi-Agent?
A single agent trying to write, review, test, and document code:
- Has a limited context window (can't hold all code in memory)
- Makes errors harder to isolate (which part of the task failed?)
- Can't parallelize subtasks
- Can't specialize — "jack of all trades, master of none"
Specialized agents each excel at their domain. The system is more capable than any single agent.
2. System Architecture: Orchestrator + Specialists
User Request
|
Orchestrator Agent
|-- Planner: break request into subtasks
|-- Assigns to specialists:
| |-- Architect Agent: design system, propose file structure
| |-- Coder Agent: implement code for each module
| |-- Reviewer Agent: review code for bugs and style
| |-- Test Agent: write and run tests
| |-- Doc Agent: generate documentation
|-- Synthesizer: combine outputs into final deliverable
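The architecture above can be sketched in a few lines. This is a minimal illustration, not a specific framework API — the agent names and the `run_agent` stub are assumptions standing in for real LLM calls:

```python
# Minimal sketch of the orchestrator-specialist pattern. run_agent is a
# stand-in for a model call; a real system would invoke an LLM here.

def run_agent(name, task):
    # Placeholder specialist: a real agent would call a model with a
    # role-specific prompt and return its output.
    return f"{name} completed: {task}"

def orchestrate(request):
    # Planner step: break the request into specialist subtasks.
    plan = [
        ("architect", f"design modules for: {request}"),
        ("coder", f"implement modules for: {request}"),
        ("reviewer", f"review code for: {request}"),
        ("tester", f"write tests for: {request}"),
        ("docs", f"document: {request}"),
    ]
    # Dispatch each subtask to its specialist and collect results.
    results = {name: run_agent(name, task) for name, task in plan}
    # Synthesizer step: combine specialist outputs into one deliverable.
    return "\n".join(results.values())
```

Note that the orchestrator owns the plan and the synthesis; the specialists never talk to each other directly in this variant.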
3. Communication Patterns
Message passing: Agents communicate through a shared message bus or queue. Each agent reads tasks addressed to it and writes results to a shared store.
Shared state: A central state object (JSON document) that all agents read from and write to. Each agent updates the portion of state it owns.
Direct handoff: Orchestrator directly calls each subagent in sequence or in parallel, passing context explicitly. Simplest to reason about.
For most production systems, start with direct handoff — it's the easiest to debug and trace.
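A direct-handoff orchestrator can be as simple as threading a context object through a sequence of function calls. The agent stubs below are illustrative assumptions; the point is that every handoff is an explicit, traceable call:

```python
# Direct-handoff sketch: the orchestrator calls each specialist in turn,
# passing context explicitly. Each agent reads what it needs and writes
# its own result back into the context.

def architect(context):
    context["design"] = f"two modules for {context['request']}"
    return context

def coder(context):
    context["code"] = f"code implementing {context['design']}"
    return context

def reviewer(context):
    context["review"] = f"approved: {context['code']}"
    return context

def handle(request):
    context = {"request": request}
    # Because every handoff is a plain function call, a stack trace or a
    # log line shows exactly which agent ran with which context.
    for agent in (architect, coder, reviewer):
        context = agent(context)
    return context
```

This is why direct handoff is the easiest to debug: the control flow is visible in the code, not hidden in a message bus.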
4. Parallelization
Independent tasks can run in parallel:
Orchestrator receives: "Build a user authentication module"
Sequential:
Architect -> Coder -> Reviewer -> Tester -> Docs (slow)
Parallel where possible:
Architect
|
Coder (backend) -- parallel -- Coder (frontend)
| |
Reviewer ---------- parallel -- Reviewer
| |
(merge results)
|
Tester (integration tests)
|
Doc Agent
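The fan-out/fan-in shape above maps naturally onto async concurrency. A minimal sketch, assuming the agents are async functions (the sleep stands in for LLM latency):

```python
import asyncio

# Fan-out/fan-in sketch: independent coder agents run concurrently,
# then their results merge before the integration-test step.

async def coder(component):
    await asyncio.sleep(0.01)  # stand-in for a slow model call
    return f"{component} implemented"

async def build(request):
    # Fan out: backend and frontend coders run in parallel.
    backend, frontend = await asyncio.gather(
        coder("backend"), coder("frontend")
    )
    # Fan in: the merged result feeds the integration tester.
    return f"integration-tested: {backend} + {frontend}"

result = asyncio.run(build("user auth module"))
```

With two parallel branches the wall-clock time is roughly one model call instead of two; the savings grow with the number of independent subtasks.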
5. Error Handling
When a subagent fails:
- Retry: Re-run the subagent with the same input (handles transient failures)
- Fallback: The orchestrator tries a different approach for the failed subtask
- Human escalation: If retries fail, surface the failure to a human with context
- Graceful degradation: Complete the task without the failing component where possible
Crucially, a failure in one subagent must not cascade into a failure of the whole system — the orchestrator is the containment boundary.
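The retry → fallback → escalate ladder can be sketched as a small wrapper. The exception types and names here are illustrative assumptions, not a library API:

```python
# Sketch of the retry -> fallback -> escalate ladder around a subagent call.
# Any exception from the agent stands in for a subagent failure.

class EscalateToHuman(Exception):
    """Raised when all automated recovery options are exhausted."""

def run_with_recovery(agent, task, fallback=None, max_retries=2):
    # Retry: re-run with the same input to absorb transient failures.
    for attempt in range(max_retries + 1):
        try:
            return agent(task)
        except Exception:
            continue
    # Fallback: try a different approach for the same subtask.
    if fallback is not None:
        try:
            return fallback(task)
        except Exception:
            pass
    # Escalation: surface the failure with context instead of crashing
    # the whole pipeline.
    raise EscalateToHuman(f"agent failed on task: {task!r}")
```

Graceful degradation lives one level up: the orchestrator catches `EscalateToHuman` and decides whether the deliverable is still useful without that component.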
6. Avoiding Agent Loops
Circular dependencies (Agent A calls Agent B which calls Agent A) can cause infinite loops:
- For static workflows: define a clear DAG (directed acyclic graph) of agent dependencies up front. If the call graph is pre-defined, cycles are structurally impossible.
- For dynamic systems: when the orchestrator decides at runtime which agents to call, a pre-defined DAG cannot always be enforced. Instead, track the full call graph per request and detect when any agent is called with the same input a second time (cycle detection), and cap call depth (e.g., no agent may be invoked more than 3 times per session).
- Require progress: each agent call should move state forward — if state hasn't changed between calls, detect and break the cycle.
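For the dynamic case, the runtime guards can be combined into one small object. A minimal sketch under assumed names (`CallGuard`, `LoopDetected` are illustrative):

```python
# Runtime loop guards for dynamic agent calls: a per-request call log that
# enforces a per-agent call budget and detects repeated (agent, input) pairs.

class LoopDetected(Exception):
    pass

class CallGuard:
    def __init__(self, max_calls_per_agent=3):
        self.max_calls = max_calls_per_agent
        self.counts = {}   # agent name -> number of invocations
        self.seen = set()  # (agent, input) pairs already executed

    def check(self, agent_name, task):
        key = (agent_name, task)
        if key in self.seen:
            # Same agent, same input: no new information, likely a cycle.
            raise LoopDetected(f"repeat call: {agent_name}({task!r})")
        self.seen.add(key)
        self.counts[agent_name] = self.counts.get(agent_name, 0) + 1
        if self.counts[agent_name] > self.max_calls:
            raise LoopDetected(f"{agent_name} exceeded its call budget")
```

The orchestrator calls `check` before every dispatch; a raised `LoopDetected` routes to the error-handling ladder from section 5.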
7. Observability
With multiple agents, debugging is hard without tracing:
- Assign a unique trace ID to each top-level request
- Log every agent invocation with: agent name, input, output, duration, cost
- Build a timeline view of which agents ran, in what order, with what results
- Alert on: runaway chains (too many agent calls), excessive cost per request, agent failure rates
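A bare-bones tracer illustrates the first three points. The span fields are assumptions modeled loosely on distributed-tracing conventions, not a specific tracing library:

```python
import time
import uuid

# Per-request tracing sketch: each top-level request gets a unique trace ID,
# and every agent invocation is recorded with its input, output, and timing.

class Tracer:
    def __init__(self):
        self.trace_id = str(uuid.uuid4())
        self.spans = []

    def record(self, agent_name, agent_fn, task):
        start = time.monotonic()
        output = agent_fn(task)
        self.spans.append({
            "trace_id": self.trace_id,
            "agent": agent_name,
            "input": task,
            "output": output,
            "duration_s": time.monotonic() - start,
        })
        return output

    def timeline(self):
        # Ordered view: which agents ran, in what order, with what results.
        return [(s["agent"], s["output"]) for s in self.spans]
```

Alerting then becomes a query over spans: count of calls per trace (runaway chains), summed cost, and failure rates per agent.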
Common Follow-ups
- "How do you prevent agents from duplicating work?" Use a shared task registry: when an agent picks up a subtask, it marks it as "in progress." Other agents check this before starting similar work. Alternatively, the orchestrator explicitly assigns non-overlapping tasks and maintains ownership.
- "How do you evaluate a multi-agent system?" Evaluate end-to-end (did the final output meet requirements?) and per-agent (is each agent making good decisions?). Monitor: task completion rate, agent error rate, total cost and latency per request, human escalation rate. Replay production failures in a sandbox to understand failure chains.
- "What's the difference between your system and a simple workflow/pipeline?" A workflow is pre-defined — step A always leads to step B. A multi-agent system is dynamic — the orchestrator decides at runtime which agents to call, in what order, based on intermediate results. The power of the multi-agent approach is this adaptability. The risk is unpredictability — use multi-agent only when the task genuinely requires dynamic adaptation.