Advanced · 5 min read

Design a Multi-Agent System for Software Development

Design a multi-agent system where specialized agents collaborate on software development — covering orchestration, communication, coordination, and failure modes.

Prep for the full interview loop

Know the concepts. Now prove it. Practice GenAI, Coding, System Design, and AI/ML Design interviews with an AI that tells you exactly where you fell short.

Start a mock interview

Why This Is Asked

Multi-agent systems represent the frontier of applied AI engineering. They're also notoriously difficult to make reliable. Interviewers use this question to see if you can reason about emergent complexity — how individual agents' behaviors compose (or fail to compose) into a coherent system.

Key Concepts to Cover

  • Orchestrator-subagent pattern — one agent coordinates many specialized agents
  • Agent specialization — why specialized agents outperform generalist agents
  • Communication protocols — how agents pass information and tasks
  • Parallelization — running independent subtasks in parallel
  • Error handling — what happens when a subagent fails
  • Context sharing — how agents share state without redundancy
  • Avoiding agent loops — preventing circular dependencies
  • Observability — tracing execution across multiple agents

How to Approach This

1. Why Multi-Agent?

A single agent trying to write, review, test, and document code:

  • Has a limited context window (can't hold all code in memory)
  • Makes errors harder to isolate (which part of the task failed?)
  • Can't parallelize subtasks
  • Can't specialize — "jack of all trades, master of none"

Specialized agents each excel at their domain. The system is more capable than any single agent.

2. System Architecture: Orchestrator + Specialists

User Request
      |
Orchestrator Agent
  |-- Planner: break request into subtasks
  |-- Assigns to specialists:
  |     |-- Architect Agent: design system, propose file structure
  |     |-- Coder Agent: implement code for each module
  |     |-- Reviewer Agent: review code for bugs and style
  |     |-- Test Agent: write and run tests
  |     |-- Doc Agent: generate documentation
  |-- Synthesizer: combine outputs into final deliverable
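The orchestrator-plus-specialists structure above can be sketched in a few lines of Python. The specialist functions here are stubs standing in for LLM-backed agents, and the fixed three-step plan is illustrative; a real planner would derive the plan from the request.

```python
from typing import Callable

# Specialist agents, stubbed as plain functions standing in for LLM calls.
def architect(task: str) -> str:
    return f"design for: {task}"

def coder(task: str) -> str:
    return f"code for: {task}"

def reviewer(task: str) -> str:
    return f"review of: {task}"

SPECIALISTS: dict[str, Callable[[str], str]] = {
    "architect": architect,
    "coder": coder,
    "reviewer": reviewer,
}

def orchestrate(request: str) -> str:
    # Planner: break the request into (agent, subtask) pairs.
    plan = [("architect", request), ("coder", request), ("reviewer", request)]
    # Dispatch each subtask to its specialist.
    results = [SPECIALISTS[agent](subtask) for agent, subtask in plan]
    # Synthesizer: combine the outputs into one deliverable.
    return "\n".join(results)
```

The point of the sketch is the shape: planning, dispatch, and synthesis live in one place, which is what makes the orchestrator pattern easy to trace.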

3. Communication Patterns

Message passing: Agents communicate through a shared message bus or queue. Each agent reads tasks addressed to it and writes results to a shared store.

Shared state: A central state object (JSON document) that all agents read from and write to. Each agent updates the portion of state it owns.

Direct handoff: Orchestrator directly calls each subagent in sequence or in parallel, passing context explicitly. Simplest to reason about.

For most production systems, start with direct handoff — it's the easiest to debug and trace.
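For contrast with direct handoff, the message-passing pattern can be sketched with an in-process queue: agents pull tasks addressed to them from a shared bus and write results to a shared store. The bus, store, and agent names are illustrative; a production system would use a durable queue and persistent storage.

```python
import queue

bus: queue.Queue = queue.Queue()   # shared message bus of (agent, task) pairs
results: dict[str, str] = {}       # shared result store, keyed by task

def coder_agent(task: str) -> str:
    return f"code for {task}"      # stub standing in for an LLM-backed agent

AGENTS = {"coder": coder_agent}

def dispatch(agent: str, task: str) -> None:
    bus.put((agent, task))         # address a task to a specific agent

def run_until_empty() -> None:
    while not bus.empty():
        agent, task = bus.get()
        results[task] = AGENTS[agent](task)  # route to the addressed agent

dispatch("coder", "login endpoint")
run_until_empty()
```

Note how much indirection this adds relative to a direct function call; that indirection is exactly what makes message passing harder to debug.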

4. Parallelization

Independent tasks can run in parallel:

Orchestrator receives: "Build a user authentication module"

Sequential:
Architect -> Coder -> Reviewer -> Tester -> Docs (slow)

Parallel where possible:
Architect
    |
Coder (backend) -- parallel -- Coder (frontend)
    |                               |
Reviewer ---------- parallel -- Reviewer
    |                               |
(merge results)
    |
Tester (integration tests)
    |
Doc Agent
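The fan-out above maps naturally onto asyncio: the independent coding subtasks run concurrently, then their results merge before the sequential integration steps. The coroutines are stubs, and the sleep stands in for a slow LLM call.

```python
import asyncio

async def code_module(name: str) -> str:
    await asyncio.sleep(0.01)      # stands in for a slow LLM-backed agent call
    return f"{name} module"

async def build(request: str) -> str:
    design = f"design for {request}"
    # Independent subtasks fan out in parallel...
    backend, frontend = await asyncio.gather(
        code_module("backend"), code_module("frontend")
    )
    # ...then merge for the sequential steps (testing, docs).
    return f"{design}: {backend} + {frontend}"

result = asyncio.run(build("user auth"))
```

With real LLM calls taking seconds each, this is where most of the latency win in a multi-agent system comes from.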

5. Error Handling

When a subagent fails:

  • Retry: Re-run the subagent with the same input (handles transient failures)
  • Fallback: The orchestrator tries a different approach for the failed subtask
  • Human escalation: If retries fail, surface the failure to a human with context
  • Graceful degradation: Complete the task without the failing component where possible

Crucially, a single subagent's failure must not cascade into a failure of the whole system; recovery is the orchestrator's job.
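The first three strategies compose into a retry-then-fallback-then-escalate wrapper. This is a sketch assuming agents are plain callables that raise on failure; the retry count and the blanket exception handling are illustrative.

```python
def call_with_recovery(agent, fallback, task, max_retries=2):
    # Retry: re-run the subagent with the same input (transient failures).
    for _ in range(max_retries + 1):
        try:
            return agent(task)
        except Exception:
            continue
    # Fallback: try a different approach for the failed subtask.
    try:
        return fallback(task)
    except Exception:
        # Human escalation: retries and fallback exhausted, surface with context.
        raise RuntimeError(f"escalate to human: {task!r} failed")
```

In practice the orchestrator would also log each attempt (see the observability section) so the escalation carries the full failure chain.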

6. Avoiding Agent Loops

Circular dependencies (Agent A calls Agent B which calls Agent A) can cause infinite loops:

  • For static workflows: Define a clear DAG (directed acyclic graph) of agent dependencies up front. If the call graph is pre-defined, cycles are structurally impossible.
  • For dynamic systems: When the orchestrator decides at runtime which agents to call, a DAG cannot be enforced in advance. Instead, track the full call graph per request and detect when any agent is called with the same input a second time (cycle detection), cap call depth (e.g., no agent runs more than 3 times per session), and require each call to advance measurable state, breaking the chain when state hasn't changed.
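The runtime guards for dynamic systems can be sketched as a small per-request object that tracks call counts and repeated inputs. The limits and error messages are illustrative, not recommendations.

```python
from collections import Counter

class CallGuard:
    """Per-request guard: depth limit plus repeated-input cycle detection."""

    def __init__(self, max_calls_per_agent: int = 3):
        self.max_calls = max_calls_per_agent
        self.calls = Counter()   # how many times each agent has run
        self.seen = set()        # (agent, input) pairs already dispatched

    def check(self, agent: str, task: str) -> None:
        # Same agent, same input: no state has advanced, so this is a cycle.
        if (agent, task) in self.seen:
            raise RuntimeError(f"cycle: {agent} re-called with identical input")
        # Hard cap on how many times any one agent can run per request.
        if self.calls[agent] >= self.max_calls:
            raise RuntimeError(f"depth limit: {agent} exceeded {self.max_calls} calls")
        self.seen.add((agent, task))
        self.calls[agent] += 1
```

The orchestrator calls `check` before every dispatch; a raised error becomes an escalation rather than an infinite loop.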

7. Observability

With multiple agents, debugging is hard without tracing:

  • Assign a unique trace ID to each top-level request
  • Log every agent invocation with: agent name, input, output, duration, cost
  • Build a timeline view of which agents ran, in what order, with what results
  • Alert on: runaway chains (too many agent calls), excessive cost per request, agent failure rates
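A minimal version of this tracing is a decorator that records each invocation under the request's trace ID. The field names here are illustrative; a production system would more likely emit OpenTelemetry spans and add per-call cost.

```python
import time
import uuid

def traced(trace_id: str, log: list):
    """Wrap an agent function so every invocation is logged with the trace ID."""
    def wrap(agent_fn):
        def inner(task: str) -> str:
            start = time.monotonic()
            output = agent_fn(task)
            log.append({
                "trace_id": trace_id,        # ties this call to the request
                "agent": agent_fn.__name__,
                "input": task,
                "output": output,
                "duration_s": time.monotonic() - start,
            })
            return output
        return inner
    return wrap

log: list = []
trace_id = str(uuid.uuid4())

@traced(trace_id, log)
def coder(task: str) -> str:
    return f"code for {task}"    # stub standing in for an LLM-backed agent

coder("auth module")
```

Sorting the log by time gives exactly the timeline view described above, and counting entries per trace ID is the basis for runaway-chain alerts.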

Common Follow-ups

  1. "How do you prevent agents from duplicating work?" Use a shared task registry: when an agent picks up a subtask, it marks it as "in progress." Other agents check this before starting similar work. Alternatively, the orchestrator explicitly assigns non-overlapping tasks and maintains ownership.
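The shared task registry can be as small as a locked dictionary: an agent claims a subtask before starting, and duplicate claims are rejected. This in-process sketch stands in for what would be a database row or lease in a distributed deployment.

```python
import threading

class TaskRegistry:
    """Shared registry: agents must claim a subtask before working on it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status: dict[str, str] = {}

    def claim(self, task_id: str) -> bool:
        with self._lock:
            if task_id in self._status:     # already in progress or done
                return False
            self._status[task_id] = "in_progress"
            return True

    def complete(self, task_id: str) -> None:
        with self._lock:
            self._status[task_id] = "done"
```

The lock makes claims atomic across threads; the same check-then-set semantics would be implemented with a conditional write in a shared database.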

  2. "How do you evaluate a multi-agent system?" Evaluate end-to-end (did the final output meet requirements?) and per-agent (is each agent making good decisions?). Monitor: task completion rate, agent error rate, total cost and latency per request, human escalation rate. Replay production failures in a sandbox to understand failure chains.

  3. "What's the difference between your system and a simple workflow/pipeline?" A workflow is pre-defined — step A always leads to step B. A multi-agent system is dynamic — the orchestrator decides at runtime which agents to call, in what order, based on intermediate results. The power of the multi-agent approach is this adaptability. The risk is unpredictability — use multi-agent only when the task genuinely requires dynamic adaptation.
