Why This Is Asked
Interviewers use this question to test your ability to design a complex AI system that integrates with existing engineering workflows. They want to see if you can reason about: how to represent code for LLMs, how to handle the latency constraints of PR review, how to evaluate AI-generated comments for quality, and how to build feedback loops that improve the system over time.
Key Concepts to Cover
- Code representation — how to prepare code diffs for LLM context windows
- Context limits — handling large PRs that exceed token limits
- Comment generation — structured output for actionable review comments
- Latency requirements — async vs. real-time review pipelines
- Quality evaluation — how to measure if AI comments are helpful
- Feedback loops — using developer reactions to improve the model
- Integration — GitHub/GitLab webhooks, PR comment APIs
- Cost management — controlling LLM API costs at scale
How to Approach This
1. Clarify Requirements
Start by asking:
- What types of review? (bugs, style, security, performance, all?)
- What scale? (10 PRs/day vs. 10,000 PRs/day changes the architecture)
- Latency requirements? (blocking the PR merge vs. async background comments)
- What languages/frameworks? (affects context and prompting)
- Human-in-the-loop or fully automated comments?
2. High-Level Architecture
```
GitHub Webhook → Queue → Diff Processor → Context Builder → LLM Service → Comment Poster
                                                                               ↓
                                                     Feedback Collector → Fine-tuning Pipeline
```
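The first two stages (webhook → queue) can be sketched as follows. GitHub signs webhook payloads with an HMAC-SHA256 digest in the `X-Hub-Signature-256` header; the handler below verifies that signature before enqueuing. The in-process `Queue` and the exact payload fields pulled out are illustrative — a production system would use a durable queue.

```python
import hashlib
import hmac
import json
from queue import Queue

review_queue: Queue = Queue()

def verify_signature(payload: bytes, signature_header: str, secret: bytes) -> bool:
    """Check GitHub's X-Hub-Signature-256 header against the webhook secret."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

def handle_webhook(payload: bytes, signature_header: str, secret: bytes) -> bool:
    """Validate the webhook, then enqueue the PR for asynchronous review."""
    if not verify_signature(payload, signature_header, secret):
        return False  # reject forged or corrupted deliveries
    event = json.loads(payload)
    # Review on PR open and on new commits pushed to an open PR.
    if event.get("action") in ("opened", "synchronize"):
        review_queue.put({
            "repo": event["repository"]["full_name"],
            "pr_number": event["pull_request"]["number"],
            "head_sha": event["pull_request"]["head"]["sha"],
        })
    return True
```

Verifying the signature before parsing keeps forged payloads from ever reaching the review pipeline.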
3. Deep Dive: Handling Large Diffs
Large real-world PRs often exceed an LLM's context window. Strategies:
- File-level batching: process each changed file independently
- Chunk with overlap: split large files into overlapping chunks
- Prioritization: rank files by change size, complexity, or risk; review the top N
- Summary pass: first pass to summarize the PR, second pass for detailed review of flagged areas
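The chunk-with-overlap strategy above can be sketched as a short helper. The chunk size and overlap values are illustrative tuning knobs, not recommendations; the overlap exists so a hunk near a chunk boundary still gets reviewed with some surrounding context.

```python
def chunk_with_overlap(lines: list[str], chunk_size: int = 200,
                       overlap: int = 20) -> list[list[str]]:
    """Split a file's lines into overlapping chunks so no change is
    reviewed without the lines immediately around it."""
    if len(lines) <= chunk_size:
        return [lines]
    chunks = []
    step = chunk_size - overlap  # advance by less than a full chunk
    for start in range(0, len(lines), step):
        chunks.append(lines[start:start + chunk_size])
        if start + chunk_size >= len(lines):
            break  # final chunk already reaches end of file
    return chunks
```

Each chunk can then be reviewed independently, at the cost of the model occasionally commenting twice on lines inside the overlap region (deduplicate by file and line before posting).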
4. Deep Dive: Prompt Design
Structure prompts to produce actionable, structured output:
```
You are a senior engineer reviewing a pull request.
Given this code diff, identify issues and suggest improvements.
Output as JSON:
{
  "comments": [
    {
      "file": "string",
      "line": number,
      "severity": "bug" | "suggestion" | "nitpick",
      "comment": "string",
      "suggestion": "string (optional)"
    }
  ]
}

Code diff:
{diff}
```
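Because the model's output feeds directly into the PR comment API, it is worth validating it between the LLM service and the comment poster. This hypothetical parser enforces the schema above and drops malformed entries rather than posting garbage to the PR:

```python
import json

ALLOWED_SEVERITIES = {"bug", "suggestion", "nitpick"}

def parse_review_output(raw: str) -> list[dict]:
    """Parse the model's JSON and keep only comments matching the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # model produced non-JSON; post nothing
    valid = []
    for c in data.get("comments", []):
        if (isinstance(c.get("file"), str)
                and isinstance(c.get("line"), int)
                and c.get("severity") in ALLOWED_SEVERITIES
                and isinstance(c.get("comment"), str)):
            valid.append(c)
    return valid
```

Failing closed here (posting nothing on a parse error) is usually the right default: a missed review cycle is cheaper than a nonsense comment that erodes developer trust.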
5. Evaluation and Feedback Loop
This is what separates strong candidates. Discuss:
- Track thumbs up/down on AI comments
- Monitor comment acceptance rate (did the developer act on it?)
- Use accepted comments as positive training examples
- Use dismissed comments as negative examples
- Build an eval set of PRs with known issues for regression testing
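The acceptance-rate tracking above can be sketched as a small aggregation, assuming each developer reaction is logged as an event dict (the field names and reaction labels here are hypothetical):

```python
from collections import Counter

def comment_metrics(events: list[dict]) -> dict:
    """Aggregate developer reactions to AI comments into summary metrics.
    Each event looks like:
        {"comment_id": ..., "reaction": "accepted" | "dismissed" | "ignored"}
    """
    counts = Counter(e["reaction"] for e in events)
    total = sum(counts.values())
    return {
        "total": total,
        "acceptance_rate": counts["accepted"] / total if total else 0.0,
        "dismissal_rate": counts["dismissed"] / total if total else 0.0,
    }
```

Tracking these rates per severity level and per repository, not just globally, makes it much easier to see where the reviewer is helping and where it is noise.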
6. Cost and Latency
- Cache review results for identical diffs (content hash)
- Use smaller/faster models for initial triage, larger models for complex files
- Process async — don't block PR creation, post comments within minutes
- Rate limit per repository to control costs
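The content-hash cache from the first bullet can be sketched as follows; the class and method names are illustrative, and a real deployment would back this with Redis or similar rather than an in-memory dict:

```python
import hashlib

class ReviewCache:
    """Cache review results keyed by a hash of the diff content, so an
    unchanged diff (e.g. a rebase with no code changes) skips the LLM call."""

    def __init__(self) -> None:
        self._store: dict[str, list[dict]] = {}

    @staticmethod
    def key(diff: str) -> str:
        return hashlib.sha256(diff.encode()).hexdigest()

    def get(self, diff: str):
        """Return cached comments for this exact diff, or None on a miss."""
        return self._store.get(self.key(diff))

    def put(self, diff: str, comments: list[dict]) -> None:
        self._store[self.key(diff)] = comments
```

Hashing the diff content rather than the commit SHA means a force-push that leaves the code identical still hits the cache.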
Common Follow-ups
- "How would you handle false positives — AI comments that are wrong or unhelpful?" Discuss confidence thresholds, filtering comments below a quality score, requiring human approval for high-severity findings, and surfacing uncertainty in the comment text itself.
- "How do you keep the system up to date as coding practices evolve?" Cover periodic retraining on recent accepted comments, prompt updates as new patterns emerge, and A/B testing new prompt versions against a held-out eval set.
- "What security concerns do you have about sending code to an external LLM API?" Address data privacy (on-prem models for sensitive code), API data retention policies, secrets scanning before sending diffs, and redacting sensitive values from diffs.