
How Do You Handle Multi-Hop and Multifaceted Queries in a RAG System?

Single-shot retrieval breaks down for complex questions that require reasoning across multiple documents. Walk through strategies to handle multi-hop and multifaceted queries.

Prep for the full interview loop

Know the concepts. Now prove it. Practice GenAI, Coding, System Design, and AI/ML Design interviews with an AI that tells you exactly where you fell short.

Start a mock interview

Why This Is Asked

Basic RAG handles simple lookup queries well. Interviewers use multi-hop questions to probe whether you understand the limits of single-shot retrieval and can design more sophisticated pipelines for production use cases.

Key Concepts to Cover

  • Multi-hop queries — require chaining information from multiple sources
  • Multifaceted queries — have multiple distinct sub-questions in one query
  • Query decomposition — breaking a complex query into simpler sub-queries
  • Iterative retrieval — using intermediate answers to inform subsequent retrievals
  • Fusion strategies — combining results from multiple retrievals

How to Approach This

1. Define the Problem Types

Multi-hop: "Who is the CEO of the company that acquired Figma?"

  • Step 1: retrieve which company acquired Figma (Adobe)
  • Step 2: use that answer to retrieve Adobe's CEO
  • A single retrieval with the original query will likely fail — no chunk answers both hops

Multifaceted: "Compare the pricing, feature set, and latency of GPT-4o vs. Claude 3.5 Sonnet"

  • Three distinct sub-questions bundled into one
  • A single retrieval produces one ranked list, typically biased toward one facet; the answer needs coverage across all three dimensions

2. Query Decomposition

Use an LLM to break the original query into independent sub-queries before retrieval:

Original: "What are the pros and cons of HNSW vs IVF indexing for a 100M vector corpus?"
Sub-queries:
  1. "How does HNSW indexing work?"
  2. "How does IVF indexing work?"
  3. "Memory and latency tradeoffs of HNSW at scale"
  4. "When to use IVF vs HNSW for large corpora"

Run each sub-query independently, then pass all retrieved chunks to the LLM for final synthesis.

Tradeoff: 4x retrieval calls = 4x latency and cost. Cache sub-query results aggressively.
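The decomposition pipeline can be sketched as follows. This is a minimal illustration, not a production implementation: `decompose` and `retrieve` are stubs standing in for a prompted LLM call and a vector-store search, and their outputs here are canned.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(query: str) -> list[str]:
    """Ask an LLM to split a complex query into independent sub-queries.
    Stubbed: a real version would be a prompted LLM call."""
    return [
        "How does HNSW indexing work?",
        "How does IVF indexing work?",
    ]

def retrieve(sub_query: str, k: int = 5) -> list[str]:
    """Vector-store lookup for one sub-query (stubbed)."""
    return [f"chunk about: {sub_query}"]

def decomposed_retrieval(query: str) -> list[str]:
    sub_queries = decompose(query)
    # Sub-queries are independent, so run them in parallel:
    # cost is N retrievals, but wall-clock latency stays near one call.
    with ThreadPoolExecutor() as pool:
        results = pool.map(retrieve, sub_queries)
    # Flatten and deduplicate while preserving order.
    seen: set[str] = set()
    chunks: list[str] = []
    for result in results:
        for chunk in result:
            if chunk not in seen:
                seen.add(chunk)
                chunks.append(chunk)
    return chunks  # pass these to the LLM for final synthesis
```

The deduplicated chunk list is what you hand to the synthesis prompt along with the original query.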

3. Iterative / Sequential Retrieval

For true multi-hop questions where sub-query 2 depends on the answer to sub-query 1:

  1. Retrieve and partially answer the first hop
  2. Use the intermediate answer to construct the next retrieval query
  3. Repeat until all hops are resolved
  4. Synthesize the final answer

This is similar to the ReAct pattern applied to retrieval — the system reasons and retrieves in alternating steps.

When to use: only when sub-queries have strict dependencies. Otherwise, parallel decomposition is faster.
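The loop above can be sketched as a ReAct-style controller. Everything here is a scripted stand-in: `fake_llm` mimics a model deciding whether to retrieve again or answer, and `retrieve` is a two-document fake corpus matching the Figma example from earlier.

```python
def retrieve(query: str) -> str:
    """Stubbed vector search over a two-document corpus."""
    docs = {
        "Who acquired Figma?": "Adobe agreed to acquire Figma in 2022.",
        "Who is the CEO of Adobe?": "Shantanu Narayen is Adobe's CEO.",
    }
    return docs.get(query, "")

def fake_llm(question: str, evidence: list[str]) -> dict:
    """Decide the next action: another retrieval, or a final answer.
    Scripted here; a real version is an LLM call over the evidence so far."""
    if not any("Adobe" in e for e in evidence):
        return {"action": "retrieve", "query": "Who acquired Figma?"}
    if not any("Narayen" in e for e in evidence):
        return {"action": "retrieve", "query": "Who is the CEO of Adobe?"}
    return {"action": "answer", "text": "Shantanu Narayen"}

def iterative_rag(question: str, max_hops: int = 4) -> str:
    evidence: list[str] = []
    for _ in range(max_hops):          # cap hops to bound latency and cost
        step = fake_llm(question, evidence)
        if step["action"] == "answer":
            return step["text"]
        evidence.append(retrieve(step["query"]))
    return "unresolved"                # fell off the hop budget
```

The hop cap matters in production: without it, a confused model can loop on retrievals indefinitely.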

4. Multi-Query Retrieval (Simpler Variant)

For queries that are merely ambiguous (not truly multi-hop), generate 3-5 paraphrases of the original query, run each independently, and union or re-rank the results:

Query: "Why is my RAG system slow?"
Variants:
  - "RAG latency bottlenecks"
  - "slow retrieval augmented generation causes"
  - "optimize RAG pipeline performance"

Union the top-k from each, deduplicate, re-rank. This captures vocabulary variation without the overhead of full decomposition.
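A minimal sketch of the union-and-rerank step, using the variants from the example above. `retrieve` is stubbed with canned `(doc_id, score)` results; deduplication keeps each document's best score across variants.

```python
def retrieve(query: str, k: int = 3) -> list[tuple[str, float]]:
    """Return (doc_id, similarity) pairs; stubbed with canned results."""
    canned = {
        "RAG latency bottlenecks": [("doc_a", 0.90), ("doc_b", 0.70)],
        "slow retrieval augmented generation causes": [("doc_b", 0.80), ("doc_c", 0.60)],
        "optimize RAG pipeline performance": [("doc_a", 0.85), ("doc_d", 0.50)],
    }
    return canned.get(query, [])[:k]

def multi_query(variants: list[str], k: int = 3) -> list[str]:
    # Union the top-k from each variant; dedupe by keeping the best score.
    best: dict[str, float] = {}
    for v in variants:
        for doc_id, score in retrieve(v, k):
            best[doc_id] = max(best.get(doc_id, 0.0), score)
    # Re-rank the union by best score seen across variants.
    return sorted(best, key=best.get, reverse=True)
```

Note that raw similarity scores are only comparable here because they come from the same embedding model; mixing retrievers calls for rank-based fusion instead (next section).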

5. Fusion Strategies

When combining results from multiple sub-queries:

  • Reciprocal Rank Fusion (RRF): score each chunk based on its rank across multiple result sets; robust and parameter-free
  • Score fusion: weighted average of similarity scores across queries; requires score normalization
  • LLM-as-reranker: pass all candidates to an LLM with the original query; ask it to select the most relevant; most accurate but expensive
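RRF is simple enough to show in full. Each document scores the sum of 1/(k + rank) over every result list it appears in, with 1-based ranks; k = 60 is the commonly used constant.

```python
def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, RRF needs no score normalization, which is exactly why it is robust when fusing result sets from different retrievers or sub-queries.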

6. Map-Reduce for Multifaceted Queries

For the comparison query pattern:

  1. Map: answer each dimension independently ("pricing", "features", "latency") with its own retrieval + LLM call
  2. Reduce: combine all partial answers into a unified comparison

Advantage: each sub-answer is grounded in relevant docs; no single context window overload.
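The map-reduce shape, sketched with a stub: `answer_dimension` stands in for a per-dimension retrieval + LLM call, and the reduce step here is a plain join where production code would make a final merging LLM call.

```python
from concurrent.futures import ThreadPoolExecutor

def answer_dimension(item: tuple[str, str]) -> tuple[str, str]:
    """One map task: retrieve for a single dimension and answer it (stubbed)."""
    query, dimension = item
    # Real version: retrieve chunks for f"{query} {dimension}", then call the LLM.
    return dimension, f"[grounded answer about {dimension}]"

def map_reduce_compare(query: str, dimensions: list[str]) -> str:
    # Map: one independent retrieval + answer per dimension, in parallel.
    with ThreadPoolExecutor() as pool:
        partials = dict(pool.map(answer_dimension, [(query, d) for d in dimensions]))
    # Reduce: merge partial answers; in production, a final LLM call does this.
    return "\n".join(f"{d}: {partials[d]}" for d in dimensions)
```

Each map task only sees the chunks for its own dimension, which is what keeps any single context window from being overloaded.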

Common Follow-ups

  1. "How do you decide when to decompose vs. just retrieve more chunks?" Retrieve more chunks first (increase k) — it's simpler and often sufficient for mildly multifaceted queries. Use decomposition when the query spans truly distinct topics that would otherwise compete for context space, or when sub-queries have dependencies.

  2. "How do you prevent the LLM from hallucinating connections between hops?" Instruct the model to quote from retrieved chunks at each hop, not infer. Use structured chain-of-thought prompting that explicitly separates "what I found" from "what I conclude." Each hop should be grounded.

  3. "What's the performance cost of iterative retrieval?" Each hop adds one embedding + vector search + LLM call — roughly 200-500ms per hop at typical production latencies. For 3-hop queries, expect 600ms-1.5s added latency. Cache intermediate results; use smaller/faster models for intermediate hops.

