Intermediate2 min read

Optimize a Slow Reporting Query on a 10-Billion-Row Table

A dashboard query scanning a 10-billion-row events table takes 8 minutes to run. Walk through how you'd diagnose and fix it.

Prep for the full interview loop

Know the concepts. Now prove it. Practice GenAI, Coding, System Design, and AI/ML Design interviews with an AI that tells you exactly where you fell short.

Start a mock interview

Why This Is Asked

Query optimization is a core data engineering skill. This question tests whether you can reason about execution plans, storage layout, and query rewrites — and explain your thinking clearly.

Key Concepts to Cover

  • Partition pruning — query should filter on the partition key to skip irrelevant partitions
  • Clustering/sorting — co-locating related rows reduces bytes scanned
  • Execution plan reading — identify full table scans, hash joins, sort operations
  • Query rewrites — avoid SELECT *, push filters early, replace correlated subqueries
  • Materialized views / pre-aggregation — for expensive repeated aggregations
  • Approximate aggregations — HyperLogLog for cardinality, sampling for trends

How to Approach This

1. Diagnose First

Before optimizing, look at the execution plan:

  • Is it doing a full scan when it should be partition-pruning?
  • Are there implicit type casts preventing index use?
  • Is the join order causing a large intermediate result set?

2. Partitioning Fix

If the table is partitioned by event_date and the query doesn't filter on it:

-- Bad: scans all partitions
SELECT user_id, COUNT(*) FROM events WHERE event_type = 'purchase'

-- Good: partition pruning
SELECT user_id, COUNT(*) FROM events
WHERE event_date >= '2026-01-01' AND event_type = 'purchase'

3. Pre-Aggregation

If the same query runs 1000 times/day, materialize it:

-- Materialized view refreshed nightly
CREATE MATERIALIZED VIEW daily_purchase_counts AS
SELECT event_date, user_id, COUNT(*) as purchase_count
FROM events WHERE event_type = 'purchase'
GROUP BY event_date, user_id;

Common Follow-ups

  1. "The query uses a window function over the whole table — how do you speed that up?" Partition the window function by a high-cardinality column (user_id), push a date filter before the window.

  2. "What if you can't change the query — only the table structure?" Add a clustering key on the most common filter columns; create a pre-aggregated summary table.

Related Questions

Prep for the full interview loop

Know the concepts. Now prove it. Practice GenAI, Coding, System Design, and AI/ML Design interviews with an AI that tells you exactly where you fell short.

Start a mock interview