Why This Is Asked
Query optimization is a core data engineering skill. This question tests whether you can reason about execution plans, storage layout, and query rewrites — and explain your thinking clearly.
Key Concepts to Cover
- Partition pruning — query should filter on the partition key to skip irrelevant partitions
- Clustering/sorting — co-locating related rows reduces bytes scanned
- Execution plan reading — identify full table scans, hash joins, sort operations
- Query rewrites — avoid
SELECT *, push filters early, replace correlated subqueries - Materialized views / pre-aggregation — for expensive repeated aggregations
- Approximate aggregations — HyperLogLog for cardinality, sampling for trends
How to Approach This
1. Diagnose First
Before optimizing, look at the execution plan:
- Is it doing a full scan when it should be partition-pruning?
- Are there implicit type casts preventing index use?
- Is the join order causing a large intermediate result set?
2. Partitioning Fix
If the table is partitioned by event_date and the query doesn't filter on it:
-- Bad: scans all partitions
SELECT user_id, COUNT(*) FROM events WHERE event_type = 'purchase'
-- Good: partition pruning
SELECT user_id, COUNT(*) FROM events
WHERE event_date >= '2026-01-01' AND event_type = 'purchase'
3. Pre-Aggregation
If the same query runs 1000 times/day, materialize it:
-- Materialized view refreshed nightly
CREATE MATERIALIZED VIEW daily_purchase_counts AS
SELECT event_date, user_id, COUNT(*) as purchase_count
FROM events WHERE event_type = 'purchase'
GROUP BY event_date, user_id;
Common Follow-ups
-
"The query uses a window function over the whole table — how do you speed that up?" Partition the window function by a high-cardinality column (user_id), push a date filter before the window.
-
"What if you can't change the query — only the table structure?" Add a clustering key on the most common filter columns; create a pre-aggregated summary table.