Intermediate · 2 min read

Design a Batch ETL Platform for a Data Warehouse

Design a batch ETL system that ingests data from 50+ source systems nightly, transforms it into a clean analytics layer, and surfaces data quality issues before dashboards are updated.

Why This Is Asked

Batch ETL is the workhorse of most data warehouses. This question tests your ability to build reliable, maintainable pipelines at scale, including the operability details (safe re-runs, backfills, quality gates, alerting) that separate senior engineers from juniors.

Key Concepts to Cover

  • Orchestration — Airflow or Prefect DAGs; dependency management between jobs
  • Idempotency — every job must be safe to re-run without duplicating data
  • Backfill strategy — how do you reprocess 90 days of data when logic changes?
  • Data quality gates — reject or quarantine bad data before it reaches analysts
  • Incremental vs full refresh — when to use each and the trade-offs
  • dbt integration — transformation layer on top of raw ingested data

How to Approach This

1. Clarify Requirements

  • How many source systems? What protocols? (REST APIs, databases, files on S3)
  • What's the freshness SLA? (nightly, hourly, near-real-time)
  • What's the downstream contract? (analysts expect data by 8am)
  • What happens on failure? (silent failure vs alert vs block dashboard)

2. Pipeline Stages

Sources → Extraction jobs (Airbyte/custom) → Raw layer (S3/GCS)
        → Load into warehouse (Snowflake/BigQuery) → dbt staging models → dbt marts
        → Data quality checks → Dashboard refresh
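
The stage ordering above is a dependency graph, and the key behavior is that a failed stage stops everything downstream of it. Below is a minimal sketch of that idea using only the Python standard library; in practice an orchestrator such as Airflow or Prefect plays this role. The stage names and the `runners` mapping are hypothetical, chosen for illustration.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (hypothetical names).
PIPELINE = {
    "extract": set(),
    "raw_layer": {"extract"},
    "dbt_staging": {"raw_layer"},
    "dbt_marts": {"dbt_staging"},
    "quality_checks": {"dbt_marts"},
    "dashboard_refresh": {"quality_checks"},
}


def run_pipeline(stages, runners):
    """Run stages in dependency order and stop at the first failure.

    `runners` maps a stage name to a zero-argument callable returning
    True on success. Returns (completed_stages, failed_stage_or_None).
    """
    completed = []
    for stage in TopologicalSorter(stages).static_order():
        if not runners[stage]():
            # Nothing downstream runs, so dashboards keep the last good data.
            return completed, stage
        completed.append(stage)
    return completed, None
```

If `quality_checks` returns False, `dashboard_refresh` is never invoked, which is the "block dashboard" failure mode from the requirements step.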

3. Idempotency Pattern

  • Use INSERT OVERWRITE (partition-based) or MERGE statements; a plain INSERT appends duplicate rows on every re-run
  • Partition raw data by ingestion_date — reprocessing a date replaces only that partition
  • Store watermarks or last-processed cursors in a control table
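
The three points above can be sketched together. This example uses SQLite from the Python standard library as a stand-in for the warehouse, and emulates a partition overwrite with a delete-then-insert inside one transaction. The table names (`raw_orders`, `etl_watermarks`), columns, and job name are assumptions made for illustration, not part of any particular platform.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a raw table and a control table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, amount REAL, ingestion_date TEXT)"
    )
    conn.execute(
        "CREATE TABLE etl_watermarks (job_name TEXT PRIMARY KEY, last_processed TEXT)"
    )
    return conn


def load_partition(conn, ingestion_date, rows, job_name="raw_orders"):
    """Replace one ingestion_date partition, then advance the watermark.

    `rows` is a list of (order_id, amount) tuples. Dates are ISO strings
    (YYYY-MM-DD), so string comparison matches chronological order.
    """
    with conn:  # single transaction: commits on success, rolls back on error
        # Emulated partition overwrite: clear the partition, then reload it.
        conn.execute(
            "DELETE FROM raw_orders WHERE ingestion_date = ?", (ingestion_date,)
        )
        conn.executemany(
            "INSERT INTO raw_orders (order_id, amount, ingestion_date) VALUES (?, ?, ?)",
            [(order_id, amount, ingestion_date) for order_id, amount in rows],
        )
        current = conn.execute(
            "SELECT last_processed FROM etl_watermarks WHERE job_name = ?",
            (job_name,),
        ).fetchone()
        # Only move the watermark forward, so a backfill of an old date
        # does not rewind it.
        if current is None or ingestion_date > current[0]:
            conn.execute(
                "INSERT OR REPLACE INTO etl_watermarks (job_name, last_processed) "
                "VALUES (?, ?)",
                (job_name, ingestion_date),
            )
```

Calling `load_partition` twice for the same date leaves exactly one copy of that day's rows, which is what makes retries and backfills safe.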

4. Data Quality Gates

  • Row count checks: today's load within N% of yesterday's
  • Null rate checks on non-nullable fields
  • Referential integrity: foreign keys resolve
  • On failure: quarantine to a _bad table, alert on-call, block downstream refresh
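
A rough sketch of these gates in plain Python follows. The field names (`order_id`, `customer_id`), the 20% default tolerance, and the shape of the result dictionary are all assumptions for illustration; teams often express the same checks as dbt tests or with a data quality framework instead.

```python
def row_count_within(today, yesterday, tolerance_pct):
    """True if today's row count is within tolerance_pct of yesterday's."""
    if yesterday == 0:
        # Assumption: with no baseline, only an empty load counts as "unchanged".
        return today == 0
    return abs(today - yesterday) / yesterday * 100 <= tolerance_pct


def null_rate(rows, field):
    """Fraction of rows where `field` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(field) is None) / len(rows)


def split_orphans(rows, fk_field, valid_keys):
    """Separate rows whose foreign key resolves from rows whose key does not."""
    good = [row for row in rows if row.get(fk_field) in valid_keys]
    orphans = [row for row in rows if row.get(fk_field) not in valid_keys]
    return good, orphans


def run_gate(rows, yesterday_count, valid_customer_ids, tolerance_pct=20.0):
    """Run all checks; the caller blocks the refresh when `passed` is False."""
    failures = []
    if not row_count_within(len(rows), yesterday_count, tolerance_pct):
        failures.append("row_count")
    if null_rate(rows, "order_id") > 0:
        failures.append("null_rate:order_id")
    good, quarantined = split_orphans(rows, "customer_id", valid_customer_ids)
    if quarantined:
        failures.append("referential_integrity:customer_id")
    return {
        "passed": not failures,
        "failures": failures,
        "good": good,
        "quarantined": quarantined,  # rows destined for the _bad table
    }
```

The orchestrator would write `quarantined` to the `_bad` table, page on-call when `failures` is non-empty, and skip the dashboard refresh unless `passed` is True.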

Common Follow-ups

  1. "How do you handle a source system that sends the same record twice?" Deduplicate on a natural key in the raw-to-staging transform, keeping the most recent version of each record.

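  2. "What happens when a dbt model logic change requires reprocessing 3 months of data?" Run a backfill. dbt run --select model --full-refresh rebuilds an incremental model from scratch; for tables too large to rebuild, re-run only the affected date partitions, which is safe because the transforms are partition-scoped and idempotent.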

  3. "How do you prioritize which pipelines to run first?" Airflow DAG dependencies + SLA-based priority queues; critical path analysis to ensure high-value dashboards land on time.
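
The first two follow-ups lend themselves to a short sketch: deduplication on a natural key, and enumerating the daily partitions a backfill needs to re-run. The function names and the idea of a `version_field` (for example a source `updated_at` value) are assumptions made for illustration; in a warehouse the dedup step would more likely be a window function in the staging model.

```python
from datetime import date, timedelta


def dedupe_latest(rows, key_fields, version_field):
    """Keep one row per natural key: the one with the highest version_field.

    On an exact tie, the later occurrence in `rows` wins.
    """
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in latest or row[version_field] >= latest[key][version_field]:
            latest[key] = row
    return list(latest.values())


def backfill_partitions(end_date, days):
    """Return the `days` daily partitions ending at end_date, oldest first."""
    return [end_date - timedelta(days=offset) for offset in range(days - 1, -1, -1)]
```

A backfill then loops over `backfill_partitions(...)` and re-runs the same idempotent job for each date, so a failure partway through can be resumed without creating duplicates.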
