Design a batch ETL system that ingests data from 50+ source systems nightly, transforms it into a clean analytics layer, and surfaces data quality issues before dashboards are updated.

Learn how to design a batch ETL platform. Covers orchestration with Airflow, transformations with dbt, idempotency, backfill strategy, data quality, and monitoring.

Design a Batch ETL Platform - Data Engineering Interview

Why This Is Asked

Batch ETL is the workhorse of most data warehouses. This question tests your ability to build reliable, maintainable pipelines at scale — including the operability details that separate senior engineers from juniors.

Key Concepts to Cover

Orchestration — Airflow or Prefect DAGs; dependency management between jobs
Idempotency — every job must be safe to re-run without duplicating data
Backfill strategy — how do you reprocess 90 days of data when logic changes?
Data quality gates — reject or quarantine bad data before it reaches analysts
Incremental vs full refresh — when to use each and the trade-offs
dbt integration — transformation layer on top of raw ingested data
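As a rough illustration of the backfill idea, the sketch below enumerates daily partitions for a lookback window and re-runs a job once per partition. The function names and the `process_partition` callback are hypothetical, and the approach assumes each per-partition job is idempotent, so a retry of a failed date is always safe.

```python
from datetime import date, timedelta


def backfill_partitions(end_date, days):
    """Return the `days` daily partitions ending at end_date, oldest first."""
    start = end_date - timedelta(days=days - 1)
    return [start + timedelta(days=i) for i in range(days)]


def backfill(end_date, days, process_partition):
    """Re-run an idempotent job for each partition and collect failures.

    `process_partition` is a hypothetical callable that reprocesses exactly
    one date. A failure on one date does not stop the remaining dates.
    """
    failed = []
    for partition in backfill_partitions(end_date, days):
        try:
            process_partition(partition)
        except Exception:
            # Record the date so it can be retried on its own later.
            failed.append(partition)
    return failed
```

Processing one date at a time keeps each unit of work small and retryable, at the cost of more job invocations than a single full refresh.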

How to Approach This

1. Clarify Requirements

How many source systems? What protocols? (REST APIs, databases, files on S3)
What's the freshness SLA? (nightly, hourly, near-real-time)
What's the downstream contract? (analysts expect data by 8am)
What happens on failure? (silent failure vs alert vs block dashboard)

2. Pipeline Stages

Sources → Extraction jobs (Airbyte/custom) → Raw layer (S3/GCS)
        → Load into warehouse (Snowflake/BigQuery) → dbt staging models → dbt marts
        → Data quality checks → Dashboard refresh
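The dependency ordering in this flow can be sketched without any orchestrator, using Python's standard-library `graphlib`. This is only a toy model of what Airflow does for you: the stage names and the `tasks` mapping are hypothetical, and a real scheduler would add retries, parallelism, and alerting.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names mirroring the flow above. Each key maps to the
# set of stages that must finish before it can start.
PIPELINE = {
    "extract": set(),
    "load_raw": {"extract"},
    "dbt_staging": {"load_raw"},
    "dbt_marts": {"dbt_staging"},
    "quality_checks": {"dbt_marts"},
    "dashboard_refresh": {"quality_checks"},
}


def run_order(dependencies):
    """Return stage names in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def run_pipeline(dependencies, tasks):
    """Run stages in dependency order and stop at the first failure.

    `tasks` maps a stage name to a zero-argument callable that returns True
    on success. Returns (completed_stages, failed_stage_or_None).
    """
    completed = []
    for stage in run_order(dependencies):
        if not tasks[stage]():
            # Nothing downstream of a failed stage runs, so a failed
            # quality check never reaches the dashboard refresh.
            return completed, stage
        completed.append(stage)
    return completed, None
```

Because the dashboard refresh depends on the quality checks, a failed check blocks the refresh by construction rather than by convention.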

3. Idempotency Pattern

Use INSERT OVERWRITE (partition-based) or MERGE statements, never a plain INSERT, which appends duplicate rows on every re-run
Partition raw data by ingestion_date, so that reprocessing a date replaces only that partition
Store watermarks or last-processed cursors in a control table, so each run knows where the previous one stopped
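A minimal sketch of the partition-replace pattern, using SQLite as a stand-in for the warehouse. SQLite has no INSERT OVERWRITE, so the sketch emulates it with a DELETE plus INSERT inside one transaction. The table names `raw_orders` and `etl_control` and the job name are hypothetical.

```python
import sqlite3

JOB_NAME = "raw_orders_load"  # hypothetical job identifier


def init_schema(conn):
    """Create the demo data table and the control (watermark) table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders ("
        "ingestion_date TEXT, order_id TEXT, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_control ("
        "job_name TEXT PRIMARY KEY, last_processed_date TEXT)"
    )


def load_partition(conn, ingestion_date, rows):
    """Replace one date partition and advance the watermark atomically.

    `rows` is an iterable of (order_id, amount) tuples. Running this twice
    for the same date leaves the table in the same state as running it once.
    """
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "DELETE FROM raw_orders WHERE ingestion_date = ?", (ingestion_date,)
        )
        conn.executemany(
            "INSERT INTO raw_orders (ingestion_date, order_id, amount) "
            "VALUES (?, ?, ?)",
            [(ingestion_date, order_id, amount) for order_id, amount in rows],
        )
        row = conn.execute(
            "SELECT last_processed_date FROM etl_control WHERE job_name = ?",
            (JOB_NAME,),
        ).fetchone()
        # ISO dates compare correctly as strings; a backfill of an older
        # date must not move the watermark backwards.
        if row is None or row[0] < ingestion_date:
            conn.execute(
                "INSERT OR REPLACE INTO etl_control "
                "(job_name, last_processed_date) VALUES (?, ?)",
                (JOB_NAME, ingestion_date),
            )
```

Keeping the delete, the insert, and the watermark update in a single transaction means a crash mid-load leaves the previous partition contents and the previous watermark intact.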

4. Data Quality Gates

Row count checks: today's load within N% of yesterday's
Null rate checks on non-nullable fields
Referential integrity: every foreign key resolves to a row in its parent table
On failure: quarantine the failing rows to a _bad table, alert on-call, and block the downstream dashboard refresh
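The three checks above could be sketched as plain Python functions over a batch of rows. The field names (`order_id`, `customer_id`), the 20% default tolerance, and the failure labels are illustrative assumptions; in practice these checks usually run as SQL or dbt tests inside the warehouse.

```python
def check_row_count(today_count, yesterday_count, tolerance_pct):
    """Pass if today's count is within tolerance_pct of yesterday's."""
    if yesterday_count == 0:
        return today_count == 0
    change_pct = abs(today_count - yesterday_count) / yesterday_count * 100
    return change_pct <= tolerance_pct


def null_rate(rows, field):
    """Fraction of rows where `field` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(field) is None) / len(rows)


def orphaned_keys(rows, fk_field, valid_keys):
    """Foreign key values that do not resolve to a known parent key."""
    return {row.get(fk_field) for row in rows if row.get(fk_field) not in valid_keys}


def run_quality_gate(rows, yesterday_count, valid_customer_ids, tolerance_pct=20.0):
    """Return a list of failed check names; an empty list means the gate passes."""
    failures = []
    if not check_row_count(len(rows), yesterday_count, tolerance_pct):
        failures.append("row_count")
    if null_rate(rows, "order_id") > 0.0:  # order_id treated as non-nullable
        failures.append("null_rate:order_id")
    if orphaned_keys(rows, "customer_id", valid_customer_ids):
        failures.append("referential_integrity:customer_id")
    return failures
```

A caller would treat a non-empty result as the signal to quarantine the batch, page on-call, and skip the dashboard refresh.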

Common Follow-ups

"How do you handle a source system that sends the same record twice?" Deduplicate on a natural key in the raw-to-staging transform, keeping one row per key.
"What happens when a dbt model logic change requires reprocessing 3 months of data?" Run a backfill with dbt run --select model --full-refresh, which rebuilds the model from scratch; for very large tables, reprocess one date partition at a time instead. Either way, partition-safe idempotent transforms ensure the rerun replaces data rather than duplicating it.
"How do you prioritize which pipelines to run first?" Combine Airflow DAG dependencies with SLA-based priority queues, and use critical path analysis to ensure high-value dashboards land on time.
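For the duplicate-record follow-up, a minimal sketch of deduplication on a natural key is shown below. It keeps the most recent version of each record; the field names in the usage note are hypothetical, and in a warehouse the same logic is typically written as a ROW_NUMBER() window in the staging model.

```python
def dedupe_latest(records, key_fields, version_field):
    """Keep one record per natural key: the one with the highest version_field.

    Exact duplicates (same key and same version) collapse to the first
    occurrence. Output order follows the first appearance of each key.
    """
    latest = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        current = latest.get(key)
        if current is None or record[version_field] > current[version_field]:
            latest[key] = record
    return list(latest.values())
```

For example, calling `dedupe_latest(records, ("source", "id"), "updated_at")` treats the pair of source system and source id as the natural key and uses an ISO-formatted `updated_at` timestamp to pick the surviving row.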

Related Questions

Design a Real-Time Event Ingestion Pipeline

Design an Analytics Schema for an E-Commerce Platform
