Why This Is Asked
Batch ETL is the workhorse of most data warehouses. This question tests your ability to build reliable, maintainable pipelines at scale — including the operability details that separate senior engineers from juniors.
Key Concepts to Cover
- Orchestration — Airflow or Prefect DAGs; dependency management between jobs
- Idempotency — every job must be safe to re-run without duplicating data
- Backfill strategy — how do you reprocess 90 days of data when logic changes?
- Data quality gates — reject or quarantine bad data before it reaches analysts
- Incremental vs full refresh — when to use each and the trade-offs
- dbt integration — transformation layer on top of raw ingested data
How to Approach This
1. Clarify Requirements
- How many source systems? What protocols? (REST APIs, databases, files on S3)
- What's the freshness SLA? (nightly, hourly, near-real-time)
- What's the downstream contract? (analysts expect data by 8am)
- What happens on failure? (silent failure vs alert vs block dashboard)
2. Pipeline Stages
Sources → Extraction jobs (Airbyte/custom) → Raw layer (S3/GCS)
→ dbt staging models → dbt marts → Warehouse (Snowflake/BigQuery)
→ Data quality checks → Dashboard refresh
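The stage ordering above can be sketched as a dependency graph. This is a minimal illustration using Python's standard-library graphlib rather than Airflow or Prefect; the stage names and the run_pipeline helper are hypothetical, and a real orchestrator adds retries, scheduling, and alerting on top.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names mirroring the flow above. Each key maps to the
# set of stages that must finish before it can start.
PIPELINE = {
    "extract": set(),
    "load_raw": {"extract"},
    "dbt_staging": {"load_raw"},
    "dbt_marts": {"dbt_staging"},
    "quality_checks": {"dbt_marts"},
    "dashboard_refresh": {"quality_checks"},
}


def run_order(graph):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def run_pipeline(graph, tasks):
    """Run stages in dependency order and stop at the first failure.

    `tasks` maps a stage name to a zero-argument callable returning True on
    success. Returns (completed_stages, failed_stage_or_None).
    """
    completed = []
    for stage in run_order(graph):
        if not tasks[stage]():
            # Nothing downstream of a failed stage is allowed to run.
            return completed, stage
        completed.append(stage)
    return completed, None
```

Stopping at the first failed stage is what keeps a failed quality check from triggering the dashboard refresh.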
3. Idempotency Pattern
- Use INSERT OVERWRITE (partition-based) or MERGE statements, never plain INSERT
- Partition raw data by ingestion_date, so reprocessing a date replaces only that partition
- Store watermarks or last-processed cursors in a control table
4. Data Quality Gates
- Row count checks: today's load within N% of yesterday's
- Null rate checks on non-nullable fields
- Referential integrity: foreign keys resolve
- On failure: quarantine to a _bad table, alert on-call, block downstream refresh
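The checks above might look roughly like this in plain Python. The field names order_id and customer_id, the 20% default tolerance, and the run_gates helper are illustrative assumptions. In practice these checks often live in dbt tests or a dedicated data quality tool.

```python
def row_count_ok(today, yesterday, tolerance_pct):
    """True if today's row count is within tolerance_pct of yesterday's."""
    if yesterday == 0:
        return today == 0
    return abs(today - yesterday) / yesterday * 100 <= tolerance_pct


def null_rate(rows, field):
    """Fraction of rows where `field` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) is None) / len(rows)


def split_orphans(rows, fk_field, valid_keys):
    """Separate rows whose foreign key resolves from rows whose key does not."""
    good, bad = [], []
    for r in rows:
        (good if r.get(fk_field) in valid_keys else bad).append(r)
    return good, bad


def run_gates(rows, yesterday_count, valid_customer_ids, tolerance_pct=20.0):
    """Run all gates and report which ones failed.

    The caller is expected to write `quarantined` to a _bad table, alert
    on-call, and skip the downstream refresh when `passed` is False.
    """
    failures = []
    if not row_count_ok(len(rows), yesterday_count, tolerance_pct):
        failures.append("row_count")
    if null_rate(rows, "order_id") > 0:
        failures.append("null_rate:order_id")
    good, quarantined = split_orphans(rows, "customer_id", valid_customer_ids)
    if quarantined:
        failures.append("referential_integrity:customer_id")
    return {
        "passed": not failures,
        "failures": failures,
        "good": good,
        "quarantined": quarantined,
    }
```

Returning the failed check names, not only a boolean, gives the on-call alert enough detail to act on without opening the pipeline logs.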
Common Follow-ups
- "How do you handle a source system that sends the same record twice?" Deduplication on a natural key in the raw-to-staging transform.
- "What happens when a dbt model logic change requires reprocessing 3 months of data?" Backfill run with dbt run --select model --full-refresh, partition-safe idempotent transforms.
- "How do you prioritize which pipelines to run first?" Airflow DAG dependencies + SLA-based priority queues; critical path analysis to ensure high-value dashboards land on time.
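For the duplicate-record follow-up, deduplication on a natural key can be sketched as below. The key field order_id and the version field updated_at are hypothetical. In a dbt staging model the same idea is usually written as a ROW_NUMBER() window over the natural key.

```python
def dedupe_latest(records, key_fields, version_field):
    """Keep one record per natural key, preferring the highest version value.

    On a tie the later-arriving record wins, so an exact duplicate collapses
    into a single row. Output follows the order in which keys were first seen.
    """
    latest = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        current = latest.get(key)
        if current is None or rec[version_field] >= current[version_field]:
            latest[key] = rec
    return list(latest.values())
```

Because the function is deterministic for a given input, re-running the raw-to-staging transform over the same raw partition yields the same staging rows.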