Why This Is Asked
Real-time event pipelines are the heartbeat of modern data platforms. This question tests whether you can reason about high-throughput ingestion, stream processing semantics, and the trade-offs between latency, cost, and reliability.
Key Concepts to Cover
- Ingestion layer — Kafka, Kinesis, or Pub/Sub as the durable event bus
- Stream processing — Flink, Spark Streaming, or Kafka Streams for transformation
- Exactly-once semantics — idempotent consumers, transactional writes
- Schema evolution — Avro/Protobuf with a schema registry
- Backpressure — what happens when consumers fall behind producers
- Monitoring — consumer lag, end-to-end latency SLAs
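Consumer lag, the last item above, is simple arithmetic: the log end offset minus the last committed offset, per partition. A minimal sketch, assuming the offsets have already been fetched from the broker; the function names, the offset values, and the alert threshold are illustrative, not part of any Kafka client API.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: log end offset minus last committed offset."""
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def should_alert(lag: Dict[int, int], max_total_lag: int) -> bool:
    """Alert when the summed lag across partitions exceeds the threshold."""
    return sum(lag.values()) > max_total_lag


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1_500, 1: 2_000, 2: 900}
committed = {0: 1_450, 1: 1_200, 2: 900}

lag = consumer_lag(end_offsets, committed)
print(lag)                                   # {0: 50, 1: 800, 2: 0}
print(should_alert(lag, max_total_lag=500))  # True
```

In an interview it is worth noting that lag in messages only becomes meaningful next to the consumption rate: 800 messages is trivial at 10,000 messages/second and serious at 1 message/second.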
How to Approach This
1. Clarify Requirements
- What event volume? (millions/second vs thousands/second changes architecture significantly)
- What latency SLA? (sub-second vs 30 seconds vs minutes)
- Who are the downstream consumers? (real-time dashboards, ML features, analytics queries)
- What delivery guarantees? (at-most-once, at-least-once, exactly-once)
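To show why the volume answer changes the design, here is a back-of-envelope sizing sketch. Every number in it is an assumption chosen for illustration, in particular the 10 MB/s per-partition throughput budget, the 500-byte average event, and the 24-hour retention; real values come from benchmarking your own cluster.

```python
import math


def size_pipeline(events_per_sec: int, avg_event_bytes: int,
                  per_partition_mb_per_sec: float,
                  retention_hours: int, replication_factor: int) -> dict:
    """Rough ingest bandwidth, partition count, and retained storage."""
    ingress_mb_per_sec = events_per_sec * avg_event_bytes / 1_000_000
    partitions = math.ceil(ingress_mb_per_sec / per_partition_mb_per_sec)
    retained_gb = (ingress_mb_per_sec * 3600 * retention_hours
                   * replication_factor / 1_000)
    return {
        "ingress_mb_per_sec": ingress_mb_per_sec,
        "partitions": partitions,
        "retained_gb": retained_gb,
    }


# Assumed workloads: the same event size at two very different volumes.
high = size_pipeline(1_000_000, 500, 10.0, retention_hours=24, replication_factor=3)
low = size_pipeline(5_000, 500, 10.0, retention_hours=24, replication_factor=3)

print(high)  # 500 MB/s ingress, 50 partitions, 129,600 GB retained
print(low)   # 2.5 MB/s ingress, 1 partition, 648 GB retained
```

Under these assumptions the high-volume case needs a multi-broker cluster with dozens of partitions, while the low-volume case fits on a single partition, which is the kind of difference the clarifying questions are meant to surface.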
2. High-Level Architecture
Producers → Kafka (event bus) → Stream Processor → Hot Store (Redis/DynamoDB)
                                                 → Cold Store (S3/GCS)
                                                 → OLAP (ClickHouse/BigQuery)
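The fan-out in the diagram can be sketched as a processor that writes each event to three sinks with different shapes. This is a toy model: plain Python containers stand in for Redis/DynamoDB, S3/GCS, and ClickHouse/BigQuery, and the event fields (user_id, type) are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


class InMemorySinks:
    """Stand-ins for the three stores in the diagram (assumption: no real I/O)."""

    def __init__(self) -> None:
        self.hot: Dict[str, Dict[str, Any]] = {}      # latest event per key
        self.cold: List[Dict[str, Any]] = []          # append-only raw archive
        self.olap: Dict[str, int] = defaultdict(int)  # counts per event type


def process(event: Dict[str, Any], sinks: InMemorySinks) -> None:
    """Fan one event out to all three sinks."""
    sinks.cold.append(event)            # keep every raw event for replay
    sinks.hot[event["user_id"]] = event # overwrite: serve the latest state
    sinks.olap[event["type"]] += 1      # aggregate for analytics queries


sinks = InMemorySinks()
for e in [
    {"user_id": "u1", "type": "click"},
    {"user_id": "u1", "type": "purchase"},
    {"user_id": "u2", "type": "click"},
]:
    process(e, sinks)

print(len(sinks.cold))        # 3
print(sinks.hot["u1"]["type"])  # purchase
print(dict(sinks.olap))       # {'click': 2, 'purchase': 1}
```

The point to make out loud is that each sink answers a different question: the hot store serves point lookups, the cold store enables replay and backfill, and the OLAP store serves aggregations.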
3. Exactly-Once Semantics
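- Producers enable idempotent writes (enable.idempotence in Kafka) so retries do not create duplicates, and attach a unique event ID to every event
- Kafka transactions make writes atomic across topics and partitions, and can include the consumer offset commit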
- Consumers checkpoint offsets only after successful downstream writes
- Deduplication at the sink using event IDs if exactly-once isn't end-to-end
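The last two points above combine into "at-least-once delivery plus an idempotent sink", which gives effectively-once results. A minimal simulation, assuming an in-memory list as the log and a dict as the sink; the class, the function, and the crash_before_commit_at parameter are hypothetical and only exist to simulate a failure between the write and the offset commit.

```python
from typing import Dict, List, Optional


class IdempotentSink:
    """Sink keyed by event_id, so replayed events are ignored."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}

    def write(self, event: dict) -> bool:
        """Return True if the event was new, False if it was a duplicate."""
        if event["event_id"] in self.rows:
            return False
        self.rows[event["event_id"]] = event
        return True


def consume(log: List[dict], start_offset: int, sink: IdempotentSink,
            crash_before_commit_at: Optional[int] = None) -> int:
    """Process the log from start_offset; return the committed offset."""
    committed = start_offset
    for offset in range(start_offset, len(log)):
        sink.write(log[offset])
        if offset == crash_before_commit_at:
            return committed        # simulated crash: written but not committed
        committed = offset + 1      # commit only after the sink write succeeded
    return committed


log = [{"event_id": f"e{i}", "value": i} for i in range(3)]
sink = IdempotentSink()

committed = consume(log, 0, sink, crash_before_commit_at=1)
print(committed)       # 1 (event e1 was written, but its offset was not committed)

committed = consume(log, committed, sink)  # restart replays e1
print(committed)       # 3
print(len(sink.rows))  # 3 (the replayed e1 was deduplicated)
```

Committing before the write would flip the failure mode: a crash would lose e1 instead of replaying it, which is at-most-once.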
4. Schema Evolution
- Avro or Protobuf schemas registered in a Schema Registry (Confluent)
- Backward-compatible changes: add optional fields
- Breaking changes: new topic + migration job
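- Consumer looks up the writer schema by the schema ID carried with each message (Confluent serializers prepend it to the payload rather than using a Kafka header)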
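A sketch of how a schema ID travels with a message and how an added optional field stays backward compatible. The 5-byte prefix (a zero magic byte followed by a 4-byte big-endian schema ID) follows the Confluent wire format; everything else is an assumption made to keep the example stdlib-only: the payload is JSON rather than Avro binary, and reader_defaults stands in for the defaults an Avro reader schema would supply.

```python
import json
import struct
from typing import Tuple

MAGIC_BYTE = 0  # first byte of the Confluent wire format


def encode(schema_id: int, record: dict) -> bytes:
    """Prefix the payload with the magic byte and a 4-byte schema ID."""
    header = struct.pack(">bI", MAGIC_BYTE, schema_id)
    return header + json.dumps(record).encode("utf-8")


def decode(message: bytes, reader_defaults: dict) -> Tuple[int, dict]:
    """Read the schema ID, then fill fields the writer did not know about."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    record = json.loads(message[5:].decode("utf-8"))
    return schema_id, {**reader_defaults, **record}


# The reader uses schema v2, which added an optional "country" field.
reader_defaults = {"country": None}

old_message = encode(1, {"user_id": "u1"})                   # written with v1
new_message = encode(2, {"user_id": "u2", "country": "DE"})  # written with v2

print(decode(old_message, reader_defaults))
print(decode(new_message, reader_defaults))
```

Because the new field has a default, a v2 reader can still consume v1 messages, which is what makes "add optional fields" a backward-compatible change.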
Common Follow-ups
- "What happens if your stream processor falls behind by 10 minutes?" Alert on consumer lag, auto-scale the processor, and route records that fail processing to a dead letter queue so they do not block the rest of the partition.
- "How do you handle late-arriving events?" Watermarks in Flink track event-time progress and bound how much lateness you tolerate; events that arrive behind the watermark go to a side output or a separate late-data table.
- "How do you test this pipeline?" Unit tests for transformation logic, integration tests with embedded Kafka, end-to-end latency regression tests in staging.