Apache Beam for Data Engineers: One Pipeline to Rule Batch and Streaming
Ever shipped a batch ETL on Spark, then got asked to make it real-time… and ended up maintaining two codebases? Apache Beam kills that duplication. You write one pipeline, pick a runner (Dataflow, Flink, Spark, Samza), and deploy it as batch or streaming with the same code and mental model.
Why Beam matters (fast context)
- Unifies batch + streaming. Same APIs (Python/Java/Go), same transforms.
- Portable across engines. Dataflow (managed), Flink (self-hosted), Spark (your cluster).
- First-class event time. Windows, triggers, watermarks, late data handling.
- Schema-aware, rich I/O. Kafka, Pub/Sub, BigQuery, JDBC, Elasticsearch, S3/GCS, and more.
- Stateful processing. Stateful `ParDo` + timers for complex streaming logic.
Core concepts (mental model)
- `PCollection` – your dataset: bounded for batch, unbounded for streaming.
- `PTransform` – operations (map/filter/aggregate/join). Built-ins like `ParDo`, `GroupByKey`, `Combine`, `Flatten`, `CoGroupByKey`.
- Windowing – slice unbounded data into finite panes: fixed, sliding, sessions.
- Triggers – decide when to emit partial/final results (on watermark, processing time, element count).
- Watermarks – the runner’s notion of event-time progress; enables late-data policy (`allowed_lateness` + accumulation modes).
- State & Timers – per-key durable state with processing/event-time timers for patterns like dedup, sessions, and SLO checks.
- Runners – execution backends (Dataflow/Flink/Spark). You write the pipeline once and choose the engine at launch.
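To make the unified model concrete, here is a minimal sketch (the source path and topic name are illustrative): the transform chain is written once and reused whether the source is bounded or unbounded.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def count_urls(events):
    """One transform chain, reused verbatim for batch and streaming."""
    return (
        events
        | beam.Map(lambda url: (url, 1))
        | beam.CombinePerKey(sum)
    )


with beam.Pipeline(options=PipelineOptions()) as p:
    # Batch: a bounded source (path is illustrative).
    urls = p | beam.io.ReadFromText("gs://my-bucket/urls.txt")
    counts = count_urls(urls)
    # Streaming: swap in an unbounded source, e.g.
    #   p | beam.io.ReadFromPubSub(topic="projects/my-project/topics/pageviews")
    # add a WindowInto before the combine, and count_urls() stays unchanged.
```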
Architecture at a glance
- Ingest: Kafka / Pub/Sub / Kinesis → `ReadFrom…` I/O
- Transform: `ParDo` (enrich), `WithKeys`, `GroupByKey`/`CombinePerKey` (aggregations), window + trigger
- Persist/Serve: BigQuery / Elasticsearch / JDBC / S3 / Cassandra (via connectors or custom sinks)
- Ops: Runner handles autoscaling, checkpoints, backpressure, watermarking, exactly-once (sink dependent)
Real example: streaming page views → sessionized metrics (Python SDK)
Goal: from Kafka events (`user_id`, `url`, `ts` in epoch milliseconds), compute per-URL page-view counts in 5-minute fixed windows, with both on-time and late updates.
```python
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)
from apache_beam.transforms.window import FixedWindows, TimestampedValue


class ParseEvent(beam.DoFn):
    def process(self, record):
        key, value = record  # from ReadFromKafka: (key_bytes, value_bytes)
        e = json.loads(value.decode("utf-8"))
        yield TimestampedValue(
            {"user_id": e["user_id"], "url": e["url"]},
            e["ts"] / 1_000.0,  # ms -> seconds of event time
        )


def to_kv(e):  # (url, 1)
    return (e["url"], 1)


def sum_counts(a, b):
    return a + b


opts = PipelineOptions(save_main_session=True)
opts.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=opts) as p:
    events = (
        p
        | ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka:9092",
                             "group.id": "beam"},
            topics=["pageviews"])
        | beam.ParDo(ParseEvent())
    )

    windowed = (
        events
        | "5mWindow" >> beam.WindowInto(
            FixedWindows(300),
            trigger=AfterWatermark(
                early=AfterProcessingTime(60),  # early pane every ~60s
                late=AfterCount(1)),            # re-fire on each late element
            allowed_lateness=60 * 10,           # accept 10 min of late data
            accumulation_mode=AccumulationMode.ACCUMULATING)
    )

    counts = (
        windowed
        | beam.Map(to_kv)
        | beam.CombinePerKey(sum_counts)  # per-URL counts per window
    )

    # Example sink: BigQuery. (The Python SDK has no built-in Elasticsearch
    # sink; for Elasticsearch, use the Java ElasticsearchIO through a
    # cross-language transform or a custom ParDo.)
    (
        counts
        | beam.Map(lambda kv: {"url": kv[0], "count": kv[1]})
        | beam.io.WriteToBigQuery(
            "my-project:analytics.pageview_counts",  # illustrative table
            schema="url:STRING,count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )
```
What this demonstrates
- Event-time correctness via `TimestampedValue`.
- Windowing + early results + late data.
- Accumulating mode so dashboards update as panes refine.
- Pluggable sink (swap in JDBC or a custom Elasticsearch writer with minimal code).
Windows, triggers, lateness — the 80/20 guide
- Fixed windows (tumbling): reporting, hourly/5m aggregations.
- Sliding windows: moving averages; more compute.
- Sessions: user activity bursts with gaps (e.g., 30 min inactivity ends a session).
- Triggers:
- On-time: when watermark passes window end (best-effort completeness).
- Early: periodic interim results (dashboards feel “live”).
- Late: accept stragglers; combine with dedup if needed.
- Accumulation:
- Accumulating: each firing includes previous counts (friendly for charts).
- Discarding: each firing is incremental (harder to consume, cheaper).
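A compact sketch tying these knobs together, here with session windows (the gap, trigger intervals, and the upstream `keyed_events` collection are illustrative assumptions):

```python
import apache_beam as beam
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)
from apache_beam.transforms.window import Sessions

# Sessions close after 30 min of inactivity per key; early panes fire every
# ~2 min, late panes fire per straggler, and panes are emitted incrementally.
sessionized = (
    keyed_events  # assumed upstream PCollection of (user_id, event)
    | beam.WindowInto(
        Sessions(gap_size=30 * 60),
        trigger=AfterWatermark(
            early=AfterProcessingTime(120),
            late=AfterCount(1)),
        allowed_lateness=15 * 60,
        accumulation_mode=AccumulationMode.DISCARDING)
    | beam.GroupByKey()
)
```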
Exactly-once: what you really get
- Within the pipeline: runners give you effectively-once processing via checkpointing and retries.
- End-to-end: depends on your sink and runner semantics.
- BigQuery streaming inserts are at-least-once; use idempotency/dedup keys.
- Kafka → Beam → Elasticsearch: use deterministic IDs to avoid dup docs.
- Rule: design idempotent writes (upserts with a stable key such as `window_start|url`).
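To illustrate the stable-key rule, a DoFn can derive the upsert key from the window itself (the record shape and ID format are hypothetical):

```python
import apache_beam as beam


class AddUpsertKey(beam.DoFn):
    """Build a deterministic ID so retried or duplicate writes overwrite
    the same row instead of appending a new one."""

    def process(self, kv, window=beam.DoFn.WindowParam):
        url, count = kv
        doc_id = f"{window.start.to_rfc3339()}|{url}"  # stable: window_start|url
        yield {"id": doc_id, "url": url, "count": count}
```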
Stateful & timer-based patterns (streaming superpowers)
- Per-key dedup: store recent IDs in state + set TTL timer.
- Sessionization: accumulate events per user; emit when inactivity timer fires.
- SLO/alerts: start timer at first error; fire if recovery didn’t happen.
(Available in Python and Java; Python state/timers require appropriate runner support.)
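As one concrete instance, a minimal per-key dedup sketch with state plus an event-time TTL timer (input must be keyed, e.g. by event ID, and your runner must support Python state/timers; the 1-hour TTL is an assumption):

```python
import apache_beam as beam
from apache_beam.coders import BooleanCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec, TimerSpec, on_timer)


class DedupPerKey(beam.DoFn):
    SEEN = ReadModifyWriteStateSpec("seen", BooleanCoder())
    EXPIRY = TimerSpec("expiry", TimeDomain.WATERMARK)  # event-time TTL

    def process(self, element,
                ts=beam.DoFn.TimestampParam,
                seen=beam.DoFn.StateParam(SEEN),
                expiry=beam.DoFn.TimerParam(EXPIRY)):
        if seen.read():
            return  # duplicate within the TTL: drop it
        seen.write(True)
        expiry.set(ts + 3600)  # clear state 1 hour past the event time
        yield element

    @on_timer(EXPIRY)
    def expire(self, seen=beam.DoFn.StateParam(SEEN)):
        seen.clear()  # bound state growth


# Usage: input must be keyed, e.g. (event_id, event)
# deduped = keyed | beam.ParDo(DedupPerKey())
```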
Beam vs Spark Structured Streaming vs Flink (quick view)
| Criterion | Apache Beam (with Dataflow/Flink/Spark runner) | Spark Structured Streaming | Apache Flink |
|---|---|---|---|
| API model | Unified batch+stream spec, portable | Tight Spark integration | Native streaming-first |
| Portability | High (runners) | Low (Spark only) | Medium (Flink only) |
| Event-time features | Strong (windows, triggers, watermarks, lateness) | Strong but simpler triggers | Very strong; low-latency CEP |
| Latency profile | Low to sub-second (runner dependent) | Typically seconds+ micro-batch | Sub-second, continuous |
| Ecosystem/I/O | Broad via Beam I/O | Huge Spark ecosystem | Mature connectors; CEP |
| Ops | Choose managed (Dataflow) or self-host (Flink/Spark) | Requires Spark ops | Requires Flink ops |
Bottom line:
- Need portability + managed ops? Beam + Dataflow.
- Deep Spark shop? Spark Structured Streaming.
- Ultra-low-latency / CEP? Flink.
- Want one codebase for batch+stream across engines? Beam.
Best practices
- Design by access patterns. Decide keys/windows/SLAs first; code second.
- Event time, always. Produce/propagate event timestamps early.
- Stable IDs. Deterministic keys for idempotent sinks (avoid dupes).
- Backpressure & autoscaling: choose a runner that supports both well (e.g., Dataflow autoscaling).
- Schema-aware PCollections: use Beam Schemas to prevent drift (see the sketch after this list).
- Side inputs > globals. For small reference data; refresh carefully.
- Observability: log panes, watermarks, trigger firings; export metrics/counters.
- Load tests: validate watermarks, late-data rates, write throughput before prod.
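For the schema bullet above, one way to make a Python pipeline schema-aware is a NamedTuple registered with RowCoder (field names are illustrative):

```python
import typing

import apache_beam as beam


class PageView(typing.NamedTuple):
    user_id: str
    url: str


beam.coders.registry.register_coder(PageView, beam.coders.RowCoder)

# Downstream transforms now carry typed rows instead of loose dicts:
# events | beam.Map(lambda e: PageView(**e)).with_output_types(PageView)
```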
Common pitfalls (and how to dodge them)
- Treating streaming as micro-batch: you’ll mis-handle lateness; use windows/triggers appropriately.
- Ignoring watermark lag: dashboards “freeze” — set early triggers and visualize watermark.
- At-least-once sinks without dedup: add a natural key and upsert.
- Hot keys: skew kills throughput. Re-key with salting or pre-aggregation (sketch after this list).
- Unbounded state growth: define timers/TTL; bound state size.
- Local dev mismatch: test with DirectRunner + integration tests against a real runner (Dataflow/Flink) before launch.
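For the hot-key pitfall, a sketch of salting with two-stage aggregation (the shard count and the upstream `counts_input` collection are assumptions):

```python
import random

import apache_beam as beam

SHARDS = 10  # tune to the observed skew

salted = (
    counts_input  # assumed PCollection of (key, value) with one hot key
    | "AddSalt" >> beam.Map(
        lambda kv: ((kv[0], random.randrange(SHARDS)), kv[1]))
    | "PartialSum" >> beam.CombinePerKey(sum)   # spread work across shards
    | "DropSalt" >> beam.Map(lambda kv: (kv[0][0], kv[1]))
    | "FinalSum" >> beam.CombinePerKey(sum)     # cheap merge of SHARDS values
)
```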
Deployment cheat sheet
- Local: DirectRunner for unit/integration tests.
- Managed: Google Cloud Dataflow (auto-scaling, streaming engine, Dataflow SQL).
- Self-hosted: Flink runner (K8s/YARN), Spark runner (K8s/YARN).
- Packaging: containerize; use `--experiments=use_runner_v2` where relevant; pin SDK + runner versions.
- Config: externalize Kafka/DB creds; use a secrets manager.
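The pipeline code stays identical across these targets; only the options change. A sketch (project, region, and endpoints are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

local_opts = PipelineOptions(["--runner=DirectRunner"])

dataflow_opts = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project", "--region=us-central1",  # placeholders
    "--streaming", "--experiments=use_runner_v2",
])

flink_opts = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=flink-jobmanager:8081",  # placeholder endpoint
])
```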
Where Beam fits with NoSQL & real-time analytics
- MongoDB / DynamoDB upserts: idempotent materialized views keyed by business entity + window.
- Elasticsearch search views: power live dashboards with early + accumulating panes.
- Cassandra time-series: write partitioned by (metric_id, bucket_start); avoid wide partitions.
- Redis: emit hot aggregates for low-latency APIs; TTL-align with windowing (see the sketch after this list).
- BigQuery lakehouse: batch compacted facts; streaming trickle with dedup keys.
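For the Redis pattern above, a hypothetical custom sink DoFn (Beam's Python SDK ships no Redis I/O, so this uses the `redis` client directly; host, key format, and TTL are assumptions):

```python
import apache_beam as beam


class WriteHotAggregateToRedis(beam.DoFn):
    """Push per-window aggregates to Redis with a TTL that matches the
    window size, so stale panes expire on their own."""

    def __init__(self, host="redis", port=6379, ttl_secs=300):
        self.host, self.port, self.ttl_secs = host, port, ttl_secs

    def setup(self):
        import redis  # connect once per worker, not per element
        self.client = redis.Redis(host=self.host, port=self.port)

    def process(self, kv, window=beam.DoFn.WindowParam):
        url, count = kv
        key = f"pageviews:{window.start.to_rfc3339()}:{url}"
        self.client.setex(key, self.ttl_secs, count)
```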
Internal link ideas (for your blog/docs)
- Windowing & Triggers Deep Dive: Event Time vs Processing Time
- Designing Idempotent Writes to BigQuery and Elasticsearch
- Kafka to Beam: At-Least-Once to Exactly-Once with Deterministic Keys
- Choosing a Runner: Dataflow vs Flink vs Spark for Streaming Workloads
- Sessionization Patterns with Beam State & Timers (Python/Java)
Conclusion & takeaways
Apache Beam lets you stop duplicating batch and streaming logic. Model your problem once (with event-time, windows, triggers), pick a runner that matches your ops reality, and ship pipelines that are portable, correct, and observable.
Remember:
- Get the keys and windows right first.
- Make sinks idempotent.
- Measure watermarks and late data.
- Prefer managed when speed matters; self-host when control matters.
Image prompt (ready for DALL·E/Midjourney)
“A clean, modern data architecture diagram illustrating an Apache Beam pipeline reading from Kafka, applying windowing and triggers, and writing to Elasticsearch and BigQuery via different runners (Dataflow, Flink). Minimalistic, high contrast, 3D isometric style with labeled windows, watermarks, and late data arrows.”
Tags
#ApacheBeam #DataEngineering #StreamingData #BatchProcessing #EventTime #Windowing #Kafka #Dataflow #Flink #BigData