Spark Structured Streaming: Real-Time Data Without the Drama
If your dashboards are stale, your fraud rules miss the first minute of an attack, or your ETL arrives “tomorrow morning,” you’re leaving value on the floor. Spark Structured Streaming lets you turn data-in-motion into fresh features, alerts, and metrics—at the same scale you already trust Spark for batch. The trick is designing streams that are correct, resilient, and cheap under real load. Let’s get you there.
Why Spark Streaming Matters (and when it doesn’t)
- You already run Spark for batch. Reusing skills and infrastructure lowers time-to-prod.
- The same DataFrame & SQL APIs apply to streaming—no bespoke framework.
- Exactly-once processing semantics for many use cases; strong integrations with Kafka, Delta/Parquet, cloud object storage.
- When not to use it: ultra-low-latency (<20 ms) trading, per-event RPC, or when a tiny Flink job would be simpler. Pick the right hammer.
Core Concepts (No Hand-Waving)
Micro-Batch vs Continuous Processing
Spark primarily uses micro-batches: small batches of newly arrived records processed at each trigger interval.
| Aspect | Micro-Batch (default) | Continuous Processing (experimental niches) |
|---|---|---|
| Latency | 100 ms–seconds | ~10–50 ms |
| Throughput | Very high | High |
| Features | Full SQL/DataFrame, stateful ops, joins, watermarks | Limited APIs |
| Operational Maturity | Production-grade | Use only if you truly need sub-100 ms |
Stick with micro-batch unless you have a proven continuous-latency requirement.
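To make the choice concrete, here is a minimal sketch of how each trigger mode is set on the writer. The rate source and console sink are dev-only stand-ins, and the continuous trigger only supports simple map-like queries.

```python
# Minimal sketch: picking a trigger mode (rate source + console sink are dev-only stand-ins).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-demo").getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

writer = events.writeStream.format("console")

# Default micro-batch: the next batch starts as soon as the previous one finishes.
# query = writer.start()

# Fixed-interval micro-batch: the usual production choice; smooths compute and IO.
query = writer.trigger(processingTime="30 seconds").start()

# Available-now: drain whatever is currently available, then stop (good for backfills).
# query = writer.trigger(availableNow=True).start()

# Continuous (experimental): map-like queries only, no aggregations or stream-stream joins.
# query = writer.trigger(continuous="1 second").start()

query.awaitTermination()
```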
Sources & Sinks
- Common sources: Kafka, Kinesis, files (cloud object storage), sockets (dev only).
- Common sinks: Delta/Parquet/S3/ADLS/GCS, Kafka, console (dev), memory (tests).
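If you want a dev loop before wiring up Kafka, a minimal sketch with a file source and the console sink (paths and schema are placeholders) looks like this; swap the sink to Delta once the transformation is right.

```python
# Dev-loop sketch: JSON file source -> console sink (paths and schema are placeholders).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("file-stream-dev").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

files = (spark.readStream
         .format("json")
         .schema(schema)                     # file sources require an explicit schema
         .option("maxFilesPerTrigger", 100)  # cap how many new files each batch picks up
         .load("s3://my-bucket/landing/orders/"))

query = (files.writeStream
         .format("console")                 # dev only; switch to "delta" for real runs
         .outputMode("append")
         .start())
query.awaitTermination()
```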
Checkpointing
Every stream needs a reliable checkpoint location (e.g., S3/ADLS/GCS). It stores offsets, progress, and state so restarts are safe.
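A minimal sketch of the "one checkpoint per query" rule, with placeholder bucket paths and assuming the delta-spark package is available:

```python
# Sketch: one durable checkpoint directory per query (bucket paths are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

q1 = (events.writeStream
      .format("delta")
      .option("checkpointLocation", "s3://my-bucket/chk/events_bronze")  # used by q1 only
      .start("s3://my-bucket/delta/events_bronze"))

q2 = (events.writeStream
      .format("delta")
      .option("checkpointLocation", "s3://my-bucket/chk/events_audit")   # never shared with q1
      .start("s3://my-bucket/delta/events_audit"))

# Restarting the app with the same checkpoint paths resumes each query from its recorded
# offsets and state; sharing one path between queries leads to conflicts, and deleting a
# path means reprocessing from scratch.
```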
Watermarking & Event Time
Real data is late. Watermarks tell Spark how long to wait for late events during aggregations and dedup. You trade off correctness vs latency/cost consciously.
Stateful Operators
Aggregations, dedup, and stream-stream joins keep state. State grows with key cardinality and watermark horizons; left unchecked, it will eat your cluster.
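Stream-stream joins are where unbounded state bites hardest, so here is a minimal sketch, with two rate sources standing in for real topics and illustrative column names: watermarks on both sides plus a time-range join condition are what allow Spark to purge old join state.

```python
# Sketch: stream-stream join with bounded state; rate sources stand in for real topics.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("join-demo").getOrCreate()

impressions = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
               .selectExpr("value AS impression_ad_id", "timestamp AS impression_time")
               .withWatermark("impression_time", "2 hours"))

clicks = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .selectExpr("value AS click_ad_id", "timestamp AS click_time")
          .withWatermark("click_time", "3 hours"))

# Both watermarks plus the time bound let Spark drop join state it no longer needs.
joined = impressions.join(
    clicks,
    expr("""
        click_ad_id = impression_ad_id AND
        click_time BETWEEN impression_time AND impression_time + interval 1 hour
    """),
    "leftOuter",  # outer stream-stream joins require watermarks and a time constraint
)

query = joined.writeStream.format("console").start()
```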
A Practical Example (Kafka → Real-Time Metrics in Delta)
Goal: Ingest orders from Kafka, deduplicate on order_id, aggregate revenue per minute (event-time), and write to a Delta table for BI.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_timestamp, window, sum as _sum
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (SparkSession.builder
         .appName("orders-stream")
         .getOrCreate())

schema = StructType([
    StructField("order_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", StringType())  # ISO8601
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
       .option("subscribe", "orders")
       .option("startingOffsets", "latest")
       .load())

orders = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("r"))
          .select(
              col("r.order_id"),
              col("r.user_id"),
              col("r.amount"),
              to_timestamp("r.event_ts").alias("event_time"))
          )

# 1) Dedup by business key + watermark.
#    dropDuplicatesWithinWatermark (Spark 3.5+) lets the watermark purge dedup state;
#    on older versions, use dropDuplicates(["order_id", "event_time"]) instead, because
#    dropDuplicates on order_id alone keeps dedup state forever.
deduped = (orders
           .withWatermark("event_time", "10 minutes")
           .dropDuplicatesWithinWatermark(["order_id"])
           )

# 2) Tumbling window revenue
metrics = (deduped
           .withWatermark("event_time", "10 minutes")
           .groupBy(window(col("event_time"), "1 minute"))
           .agg(_sum("amount").alias("revenue"))
           .select(
               col("window.start").alias("minute_start"),
               col("window.end").alias("minute_end"),
               col("revenue"))
           )

# 3) Write to Delta (mergeable, ACID, good for BI)
query = (metrics.writeStream
         .format("delta")
         .option("checkpointLocation", "s3://my-bucket/chk/orders_metrics")
         .outputMode("append")
         .trigger(processingTime="30 seconds")
         .start("s3://my-bucket/delta/orders_metrics"))

query.awaitTermination()
```
Notes that save you pain:
- The same watermark on dedup and aggregation keeps state bounded; watermark-aware dedup (or including the event-time column in the dedup keys) is what lets the dedup state expire.
- Use business-key dedup (`order_id`) when source offsets don't guarantee uniqueness.
- Delta makes downstream consumption sane (schema, ACID). If you must write Parquet, own the concurrency story.
Designing for Correctness
Choose the Right Time
- Event time (from payload) beats ingest time for out-of-order data.
- Set a realistic watermark (e.g., 10 minutes). Too small → data loss. Too large → state bloat and lag.
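Rather than guessing, you can measure how late events actually arrive on a bounded sample and set the watermark near a high percentile of the observed lag. A minimal sketch, assuming the orders topic and payload from the example above and that Kafka's broker timestamp approximates arrival time:

```python
# Sketch: estimate event-time lateness from a bounded Kafka sample to choose a watermark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object, to_timestamp, expr

spark = SparkSession.builder.appName("lateness-probe").getOrCreate()

sample = (spark.read                      # batch read over a bounded offset range
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "orders")
          .option("startingOffsets", "earliest")
          .option("endingOffsets", "latest")
          .load())

lateness = (sample
            .select(
                col("timestamp").alias("broker_time"),   # Kafka append time
                to_timestamp(get_json_object(col("value").cast("string"),
                                             "$.event_ts")).alias("event_time"))
            .select((col("broker_time").cast("long")
                     - col("event_time").cast("long")).alias("lag_seconds")))

# P99 of observed lag is a reasonable starting point for the watermark delay.
lateness.select(expr("percentile_approx(lag_seconds, 0.99)").alias("p99_lag_s")).show()
```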
Exactly-Once (What You Actually Get)
- Spark ensures exactly-once processing relative to the sink it controls (e.g., Delta).
- End-to-end exactly-once requires idempotent downstream behavior or transactional sinks. If your consumer double-counts append-only Parquet, that’s your bug, not Spark’s.
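For the "idempotent downstream behavior" half, one common pattern is to swap the plain append for a foreachBatch writer that MERGEs each micro-batch into Delta keyed by the window start, so replays overwrite instead of double-counting. A minimal sketch, assuming the metrics DataFrame and paths from the example above, the delta-spark package, and a target table that already exists:

```python
# Sketch: idempotent upsert sink via foreachBatch + Delta MERGE (paths/tables assumed to exist).
from delta.tables import DeltaTable

TARGET_PATH = "s3://my-bucket/delta/orders_metrics"

def upsert_metrics(batch_df, batch_id):
    """Merge one micro-batch into the target table, keyed by the window start."""
    # batch_df.sparkSession requires a recent PySpark; fall back to a global `spark` otherwise.
    target = DeltaTable.forPath(batch_df.sparkSession, TARGET_PATH)
    (target.alias("t")
     .merge(batch_df.alias("s"), "t.minute_start = s.minute_start")
     .whenMatchedUpdateAll()      # replays update in place instead of double-counting
     .whenNotMatchedInsertAll()
     .execute())

query = (metrics.writeStream                      # `metrics` from the example above
         .foreachBatch(upsert_metrics)
         .option("checkpointLocation", "s3://my-bucket/chk/orders_metrics_merge")
         .outputMode("update")                    # emit updated windows as they change
         .trigger(processingTime="30 seconds")
         .start())
```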
Checkpoint Hygiene
- One checkpoint directory per query.
- Don’t copy/move it between environments.
- If you change the query’s topology (e.g., remove a stateful op), start fresh or be explicit about migration.
Performance & Cost Tuning (What Actually Moves the Needle)
- Batch Size vs Trigger
  - Default micro-batches can produce bursty loads.
  - Use `trigger(processingTime="30 seconds")` to smooth compute and storage IO.
- Shuffle & Partitions
  - Tune `spark.sql.shuffle.partitions` (default ~200) to match cluster size.
  - Avoid blowing up partitions with high-cardinality keys; pre-aggregate before wide shuffles when possible.
- State Store Discipline
  - Keep watermarks tight.
  - Prefer keys that are naturally bounded (e.g., by time bucket).
  - Monitor state metrics: `numRowsTotal`, `memoryUsedBytes`, batch duration (see the sketch after this list).
- Backpressure
  - Let Spark auto-tune read rates, but cap when needed: `maxOffsetsPerTrigger` (Kafka) to protect sinks or slow consumers.
- Serialization & Schema
  - Define explicit schemas (avoid `inferSchema`) and cast once.
  - Avoid UDFs in hot paths; use built-ins or Spark SQL.
- Cluster Sizing
  - Streaming prefers stability over peak throughput.
  - Overprovision a little to keep processing time < trigger interval at P95.
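A minimal sketch of the knobs above with placeholder values, plus how to read timing and state metrics off a running query:

```python
# Sketch: common tuning knobs and progress monitoring (values and paths are placeholders).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuned-orders-stream")
         .config("spark.sql.shuffle.partitions", "64")   # match cluster cores, not the ~200 default
         .getOrCreate())

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
       .option("subscribe", "orders")
       .option("maxOffsetsPerTrigger", 50000)            # cap records pulled per micro-batch
       .load())

query = (raw.writeStream
         .format("delta")
         .option("checkpointLocation", "s3://my-bucket/chk/orders_raw")
         .trigger(processingTime="30 seconds")
         .start("s3://my-bucket/delta/orders_raw"))

# Poll the most recent micro-batch's metrics (also exposed via StreamingQueryListener).
# stateOperators is empty for stateless queries; for the dedup/aggregation query it reports state size.
progress = query.lastProgress
if progress:
    print("trigger execution (ms):", progress["durationMs"]["triggerExecution"])
    print("input rows/s:", progress["inputRowsPerSecond"])
    for op in progress.get("stateOperators", []):
        print("state rows:", op["numRowsTotal"], "state bytes:", op["memoryUsedBytes"])
```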
Testing & Deploying Streams (No Cowboy Moves)
- Unit tests with the memory sink for transformations; golden datasets for regressions (see the sketch after this list).
- Contract tests against Kafka topics (schemas, retention, partitions).
- Load tests with replayed topics; watch checkpoint lag, batch duration, sink commit times.
- Blue/Green: start a new consumer group; validate outputs side-by-side before cutover.
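A minimal sketch of the unit-test idea: factor the transformation into a plain function and assert against a small golden DataFrame; the memory sink plays the same role for end-to-end streaming smoke tests. The function and test data here are illustrative.

```python
# Sketch: unit-testing shared transformation logic on static data (pytest-style).
# The same function runs in batch tests and in the streaming job, because the API is shared.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, window, sum as _sum

def aggregate_revenue(orders: DataFrame) -> DataFrame:
    """Shared logic: 1-minute tumbling revenue (same code path as the stream)."""
    return (orders
            .groupBy(window(col("event_time"), "1 minute"))
            .agg(_sum("amount").alias("revenue"))
            .select(col("window.start").alias("minute_start"), col("revenue")))

def test_aggregate_revenue():
    spark = SparkSession.builder.master("local[2]").appName("unit").getOrCreate()
    orders = spark.createDataFrame(
        [("o1", 10.0, "2024-01-01 00:00:10"),
         ("o2", 5.0,  "2024-01-01 00:00:40"),
         ("o3", 7.0,  "2024-01-01 00:01:05")],
        "order_id string, amount double, event_time string",
    ).withColumn("event_time", col("event_time").cast("timestamp"))

    result = {r["minute_start"].isoformat(): r["revenue"]
              for r in aggregate_revenue(orders).collect()}

    assert result["2024-01-01T00:00:00"] == 15.0
    assert result["2024-01-01T00:01:00"] == 7.0
```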
Common Pitfalls (and the fixes)
- “We lost data!”: You set watermarks too small, or you dedup’d on a non-unique key. Fix: widen the watermark; use a true business key; add reconciliation logic.
- Unbounded State / OOM: Late data combined with big windows or high-cardinality keys. Fix: reduce window size, bucket keys, tighten the watermark, or aggregate earlier.
- Checkpoint Corruption: Manual edits, accidental deletes, or reusing across envs. Fix: treat it like a database: back it up or recreate with discipline.
- Small Files Hell: Writing tiny micro-batch files to S3 kills query engines. Fix: longer triggers, Delta with auto-optimize/compaction jobs, or periodic batch OPTIMIZE (see the sketch after this list).
- Exactly-Once Myths: You append to Parquet and your downstream double-counts on reprocess. Fix: write to transactional sinks or enforce idempotency in consumers.
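For the small-files fix, a minimal compaction sketch, assuming Delta Lake 2.0+ (or Databricks) and the table path from the example above; run it as a scheduled batch job, not inside the stream:

```python
# Sketch: periodic compaction of a streaming Delta table (Delta Lake 2.0+ assumed).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-orders-metrics").getOrCreate()

table = DeltaTable.forPath(spark, "s3://my-bucket/delta/orders_metrics")

# Rewrite many small files into fewer large ones.
table.optimize().executeCompaction()

# Optionally remove files no longer referenced by the table (default retention applies).
table.vacuum()
```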
Architecture Patterns You’ll Actually Use
- Ingest & Clean: Kafka → Structured Streaming → Delta bronze
- Real-Time Aggregates: Delta bronze → Streaming agg → Delta silver (1-min metrics)
- Materialized Views for BI: Silver → Auto-refresh dashboards (e.g., Databricks SQL)
- Feature Pipelines: Streaming join with reference data (broadcast) → Delta feature table (sketch after this list)
- Replay & Backfill: Reprocess from Kafka offsets or Delta bronze with the same code paths
Minimal SQL Version (yes, pure SQL works)
```sql
-- Create a streaming table from Kafka (platform syntax may vary)
CREATE OR REPLACE STREAMING TABLE orders_raw
TBLPROPERTIES (checkpointLocation = 's3://.../chk/orders_raw') AS
SELECT
  r.order_id,
  r.user_id,
  r.amount,
  to_timestamp(r.event_ts) AS event_time
FROM (
  SELECT from_json(
           CAST(value AS STRING),
           schema_of_json('{"order_id":"x","user_id":"y","amount":0.0,"event_ts":"2024-01-01T00:00:00Z"}')
         ) AS r
  FROM KAFKA.`orders`
);

-- Windowed metrics (the business-key dedup is easier to express in the DataFrame API)
CREATE OR REPLACE STREAMING TABLE orders_metrics
TBLPROPERTIES (checkpointLocation = 's3://.../chk/orders_metrics') AS
SELECT
  window.start AS minute_start,
  window.end AS minute_end,
  SUM(amount) AS revenue
FROM STREAM orders_raw
  WITH WATERMARK event_time INTERVAL 10 MINUTES
GROUP BY window(event_time, '1 minute');
```
(Exact syntax differs by platform; keep the concepts: schema, watermark, window, checkpoint.)
Best Practices Checklist
- One checkpoint per query, in durable storage
- Watermark every stateful op (dedup, window, stream-stream joins)
- Business-key dedup in addition to offsets
- Delta (or transactional) sinks for correctness and compaction
- Trigger that balances latency and small-file pressure
- Metrics: processing time, input rows/s, state size, commit time
- Load test with realistic skew and late data
- Backfill story using the same code path
Internal Link Ideas
- “Kafka Schema Evolution with Avro/Protobuf: Safe Upgrades”
- “Delta Lake Compaction & Z-Ordering for Faster BI”
- “Designing Idempotent Sinks for Exactly-Once Outcomes”
- “Watermarking Deep Dive: Late Data Without Tears”
- “Streaming Joins at Scale: Reference Data and Skew Busting”
Conclusion & Takeaways
Spark Structured Streaming gives you a unified batch + streaming engine with real-world reliability. Default to micro-batch, put checkpoints on durable storage, watermark every stateful step, and prefer Delta (or transactional sinks). Test with replayable data, size clusters for steady state, and keep state bounded. Do this, and your “near-real-time” stops being a slideware promise and becomes a dependable pipeline.
Call to action: Pick one real stream (orders, clicks, device telemetry). Implement the dedup + 1-minute revenue aggregation above. Ship it. Then iterate on watermark, trigger, and compaction until costs and latency are boring.
Image Prompt
“A clean, modern data architecture diagram showing a Kafka → Spark Structured Streaming → Delta Lake pipeline with checkpoints, watermarking, and a real-time aggregation node — minimalistic, high contrast, 3D isometric style.”
Tags
#Spark #StructuredStreaming #Kafka #DeltaLake #DataEngineering #StreamingETL #RealTimeAnalytics #Scalability #BigData #Architecture
Spark, StructuredStreaming, Kafka, DeltaLake, DataEngineering, StreamingETL, RealTimeAnalytics, Scalability, BigData, Architecture




