Streaming vs Batch Processing: The Real Tradeoffs Nobody Talks About

TL;DR: Streaming infrastructure costs 3-10x more than batch, requires specialized operational knowledge, and introduces failure modes that batch pipelines simply don't have. Most teams that adopt streaming don't actually need sub-second latency. Micro-batch at 1-15 minute intervals covers 80% of "real-time" use cases at a fraction of the cost. I learned this the hard way by over-engineering a streaming platform that a well-tuned Airflow DAG could have handled.

Key Takeaways

  • Streaming is genuinely needed for fraud detection, live dashboards, operational alerting, and user-facing recommendations. Everything else deserves scrutiny.
  • The cost gap is real: running Kafka + Flink in production costs 3-10x more than equivalent batch infrastructure, and that's before you count the engineering hours.
  • Micro-batch (Spark Structured Streaming, dbt with short intervals, or simple cron jobs) covers the vast majority of "we need real-time" requests.
  • Lambda architecture is not dead, just misunderstood. Kappa sounds elegant but creates operational nightmares for historical reprocessing.
  • The complexity tax on streaming shows up in debugging, testing, schema evolution, and on-call burden — not in the initial build.

How I Learned This the Expensive Way

Three years ago, I led a project to rebuild our analytics pipeline as a "real-time streaming platform." The pitch was compelling: instead of waiting for nightly batch runs, stakeholders would see metrics update within seconds. The executive sponsor loved it. The architecture diagram looked beautiful on a whiteboard. Kafka, Flink, a real-time feature store, streaming joins — the whole nine yards.

Six months later, we had a system that was technically impressive and operationally devastating. Our on-call rotation went from "check Airflow once in the morning" to "get paged at 3 AM because a Flink checkpoint failed and consumer lag is spiking." The monthly AWS bill tripled. And the kicker? When I actually talked to the analysts using the data, most of them ran their queries once a day, in the morning, with coffee. They didn't need sub-second freshness. They needed data that was correct and available by 8 AM.

That experience fundamentally changed how I think about streaming vs batch processing. Not because streaming is bad — it's an incredible tool when you genuinely need it. But because the industry has a hype problem, and too many teams are adopting streaming architectures for problems that batch solves better, cheaper, and more reliably.

When You Actually Need Streaming

Let me be specific about when real-time data processing is genuinely the right call. I keep a mental checklist, and the use case needs to hit at least two of these criteria:

  1. The business action happens in seconds. Fraud detection that blocks a transaction. A recommendation engine that updates mid-session. An IoT system that triggers an alarm. If the response to the data needs to happen before the user closes the tab, you need streaming.
  2. The data is naturally unbounded. Clickstreams, sensor readings, log events — data that arrives continuously and has no natural "end of batch" boundary. Forcing this into batch windows creates awkward late-arrival handling.
  3. Staleness has a measurable cost. Not "it would be nice to see it sooner," but "every minute of delay costs us $X in fraud losses" or "users churn when recommendations are 24 hours stale." If you can't quantify the cost of latency, you probably don't need streaming.
  4. Volume requires incremental processing. When you're processing billions of events per day and reprocessing the full dataset every hour is computationally infeasible. Streaming lets you process incrementally, touching only new data.
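The "hit at least two criteria" rule can be sketched as a toy scoring function. Everything here is a hypothetical helper for illustration, not a real library:

```python
# Hypothetical helper encoding the checklist above: a use case should
# hit at least two criteria before streaming is on the table.
CRITERIA = (
    "action_in_seconds",      # business action happens in seconds
    "unbounded_data",         # naturally continuous, no batch boundary
    "quantified_staleness",   # staleness has a measurable dollar cost
    "incremental_required",   # full reprocessing is computationally infeasible
)

def needs_streaming(**flags: bool) -> bool:
    """Return True when at least two checklist criteria are met."""
    score = sum(flags.get(c, False) for c in CRITERIA)
    return score >= 2

# A BI dashboard: data is unbounded, but nothing else applies.
print(needs_streaming(unbounded_data=True))  # False
# Fraud detection: action in seconds, quantified cost of delay.
print(needs_streaming(action_in_seconds=True, quantified_staleness=True))  # True
```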

Here's the uncomfortable truth: most internal analytics, BI dashboards, ML feature pipelines, and data warehouse loads don't meet these criteria. "Our stakeholders want real-time dashboards" almost always means "our stakeholders want dashboards that refresh every 15 minutes instead of every 24 hours." That's not streaming. That's a faster batch schedule.

The Cost Nobody Warns You About

Let me break down the real cost of running a streaming architecture versus batch, because this is where the conference talks conveniently get vague.

| Cost category | Batch (Airflow + Spark) | Streaming (Kafka + Flink) | Multiplier |
| --- | --- | --- | --- |
| Compute (ephemeral vs always-on) | Spot instances, scale to zero | 24/7 TaskManagers, reserved capacity | 3-5x |
| Message broker | None (direct reads) | Kafka cluster (3+ brokers, ZK/KRaft) | +$2-8k/mo |
| State management | Warehouse handles it | RocksDB, checkpoint storage, state migration | +30% ops |
| Engineering hours (build) | 1-2 weeks per pipeline | 3-6 weeks per pipeline | 2-3x |
| Engineering hours (maintain) | ~2 hrs/week on-call | ~8-15 hrs/week on-call | 4-7x |
| Incident complexity | Rerun the DAG | Restore from checkpoint, handle offset reset, reprocess window | Hard to quantify |

The compute line is the one that bites hardest. Batch workloads are ephemeral: spin up a Spark cluster, process the data, tear it down. You can use spot instances. You can scale to zero overnight. Streaming workloads run 24/7. Your Flink TaskManagers need to be provisioned for peak throughput at all times, including 3 AM when nobody is looking at the dashboard. That idle capacity is pure waste for most use cases.

A mid-size company I consulted for last year was spending $14,000/month on their Kafka + Flink setup for a pipeline that processed user events into analytics aggregates. We replaced it with a Spark batch job running every 15 minutes on spot instances. Monthly cost dropped to $2,100. Same data, same destination tables, effectively the same freshness for their actual use case.
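A back-of-envelope calculation shows why the gap is structural, not incidental. All rates below are illustrative assumptions, not quotes from any provider:

```python
# Back-of-envelope comparison of always-on streaming compute vs
# ephemeral micro-batch. All prices are illustrative assumptions.
HOURS_PER_MONTH = 730

# Streaming: reserved capacity sized for peak, running 24/7.
streaming_nodes = 6
on_demand_rate = 0.40          # $/node-hour (assumed)
streaming_cost = streaming_nodes * on_demand_rate * HOURS_PER_MONTH

# Micro-batch: spot cluster that runs ~5 minutes out of every 15.
batch_nodes = 8
spot_rate = 0.12               # $/node-hour (assumed ~70% spot discount)
duty_cycle = 5 / 15            # fraction of each hour actually running
batch_cost = batch_nodes * spot_rate * HOURS_PER_MONTH * duty_cycle

print(f"streaming ~= ${streaming_cost:,.0f}/mo")   # ~= $1,752/mo
print(f"micro-batch ~= ${batch_cost:,.0f}/mo")     # ~= $234/mo
print(f"multiplier ~= {streaming_cost / batch_cost:.1f}x")
```

Even with a slightly larger batch cluster, the duty cycle and spot pricing put the multiplier comfortably inside the 3-10x range from the table above.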

The Same Pipeline: Batch vs Streaming

Let's make this concrete. Here's a common scenario: aggregate user page views into hourly counts per page, and write the results to a warehouse table. First, the batch approach.

Batch Version (PySpark + Airflow)

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def aggregate_page_views(spark, run_date: str):
    """Batch job: reads a day's events, aggregates, writes to warehouse."""
    events = (
        spark.read
        .parquet(f"s3://data-lake/raw/page_views/dt={run_date}")
        .filter(F.col("event_type") == "page_view")
    )

    hourly_counts = (
        events
        .withColumn("hour", F.hour("event_timestamp"))
        .groupBy("page_url", "hour")
        .agg(
            F.count("*").alias("view_count"),
            F.countDistinct("user_id").alias("unique_users"),
            F.avg("time_on_page_seconds").alias("avg_time_on_page"),
        )
        .withColumn("run_date", F.lit(run_date))
    )

    hourly_counts.write.mode("overwrite").partitionBy("run_date").parquet(
        "s3://data-warehouse/page_view_hourly/"
    )
    return hourly_counts.count()

# Airflow DAG calls this once per day (or every hour, or every 15 min)
# If it fails: just rerun. Idempotent by design.

Simple. Readable. Testable with a local SparkSession and sample data. If it fails, you rerun it. The overwrite mode on a date partition means reruns are naturally idempotent. Any junior engineer on your team can debug this.

Streaming Version (Flink + Kafka)

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60000)  # checkpoint every 60 seconds
env.get_checkpoint_config().set_min_pause_between_checkpoints(30000)

t_env = StreamTableEnvironment.create(env)

# Define Kafka source
t_env.execute_sql("""
    CREATE TABLE page_view_events (
        user_id STRING,
        page_url STRING,
        event_type STRING,
        time_on_page_seconds DOUBLE,
        event_timestamp TIMESTAMP(3),
        WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '5' MINUTE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'page-views',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'page-view-aggregator',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Define sink
t_env.execute_sql("""
    CREATE TABLE page_view_hourly (
        page_url STRING,
        window_start TIMESTAMP(3),
        window_end TIMESTAMP(3),
        view_count BIGINT,
        unique_users BIGINT,
        avg_time_on_page DOUBLE
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://warehouse:5432/analytics',
        'table-name' = 'page_view_hourly',
        'driver' = 'org.postgresql.Driver'
    )
""")

# Tumbling window aggregation
t_env.execute_sql("""
    INSERT INTO page_view_hourly
    SELECT
        page_url,
        window_start,
        window_end,
        COUNT(*) AS view_count,
        COUNT(DISTINCT user_id) AS unique_users,
        AVG(time_on_page_seconds) AS avg_time_on_page
    FROM TABLE(
        TUMBLE(TABLE page_view_events, DESCRIPTOR(event_timestamp), INTERVAL '1' HOUR)
    )
    WHERE event_type = 'page_view'
    GROUP BY page_url, window_start, window_end
""")

More code. More concepts (watermarks, checkpointing, consumer groups, windowing semantics). And this is the simple version. I haven't shown you what happens when you need to handle late-arriving data, deal with schema changes in the Kafka topic, or recover from a corrupted checkpoint. Each of those is a production incident waiting to happen.

Now ask yourself: if the end users are happy with hourly granularity, which version would you rather debug at 2 AM?

The Complexity Tax Is Hidden

The initial build of a streaming pipeline is maybe 2-3x harder than batch. But the ongoing complexity tax is where it really gets you. Here's what I mean:

Testing

Batch pipelines are functions: input data in, output data out. You can write a unit test with a DataFrame, run the transformation, and assert on the result. Streaming pipelines are stateful, time-dependent processes. Testing a windowed aggregation means simulating event time, watermark advancement, and late arrivals. Most teams give up and just test in staging. That's not testing — that's praying.
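The "batch pipelines are functions" point works even without a cluster. Here is the same idea in miniature, pure Python standing in for the DataFrame transformation:

```python
# A batch transformation is a pure function: fixed input in,
# deterministic output out, directly assertable in a unit test.
from collections import Counter
from datetime import datetime

def hourly_counts(events):
    """Count page views per (page_url, hour) from a list of event dicts."""
    return Counter(
        (e["page_url"], e["event_timestamp"].hour)
        for e in events
        if e["event_type"] == "page_view"
    )

# Unit test: no broker, no watermarks, no simulated event time.
sample = [
    {"page_url": "/home", "event_type": "page_view",
     "event_timestamp": datetime(2026, 1, 5, 9, 12)},
    {"page_url": "/home", "event_type": "page_view",
     "event_timestamp": datetime(2026, 1, 5, 9, 45)},
    {"page_url": "/home", "event_type": "click",
     "event_timestamp": datetime(2026, 1, 5, 9, 50)},
]
assert hourly_counts(sample) == {("/home", 9): 2}
```

Testing the equivalent windowed streaming aggregation means standing up a test harness that controls event time and watermark advancement, which is why so many teams skip it.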

Schema Evolution

In batch, you update the schema in your source table, adjust your Spark job, and rerun. The old partitions still have the old schema, and Spark handles schema merging gracefully. In streaming, a schema change in a Kafka topic can break every downstream consumer simultaneously. You need schema registries (Confluent, AWS Glue), compatibility modes (backward, forward, full), and a deployment process that coordinates producer and consumer changes. It's an entire operational discipline.
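One concrete shape of that discipline is backward compatibility on the consumer side: every field a newer schema adds needs a default so older events still parse. A minimal sketch on a JSON topic, with illustrative field names matching the examples in this post:

```python
# Sketch of backward-compatible consumption from a JSON topic: the
# consumer supplies defaults for fields older producers omit and
# ignores fields it does not know about. Field names are illustrative.
import json

SCHEMA_DEFAULTS = {
    "user_id": None,
    "page_url": None,
    "event_type": "unknown",
    "time_on_page_seconds": 0.0,   # added in schema v2; v1 events lack it
}

def parse_event(raw: bytes) -> dict:
    """Deserialize so a v1 producer doesn't break a v2 consumer."""
    payload = json.loads(raw)
    return {field: payload.get(field, default)
            for field, default in SCHEMA_DEFAULTS.items()}

old_event = b'{"user_id": "u1", "page_url": "/home", "event_type": "page_view"}'
print(parse_event(old_event)["time_on_page_seconds"])  # 0.0
```

A schema registry enforces this contract automatically at publish time; the sketch above is what you end up hand-rolling without one.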

Reprocessing

This is the big one. In batch, reprocessing is trivial: backfill the date range by rerunning the DAG for those dates. Done. In streaming, reprocessing means either resetting consumer offsets to replay historical data through your streaming job (which fights with your windowing logic and state) or maintaining a separate batch pipeline for backfills. If you choose the second option, congratulations — you've just built the Lambda architecture you were trying to avoid.
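The batch backfill really is as boring as it sounds. A sketch, where `run_job` is a hypothetical stand-in for triggering the DAG or calling the Spark entry point:

```python
# Batch backfill in miniature: iterate the date range and rerun the
# idempotent daily job. `run_job` is a hypothetical stand-in for
# triggering the Airflow DAG or the Spark entry point.
from datetime import date, timedelta

def backfill(run_job, start: date, end: date) -> list:
    """Rerun the daily job for every date in [start, end]."""
    processed = []
    day = start
    while day <= end:
        run_date = day.isoformat()
        run_job(run_date)          # overwrite-by-partition makes reruns safe
        processed.append(run_date)
        day += timedelta(days=1)
    return processed

dates = backfill(lambda d: None, date(2026, 1, 1), date(2026, 1, 3))
print(dates)  # ['2026-01-01', '2026-01-02', '2026-01-03']
```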

Debugging Production Issues

Batch failure: "The job for 2026-01-05 failed at the transform step. Here's the stack trace. Rerun it." Streaming failure: "Consumer lag has been increasing for the past 3 hours. Is it a slow consumer? Increased throughput on the producer side? A checkpoint that's taking too long? A GC pause? A rebalance? The answer might be in the TaskManager logs, or the JobManager logs, or the Kafka broker logs, or the ZooKeeper logs, or the metrics from 3 hours ago that have already been rotated out of your monitoring window."

The Micro-Batch Middle Ground

Here's what I recommend for 80% of teams who think they need streaming: micro-batch processing. Run your batch pipeline more frequently. It sounds too simple to be a real solution, and that's exactly why it works.

# Spark Structured Streaming in micro-batch mode
# This gives you "near real-time" with batch simplicity

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType
)

spark = SparkSession.builder.appName("micro-batch-page-views").getOrCreate()

# Schema for the JSON payload on the topic
schema = StructType([
    StructField("user_id", StringType()),
    StructField("page_url", StringType()),
    StructField("event_type", StringType()),
    StructField("time_on_page_seconds", DoubleType()),
    StructField("event_timestamp", TimestampType()),
])

# Read from source as a stream, but process in micro-batches
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "page-views")
    .option("startingOffsets", "latest")
    .load()
)

parsed = (
    stream
    .selectExpr("CAST(value AS STRING) as json_str")
    .select(F.from_json("json_str", schema).alias("data"))
    .select("data.*")
    .filter(F.col("event_type") == "page_view")
    .withWatermark("event_timestamp", "10 minutes")
)

# Same aggregation logic as batch — familiar, testable
hourly = (
    parsed
    .groupBy(
        F.window("event_timestamp", "1 hour"),
        "page_url"
    )
    .agg(
        F.count("*").alias("view_count"),
        F.approx_count_distinct("user_id").alias("unique_users"),
        F.avg("time_on_page_seconds").alias("avg_time_on_page"),
    )
)

# Trigger every 5 minutes — not true streaming, not daily batch.
# The file sink only supports append mode; with the watermark set,
# each window is written once after it closes.
query = (
    hourly.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "s3://data-warehouse/page_view_hourly/")
    .option("checkpointLocation", "s3://checkpoints/page-view-hourly/")
    .trigger(processingTime="5 minutes")
    .start()
)

This gives you 5-minute data freshness using the same Spark APIs your team already knows. The processing model is still batch internally — Spark reads a micro-batch of records, processes them as a DataFrame, writes the output. You get the latency benefits of "near real-time" without the operational complexity of a true streaming engine. If it fails, the checkpoint lets you resume from where you left off. If you need to reprocess, you can still run a standard batch job over the historical data.

For even simpler cases, consider just running your existing batch pipeline on a tighter schedule:

-- dbt model running every 15 minutes via Airflow/cron
-- Incremental: only processes new events since last run

{{ config(
    materialized='incremental',
    unique_key='page_url_hour',
    incremental_strategy='merge'
) }}

SELECT
    page_url,
    DATE_TRUNC('hour', event_timestamp) AS event_hour,
    page_url || '_' || DATE_TRUNC('hour', event_timestamp)::text AS page_url_hour,
    COUNT(*) AS view_count,
    COUNT(DISTINCT user_id) AS unique_users,
    AVG(time_on_page_seconds) AS avg_time_on_page,
    CURRENT_TIMESTAMP AS processed_at
FROM {{ source('raw', 'page_view_events') }}
{% if is_incremental() %}
WHERE event_timestamp >= (SELECT MAX(event_hour) FROM {{ this }})
{% endif %}
GROUP BY page_url, DATE_TRUNC('hour', event_timestamp)

Fifteen-minute freshness. No Kafka. No Flink. No checkpoint management. Your existing dbt project and Airflow scheduler handle it. The incremental model only processes new data, so it runs in seconds even at scale. This is the solution that nobody gets promoted for building, and it's the right answer for most teams.

Lambda vs Kappa: What Actually Happens in Practice

Every streaming architecture talk eventually brings up the Lambda architecture (batch + streaming layers) versus the Kappa architecture (streaming only). The theory is clean. The practice is messy.

Lambda Architecture in Reality

Lambda says: run a batch pipeline for accuracy and a streaming pipeline for speed. Merge the results at query time. The textbook criticism is "you're maintaining two pipelines that do the same thing." That criticism is valid. But here's what the textbooks don't mention: almost every mature data platform I've seen ends up as an accidental Lambda architecture anyway. You have a streaming pipeline for "real-time" metrics, and then someone needs to backfill historical data, so they write a batch job. Now you have two pipelines. The question isn't whether you'll have both — it's whether you planned for it.

Kappa Architecture in Reality

Kappa says: just use streaming for everything. Replay the Kafka log for reprocessing. It sounds elegant. In practice, I've seen it break down for three reasons:

  1. Kafka retention costs. Keeping months of raw events in Kafka (or Kafka-tiered storage) gets expensive fast. Most teams set retention to 7-30 days, which means you can't replay from the beginning of time.
  2. Reprocessing at scale is slow. Replaying 6 months of events through a Flink job that was designed for real-time throughput means either massively over-provisioning or waiting days. Spark would chew through the same data in hours using batch-optimized reads from a data lake.
  3. Logic changes require full replay. Changed your aggregation window from 1 hour to 30 minutes? In Kappa, you need to replay all historical data through the new logic. In batch, you just rerun the backfill with the new code.

My pragmatic advice: build batch-first. If and when you identify a use case that genuinely needs sub-minute latency, add a streaming layer for that specific use case. Accept that you'll have two pipelines for that data domain. Make sure they both write to the same output schema so the serving layer doesn't care which one produced the data. This isn't architecturally pure, but it works, it's debuggable, and your on-call engineers will thank you.

The Decision Framework

When a stakeholder comes to me with a "we need real-time" request, I walk them through this sequence of questions. It's saved me from over-engineering more than once.

  1. What is the business action that depends on this data? If the answer is "we look at a dashboard" — that's not real-time, that's reporting. Batch or micro-batch will do.
  2. What's the actual required latency? Push them to give you a number. "Real-time" isn't a number. Is it 100ms? 1 second? 5 minutes? 1 hour? The answer usually isn't what you expect. Most "real-time" requests are really "faster than the current 24-hour batch."
  3. What's the cost of stale data, in dollars? If 15-minute-old data and 1-second-old data produce the same business outcome, you don't need streaming. If every second of delay costs measurable money (fraud, SLA violations, lost sales), that's a real streaming use case.
  4. Can your team operate it? Do you have engineers who understand Kafka consumer groups, Flink state backends, watermark propagation, and checkpoint recovery? If the answer is "we'll learn," budget 3-6 months of painful ramp-up and expect production incidents during that period.
  5. What happens when it breaks? Batch recovery is "rerun the job." Streaming recovery might mean restoring from a checkpoint that's 2 hours old and accepting data duplication or loss for the gap. If your business can't tolerate that, you need exactly-once semantics, which adds another layer of complexity and cost.

If the answers to questions 1-3 all point to "yes, we need sub-second freshness, and stale data has a quantifiable cost," and the answer to 4 is "yes, we have the expertise," then build a streaming pipeline. Otherwise, start with micro-batch and iterate.

What I'd Tell My Younger Self

If I could go back to the me who was pitching that streaming platform three years ago, I'd say this:

The goal is not to build the most technically impressive architecture. The goal is to deliver reliable data at the freshness the business actually needs, at a cost the business can sustain, maintained by the team you actually have. Batch is not a lesser technology — it's a more mature, more predictable, more cost-effective one. Choose streaming when the business case demands it, not when the conference talks inspire it.

The best data engineers I know aren't the ones who can build the most complex streaming topology. They're the ones who can look at a problem and choose the simplest architecture that solves it. Sometimes that's Kafka and Flink. Usually it's Airflow and Spark, running a little more frequently than before.

Practical Recommendations

For teams evaluating batch vs streaming architecture in 2026, here's what I'd actually do:

  • Default to batch. Start every new pipeline as a batch job. Optimize for correctness and simplicity first.
  • Increase frequency before switching paradigms. If daily isn't fresh enough, try hourly. If hourly isn't enough, try every 15 minutes with incremental models. You'll be surprised how far this takes you.
  • Use micro-batch as the stepping stone. Spark Structured Streaming with a trigger interval of 1-5 minutes gives you near-real-time with batch-like operational characteristics.
  • Reserve true streaming for the edges. Fraud detection, real-time personalization, operational monitoring, IoT alerting. These are valid. Internal analytics dashboards are not.
  • Budget for the full cost. If you do adopt streaming, budget 3-5x your batch infrastructure cost and 2-3x your engineering headcount for that pipeline. If the business case still makes sense at those numbers, go for it.
  • Keep a batch fallback. Even with streaming, maintain the ability to reprocess data in batch. You will need it. Schema changes, bug fixes, new business logic — all require historical reprocessing that streaming handles poorly.

The streaming vs batch decision isn't about which technology is "better." It's about matching the architecture to the actual requirements, the actual budget, and the actual team. The right answer is almost always simpler than you think.
