Java for Data Engineers: When It Wins, When It Doesn’t

Python gets the hype, but the highest-throughput parts of modern data platforms—the stream processors, connectors, and long-lived services—are still dominated by the JVM. If you care about predictable latency, exactly-once semantics, and moving terabytes per hour without drama, Java is a power move.


Why this matters (for an intermediate DE)

  • Most serious streaming stacks (Flink, Kafka Streams, Beam runners) are built in Java or expose first-class Java APIs.
  • The connector ecosystem (Kafka Connect, Debezium, many Iceberg/Delta/Hudi writers) is JVM-centric.
  • If your team runs always-on services—ingestion gateways, schema registries, feature services—Java gives you performance, reliability, and tooling that ops teams trust.

You don’t have to be a “Java shop” to benefit. Use Java where it wins, not everywhere.


Where Java fits in a data platform

1) Real-time streaming & stateful processing

  • Apache Flink: event-time windows, keyed state, CEP, exactly-once sinks.
  • Kafka Streams: lightweight library for per-service topologies—great for enrichment, joins, aggregations on Kafka topics.
  • Apache Beam (Java SDK): write once, run on Dataflow/Flink/Spark runners.

Use Java when: you need sub-second latency, big keyed state, or strict semantics; you want the widest operator/sink coverage and best performance.
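
For Beam specifically, a minimal Java sketch of the “write once, run on any runner” idea. Assumptions: string-serialized order events keyed by customer id, a KAFKA env var for the brokers, and the Beam Kafka IO module on the classpath.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class OrdersPerCustomer {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadOrders", KafkaIO.<String, String>read()
            .withBootstrapServers(System.getenv("KAFKA"))
            .withTopic("orders")                              // assumed topic, keyed by customer id
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())                               // PCollection<KV<String, String>>
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply(Count.perKey());                               // orders per customer per 5-minute window
        // ...apply a sink of your choice (KafkaIO.write(), FileIO, BigQueryIO, ...)

    p.run();  // the same code targets the Dataflow, Flink, or Spark runner via pipeline options
  }
}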


2) Connectors, ingestion, and CDC

  • Kafka Connect + Debezium (CDC) run on the JVM. Many enterprise connectors (JDBC, S3, Snowflake, Elasticsearch, etc.) are Java.
  • High-throughput producers/consumers (Avro/Protobuf + Schema Registry) are most mature in Java.

Use Java when: building custom connectors, transforming CDC streams, or pushing 100k+ msgs/sec reliably.
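
A sketch of what the reliable, high-throughput producer side tends to look like. Assumptions: an Avro-generated Order class with a getOrderId() accessor, an "orders" topic, and KAFKA / SCHEMA_REGISTRY env vars.

import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderPublisher {

  public static KafkaProducer<String, Order> buildProducer() {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, System.getenv("KAFKA"));
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
    props.put("schema.registry.url", System.getenv("SCHEMA_REGISTRY"));
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // no duplicates on producer retries
    props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for in-sync replicas
    props.put(ProducerConfig.LINGER_MS_CONFIG, "20");            // small batching window for throughput
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");    // cheap CPU, much less network
    return new KafkaProducer<>(props);
  }

  public static void publish(KafkaProducer<String, Order> producer, Order order) {
    // key by the business id so partitioning (and downstream joins) stays stable
    producer.send(new ProducerRecord<>("orders", order.getOrderId(), order));
  }
}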


3) Batch at scale (when you need it)

  • Spark’s Java API exists (Scala is usually nicer), and Beam covers batch pipelines well.
  • For table formats like Apache Iceberg/Delta/Hudi, Java libraries and catalog clients are first-class.

Use Java when: you need the table-format APIs directly from services, or you’re standardizing on Beam.
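
For Iceberg, a minimal sketch of hitting the table-format API directly from a service. The REST catalog, namespace, table, and column names are assumptions; Glue/Hive/Hadoop catalogs expose the same Catalog interface.

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.rest.RESTCatalog;

import java.util.Map;

public class IcebergPeek {
  public static void main(String[] args) {
    RESTCatalog catalog = new RESTCatalog();
    catalog.initialize("prod", Map.of(
        "uri", System.getenv("ICEBERG_CATALOG_URI"),   // assumed env var
        "warehouse", "s3://my-bucket/warehouse"));      // assumed warehouse location

    Table orders = catalog.loadTable(TableIdentifier.of("analytics", "orders"));
    System.out.println(orders.schema());                // column names/types straight from the catalog

    // plan a scan: which data files would a filtered read touch?
    orders.newScan()
          .filter(Expressions.equal("customer_id", "c-123"))
          .planFiles()
          .forEach(task -> System.out.println(task.file().path()));
  }
}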


4) Data platform services

  • Spring Boot + gRPC/HTTP for ingestion gateways, data-contract services, schema registries, feature serving, authorization and quota checks.
  • Strong observability (Micrometer, OpenTelemetry), deployment ergonomics, and long-term maintainability.

Use Java when: you want a small, boring service to run for years with consistent memory and GC behavior.
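
A sketch of the edge of such a service with Spring Boot and spring-kafka. OrderProto.Order is a hypothetical Protobuf-generated class, and the endpoint and topic names are assumptions.

import com.google.protobuf.InvalidProtocolBufferException;
import org.springframework.http.ResponseEntity;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class IngestController {

  private final KafkaTemplate<String, byte[]> kafka;

  public IngestController(KafkaTemplate<String, byte[]> kafka) {
    this.kafka = kafka;
  }

  @PostMapping(path = "/v1/orders", consumes = "application/x-protobuf")
  public ResponseEntity<Void> ingest(@RequestBody byte[] payload) {
    final OrderProto.Order order;
    try {
      order = OrderProto.Order.parseFrom(payload);      // rejects malformed payloads
    } catch (InvalidProtocolBufferException e) {
      return ResponseEntity.badRequest().build();
    }
    if (order.getOrderId().isEmpty()) {                 // minimal data-contract check
      return ResponseEntity.unprocessableEntity().build();
    }
    kafka.send("orders", order.getOrderId(), payload);  // key by business id
    return ResponseEntity.accepted().build();
  }
}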


When not to use Java

  • Ad-hoc analytics and one-off jobs → SQL/dbt or Python.
  • Data science/ML research → Python.
  • Spark-heavy teams that prioritize dev ergonomics → Scala (typed Datasets) or PySpark (faster iteration) usually beats Java.

Core ecosystem (DE-relevant)

  • Build: Gradle or Maven.
  • Serialization: Avro, Protobuf, JSON; Schema Registry for evolution.
  • Streaming: Flink, Kafka Streams, Beam.
  • Messaging/Storage: Kafka, Kinesis, Pub/Sub; S3/GCS/ADLS; Iceberg/Delta/Hudi; Parquet/ORC.
  • Observability: Micrometer, OpenTelemetry, Prometheus, JFR.
  • CI/CD: JUnit 5, Testcontainers, WireMock; Docker images with distroless or Alpine (see the Testcontainers sketch below).
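
The Testcontainers sketch referenced above; the image tag and test contents are placeholders.

import org.junit.jupiter.api.Test;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

@Testcontainers
class OrdersPipelineIT {

  @Container
  static final KafkaContainer KAFKA =
      new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.6.1"));

  @Test
  void pipelineRoundTrip() {
    String bootstrap = KAFKA.getBootstrapServers();
    // build the real producer/consumer (or topology) against `bootstrap`
    // and assert on the output topic: no mocks, no shared test cluster
  }
}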

Architecture patterns (quick sketches)

  1. CDC → Kafka → Flink → Iceberg
    • Debezium streams DB changes into Kafka.
    • Flink enriches + upserts to Iceberg tables on S3.
    • Downstream: Trino/Presto/Spark/warehouse query those tables.
  2. API Gateway → Kafka Streams → Snowflake
    • Spring Boot service validates payloads (Protobuf, contract checks) and publishes to Kafka.
    • Kafka Streams performs joins/aggregations and writes to sink topics.
    • A connector (or Snowpipe-style ingress) loads curated topics into Snowflake.
  3. Feature Service (online)
    • Java service reads features from Flink-managed keyed state (RocksDB) or Redis and serves them over gRPC with strict p95 latency.

Code: two tiny, realistic examples

A. Kafka Streams topology (enrich orders with customers)

public class EnrichTopology {
  public Topology build(Serde<String> keySerde,
                        Serde<Order> orderSerde,
                        Serde<Customer> customerSerde,
                        Serde<EnrichedOrder> enrichedSerde) {

    StreamsBuilder b = new StreamsBuilder();

    KStream<String, Order> orders =
        b.stream("orders", Consumed.with(keySerde, orderSerde));

    KTable<String, Customer> customers =
        b.table("customers", Consumed.with(keySerde, customerSerde));

    KStream<String, EnrichedOrder> enriched =
        orders.join(customers,
            (order, customer) -> EnrichedOrder.from(order, customer),
            Joined.with(keySerde, orderSerde, customerSerde));

    enriched.to("orders_enriched", Produced.with(keySerde, enrichedSerde));
    return b.build();
  }
}

Why this is good: small, deployable, exactly-once with proper configs, easy to test with TopologyTestDriver.
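
For example, a test sketch with TopologyTestDriver (kafka-streams-test-utils on the classpath; the serdes and the aCustomer()/anOrder() fixture helpers are assumed to exist):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrich-test");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234"); // never contacted by the test driver

Topology topology = new EnrichTopology().build(keySerde, orderSerde, customerSerde, enrichedSerde);

try (TopologyTestDriver driver = new TopologyTestDriver(topology, props)) {
  TestInputTopic<String, Customer> customers =
      driver.createInputTopic("customers", keySerde.serializer(), customerSerde.serializer());
  TestInputTopic<String, Order> orders =
      driver.createInputTopic("orders", keySerde.serializer(), orderSerde.serializer());
  TestOutputTopic<String, EnrichedOrder> out =
      driver.createOutputTopic("orders_enriched", keySerde.deserializer(), enrichedSerde.deserializer());

  customers.pipeInput("c-123", aCustomer("c-123")); // table side first so the join finds the customer
  orders.pipeInput("c-123", anOrder("c-123"));

  assertEquals(1, out.readKeyValuesToList().size()); // joined record landed on the output topic
}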


B. Flink job with event-time windows

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().enableObjectReuse();

Properties props = new Properties();
props.put("bootstrap.servers", System.getenv("KAFKA"));
// FlinkKafkaConsumer is the legacy connector; on newer Flink versions prefer KafkaSource
FlinkKafkaConsumer<Order> source =
    new FlinkKafkaConsumer<>("orders", new OrderDeserSchema(), props);
source.setStartFromLatest();

DataStream<Order> orders = env
    .addSource(source)
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
          .<Order>forBoundedOutOfOrderness(Duration.ofSeconds(30))
          .withTimestampAssigner((o, ts) -> o.eventTimeMs()));

orders
  .keyBy(Order::customerId)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .reduce(Order::merge)
  // IcebergSink is a stand-in; the Iceberg-Flink integration provides builder-style sinks (e.g. FlinkSink)
  .addSink(new IcebergSink<>(/* catalog + table conf */));

env.execute("orders-5m-agg");

Why this is good: proper event time, bounded lateness, keyed state, and a clear place to hook an Iceberg sink.
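
To make a job like this exactly-once end to end, checkpointing also has to be enabled on the same env before env.execute(); a minimal sketch with illustrative intervals:

env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE); // align with transactional sinks
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);  // breathing room between checkpoints
env.getCheckpointConfig().setCheckpointTimeout(10 * 60_000);      // give large state time to snapshot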


Best practices (hard-earned)

  • Prefer Protobuf/Avro + Schema Registry. Enforce compatibility in CI. JSON for debug only.
  • Treat time seriously. Use event time and watermarks; wall-clock time corrupts aggregations.
  • Exactly-once is config, not magic. Align producer acks, idempotent producers, transactional writes, and checkpointing (see the config sketch after this list). Test failovers.
  • Manage GC & memory. Fixed heap sizes, G1/ZGC, object reuse in Flink (enableObjectReuse()), avoid accidental boxing.
  • Backpressure wins over dropping. Monitor operator latencies; size thread pools; set bounded queues.
  • Container discipline. Small alpine/distroless images; JDK 17+; container support is enabled by default, so don’t turn it off with -XX:-UseContainerSupport.
  • Observability first. Emit business metrics (events/sec, lag, dedupe rate), not just CPU. Trace critical paths.
  • Contract tests. Spin up Kafka, Schema Registry, and sinks via Testcontainers in CI. Break the build on incompatible schemas or sink errors.
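
The exactly-once configuration mentioned above is mostly a handful of settings that have to agree; a sketch with illustrative values:

// Kafka Streams: one switch turns on transactions + idempotent producers under the hood
Properties streams = new Properties();
streams.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
streams.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "100");              // commit cadence = end-to-end latency

// Plain producer feeding a transactional pipeline
Properties producer = new Properties();
producer.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
producer.put(ProducerConfig.ACKS_CONFIG, "all");
producer.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-ingest-1");  // stable and unique per instance

// Downstream consumers must not see aborted transactions
Properties consumer = new Properties();
consumer.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");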

Common pitfalls (avoid these)

  • “At-least-once is fine.” Then you join duplicates and corrupt revenue. Implement idempotency keys.
  • Ignoring watermarking. Late data will either bloat state or vanish from windows.
  • Huge fat JARs with dependency hell. Shade wisely, keep versions aligned with the runtime (Flink/Beam provide BOMs).
  • Mixing business logic and plumbing. Keep serde/connectors/config separate from transforms and aggregations.
  • Verbose configs copy-pasted across services. Centralize defaults with Spring Config or library modules.

Java vs Scala vs Python (quick comparison)

  Use case               | Java                            | Scala                  | Python
  -----------------------+---------------------------------+------------------------+--------------------------------
  Flink / Kafka Streams  | Best API coverage, perf         | Also strong (Flink)    | Limited / wrappers
  Beam                   | Mature SDK                      | OK                     | Good ergonomics, slower runners
  Spark                  | Works, verbose                  | Best (typed Datasets)  | Most popular for dev speed
  Long-running services  | Best (Spring Boot, Micrometer)  | Good                   | Possible, but ops heavier
  Ad-hoc & ML            | Meh                             | Meh                    | Best

Where this meets warehouses (Snowflake/BigQuery/etc.)

  • Use connectors to land curated Kafka topics into your warehouse.
  • For Snowflake, pair streaming with Snowpipe-style ingestion or a vendor connector; keep schemas compatible (Protobuf/Avro → warehouse types).
  • Preserve record keys, event time, and change flags (INSERT/UPDATE/DELETE) so downstream ELT stays correct; a small envelope sketch follows.
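
One lightweight way to keep that metadata attached end to end is an explicit envelope type; a hypothetical sketch using a JDK 17+ record:

public record ChangeEvent<T>(
    String key,        // source primary key, preserved end to end
    long eventTimeMs,  // when it happened, not when we saw it
    Op op,             // change flag from CDC
    T payload) {

  public enum Op { INSERT, UPDATE, DELETE }
}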

Conclusion: Keep Java in your toolbox

If your platform does real-time work, needs custom ingestion, or runs always-on data services, Java pays for itself with performance and reliability. Use SQL and Python for what they’re great at; reach for Java when throughput, latency, and uptime matter more than scripting speed.