Scala for Data Engineers: When (and Why) It Beats Python
A practical guide to Scala for data engineers—where it shines (Spark, Flink, Kafka, Scio), how to structure projects, performance tips, and real-world code examples.
Why this matters
Most data teams default to Python. Fine. But when you live on distributed systems (Spark, Flink, Kafka Streams, Beam/Scio), Scala often gives you earlier API access, tighter typing, better performance, and fewer runtime surprises. If your pipelines push high throughput, demand low latency, or carry platform-critical jobs, you should at least be bilingual.
What Scala is (for a data engineer)
- JVM language with functional + OO features, great concurrency tooling, and first-class support in major data engines.
- Type safety catches at compile time a class of bugs that, with dynamic languages, you only find in prod (a small sketch follows this list).
- Interoperability with Java gives you access to the huge JVM ecosystem (connectors, clients, SDKs).
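A small, contrived sketch of what that buys you (the Event model is made up for illustration):
final case class Event(userId: String, amount: Option[Double]) // hypothetical model

// Nullability is explicit in the type. Forgetting the None branch is a
// compiler warning (an error with -Xfatal-warnings), not a surprise in prod.
def billable(e: Event): Double = e.amount match {
  case Some(a) => a
  case None    => 0.0
}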
Where you actually use Scala
- Apache Spark (batch + streaming): richest APIs, best coverage for advanced features.
- Apache Flink: low-latency streaming and stateful processing at scale.
- Kafka Streams / Akka Streams: lightweight streaming services without a cluster.
- Apache Beam via Scio (Spotify): Scala-first API over Beam runners (Dataflow, Spark, Flink).
- Connector & platform services: when you need JVM performance and library support.
(Snowflake note: Snowpark supports Scala/Java/Python. Use Scala mainly if you’re already JVM-first or sharing code with Spark/Flink.)
Quick comparison: Python vs. Scala in DE
| Use case | Python | Scala |
|---|---|---|
| Prototyping, glue, orchestration | ✅ Fast to write | ⚠️ Overkill |
| Spark feature depth & perf | ⚠️ Good, sometimes lagging | ✅ Best API coverage |
| Millisecond-latency streams (Flink/Kafka) | ⚠️ Possible with PyFlink, limited | ✅ Common/idiomatic |
| Strict schemas, compile-time safety | ⚠️ Runtime errors | ✅ Compile-time guarantees |
| Hiring/onboarding | ✅ Larger pool | ⚠️ Smaller pool, steeper curve |
Project layout that doesn’t rot (sbt)
my-pipeline/
  build.sbt
  project/plugins.sbt
  src/
    main/scala/com/acme/jobs/ImpressionsDaily.scala
    main/resources/
    test/scala/...
build.sbt (Spark example):
ThisBuild / scalaVersion := "2.13.12"

lazy val root = (project in file("."))
  .settings(
    name := "impressions-pipeline",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"  % "3.5.1" % Provided,
      "org.apache.spark" %% "spark-core" % "3.5.1" % Provided,
      "com.github.pureconfig" %% "pureconfig" % "0.17.6",
      "org.typelevel" %% "cats-core" % "2.10.0"
    )
  )
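For completeness, a sketch of project/plugins.sbt and the assembly settings you typically add when shipping fat jars; the plugin version and the shaded package are illustrative, check current releases and your own conflict list:
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5") // version illustrative

// build.sbt additions for the fat jar
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard // blunt but common; keep service files if a connector needs them
  case _                             => MergeStrategy.first
}
// Relocate a conflicting dependency (Guava here) instead of fighting the cluster classpath
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.guava.@1").inAll
)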
Example 1: Spark (Scala) daily aggregation to Delta/Iceberg
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions._

case class Impression(ts: String, user_id: String, campaign_id: String, cost: Double)

object ImpressionsDaily {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("impressions-daily").getOrCreate()
    import spark.implicits._

    // Strongly-typed schema via the case class: derive the read schema from it
    // so `.as[Impression]` can't silently drift from the files on disk
    val raw = spark.read
      .option("header", "true")
      .schema(Encoders.product[Impression].schema)
      .csv("s3://raw/impressions/*.csv")
      .as[Impression]

    val df = raw
      .withColumn("event_time", to_timestamp($"ts"))
      .withColumn("event_date", to_date($"event_time"))
      .groupBy($"event_date", $"campaign_id")
      .agg(
        countDistinct($"user_id").as("unique_users"),
        sum($"cost").as("spend"),
        count(lit(1)).as("impressions")
      )

    // Use whichever table format your platform standardizes on
    df.write
      .format("delta") // or "iceberg"
      .mode("overwrite")
      .partitionBy("event_date")
      .saveAsTable("ads.impressions_daily")

    spark.stop()
  }
}
Why Scala here? The typed Dataset (as[Impression]) catches schema breakage at compile time. Advanced functions and optimizations typically land first in Scala/Java.
Example 2: Flink (Scala) real-time dedup + windowing
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class Click(userId: String, ts: Long, url: String)

object ClicksStream {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // The Scala API derives serializers for case classes via this macro
    implicit val clickInfo: TypeInformation[Click] = createTypeInformation[Click]

    val deduped = env
      .addSource(new KafkaClickSource("clicks")) // pseudo, replace with your Kafka source
      .keyBy(_.userId)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .reduce { (a, b) =>
        if (a.ts >= b.ts) a else b // keep only the latest event per user within the window
      }

    deduped.addSink(new KafkaClickSink("clicks_dedup")) // pseudo, replace with your Kafka sink
    env.execute("clicks-dedup")
  }
}
Why Scala here? Flink’s type system + Scala case classes make keyed state/window ops safer and faster than ad-hoc Python dict juggling.
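The window above uses processing time for brevity. For event time you would attach a watermark strategy to the source stream first; a minimal sketch reusing the Click model from the example (the five-second out-of-orderness bound is an assumption to tune for your data):
import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala._

object ClickWatermarks {
  // Attach event-time semantics to an existing Click stream (Click as defined above)
  def withEventTime(clicks: DataStream[Click]): DataStream[Click] =
    clicks.assignTimestampsAndWatermarks(
      WatermarkStrategy
        .forBoundedOutOfOrderness[Click](Duration.ofSeconds(5)) // assumed bound
        .withTimestampAssigner(new SerializableTimestampAssigner[Click] {
          override def extractTimestamp(c: Click, recordTs: Long): Long = c.ts
        })
    )
}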
Example 3: Beam via Scio (Google Dataflow runner)
import com.spotify.scio._
import com.spotify.scio.values.SCollection

object WordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    // ContextAndArgs splits Beam/Dataflow pipeline options from job arguments
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    val counts: SCollection[(String, Long)] =
      sc.textFile("gs://data/input.txt")
        .flatMap(_.split("\\W+"))
        .filter(_.nonEmpty)
        .countByValue

    counts
      .map { case (w, c) => s"$w,$c" }
      .saveAsTextFile("gs://data/out/wordcount")

    sc.run() // submit the pipeline to the configured runner
  }
}
Why Scala here? Scio gives a concise, composable API on top of Beam with compile-time types and excellent IO transforms.
Patterns that pay off
- Case classes for schemas: encode structure and nullability with Option[T]; works with Spark Datasets and most serializers.
- PureConfig (or Typesafe Config) for configuration: load config into typed case classes, so there's no stringly-typed config drift (sketch after this list).
- Cats / ZIO (functional patterns): safer effects, retries, and resource handling. Use judiciously; don't drown newcomers in typeclasses.
- Module boundaries: core (models + utils, no engine deps), spark-job / flink-job (engine-specific code), connectors (Kafka/S3/…). Keeps compile times down and dependencies clean.
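A minimal sketch of the typed-config pattern with PureConfig; the JobConfig fields are invented for illustration:
import pureconfig._
import pureconfig.generic.auto._ // Scala 2 automatic derivation

// Hypothetical config model; Option[T] marks genuinely optional settings
final case class KafkaConfig(bootstrapServers: String, groupId: String)
final case class JobConfig(inputPath: String, outputTable: String, kafka: Option[KafkaConfig])

object AppConfig {
  // Reads application.conf; fails fast with a readable error list if keys or types don't match
  def load(): JobConfig = ConfigSource.default.loadOrThrow[JobConfig]
}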
Performance & reliability tips
- Spark
- Prefer DataFrame/Dataset over RDDs.
- Avoid UDFs where built-in Spark SQL functions or a TypedColumn will do; if you must write one, register it in Scala/Java (faster than Python UDFs).
- Partition wisely; broadcast small tables, avoid groupByKey, and cache strategically.
- Enable Kryo serialization and set explicit encoders for complex types (a sketch follows these tips).
- For streaming, configure triggers and checkpointing correctly; keep state stores small via TTL.
- Flink
- Define watermarks and time characteristics explicitly.
- Keep keyed state bounded; use TTL and compaction.
- Use RocksDB state backend for large state; tune checkpoint intervals.
- Build & deploy
- Pin Scala + Spark versions—binary compatibility matters (2.12 vs 2.13; Scala 3 still catching up across some libs).
- Shade/relocate conflicting deps (Jackson, Guava) in fat jars.
- Package as Docker with a slim JRE; pass config via env or files.
- CI: run property-based tests for transformations and mini-cluster/integration tests for jobs.
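A sketch of the Spark knobs above wired into a session; the thresholds, paths, and join key are placeholders, not recommendations:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object TunedJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("tuned-job")
      // Kryo is usually faster and more compact than Java serialization for shuffles
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Illustrative values; tune against your own workload
      .config("spark.sql.shuffle.partitions", "400")
      .config("spark.sql.autoBroadcastJoinThreshold", (64 * 1024 * 1024).toString)
      .getOrCreate()

    // Explicit broadcast hint for a small dimension table (placeholder paths and key)
    val facts  = spark.read.parquet("s3://lake/facts/")
    val dims   = spark.read.parquet("s3://lake/dims/")
    val joined = facts.join(broadcast(dims), Seq("dim_id"))

    joined.write.mode("overwrite").parquet("s3://lake/marts/joined/") // placeholder output
    spark.stop()
  }
}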
When not to use Scala
- Your team is mostly analysts/ML engineers shipping lightweight jobs—Python is faster to iterate.
- You need a quick one-off glue task, API scrape, or orchestration script.
- Your managed platform’s primary SDK is Python and you’re not touching performance-sensitive paths.
Hiring & maintainability reality
- Expect a smaller pool than Python/Java. Offset with clear style guides, code reviews focused on readability over cleverness, and onboarding notebooks that show how to run jobs locally and in CI.
- Keep the functional magic approachable. Pattern matching and Option are great; ten nested typeclasses will tank adoption.
Suggested diagrams (for your blog)
- “Language-to-engine” map: Spark ↔ Scala/Python, Flink ↔ Scala/Java, Beam(Scio) ↔ Scala, Kafka Streams ↔ Scala/Java.
- Job architecture: Kafka → Flink (dedup, enrich) → Iceberg → Spark batch marts → BI.
Takeaways
- If your platform is Spark/Flink/Kafka-heavy, Scala pays back with type safety, performance, and API depth.
- Use Scala for core pipelines and streaming services where latency and correctness matter.
- Keep Python for orchestration, ML, and glue—this isn’t religion; it’s economics.
- Invest in build hygiene, schemas as code, and tests to keep Scala approachable for the wider team.




