Rust for Data Engineering: Why Polars, DataFusion, and Delta-rs Are Just the Beginning

Six months ago, I would have told you that Rust was a systems programming language with no real place in a data engineer's toolkit. I had my Python, my Spark clusters, my Airflow DAGs — why would I need a language famous for fighting the borrow checker? Then I tried Polars on a pipeline that was taking 47 minutes in pandas. It finished in 3 minutes and 12 seconds. That moment broke something in my brain, and I've been falling down the Rust data engineering rabbit hole ever since.

This isn't a "rewrite everything in Rust" manifesto. It's an honest account of what I've found: where Rust genuinely transforms data engineering work, where it's overkill, and why the ecosystem is growing faster than most people realize.

Why Rust Is Eating Data Tooling

Before diving into specific tools, it helps to understand why Rust keeps showing up in data infrastructure. Three properties make it uniquely suited for the work.

Predictable Performance Without a Garbage Collector

Python's GC pauses are mostly invisible in web apps. In data pipelines processing millions of rows per second, they're not. I've seen Spark jobs where GC overhead consumed 30% of total runtime. Rust's ownership model handles memory deterministically — objects are freed the moment they go out of scope. No stop-the-world pauses, no tuning GC generations, no surprises at 3 AM when your batch job runs 4x slower because the heap got fragmented.

Memory Safety Without Runtime Cost

Data pipelines deal with messy inputs: null values in unexpected columns, malformed timestamps, integer overflows on aggregation. In C/C++, these bugs cause segfaults or silent data corruption. In Python, they cause exceptions (if you're lucky) or wrong results (if you're not). Rust catches most of these at compile time. The Option type forces you to handle nulls. Integer overflow panics in debug mode. It's annoying during development but means your pipeline doesn't silently produce garbage at 2 AM.

Zero-Cost Abstractions and Parallelism

Rust's rayon crate gives you data parallelism with a single method call change — swap .iter() for .par_iter() and your code runs across all cores. No GIL. No multiprocessing overhead. No serialization between processes. This matters enormously for data engineering, where workloads are often embarrassingly parallel.

The Rust Data Ecosystem: A Tour

The ecosystem is larger than most people think. Here's what's mature enough to use in production today, and what's promising but early.

Apache Arrow-rs: The Foundation

Almost everything in this article builds on arrow-rs, the Rust implementation of Apache Arrow. It provides the columnar memory format that enables zero-copy data sharing between tools. If you're writing any data processing code in Rust, you'll interact with Arrow arrays and record batches constantly.

use arrow::array::{Float64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use std::sync::Arc;

fn create_batch() -> RecordBatch {
    let schema = Schema::new(vec![
        Field::new("city", DataType::Utf8, false),
        Field::new("temperature", DataType::Float64, false),
    ]);

    let cities = StringArray::from(vec!["Moscow", "Berlin", "Tokyo", "São Paulo"]);
    let temps = Float64Array::from(vec![-12.5, 2.3, 8.1, 28.4]);

    RecordBatch::try_new(
        Arc::new(schema),
        vec![Arc::new(cities), Arc::new(temps)],
    )
    .expect("Failed to create RecordBatch")
}

Arrow-rs is battle-tested. It's a core Apache project with active maintenance. This is not experimental software.

Polars: The DataFrame Library That Started It All

Polars is what got most data engineers to pay attention to Rust. Built by Ritchie Vink, it's a DataFrame library that routinely benchmarks 10-50x faster than pandas, with a query optimizer that rewrites your expressions for maximum performance.

What makes Polars special isn't just speed — it's the lazy evaluation engine. When you chain operations, Polars builds a logical plan, optimizes it (predicate pushdown, projection pruning, join reordering), and only executes when you call .collect(). It's essentially a query planner for DataFrames.

use polars::prelude::*;

fn analyze_transactions() -> PolarsResult<DataFrame> {
    let df = LazyFrame::scan_parquet("transactions/*.parquet", Default::default())?
        .filter(col("amount").gt(lit(1000)))
        .group_by([col("merchant_category")])
        .agg([
            col("amount").sum().alias("total_spend"),
            col("amount").mean().alias("avg_transaction"),
            col("transaction_id").count().alias("num_transactions"),
        ])
        .sort(
            ["total_spend"],
            SortMultipleOptions::default().with_order_descending(true),
        )
        .collect()?;

    Ok(df)
}

Most data engineers use Polars through its Python bindings, and that's perfectly fine. But writing Polars in native Rust gives you even more control — custom expressions, embedded pipelines, and the ability to compile your entire ETL into a single binary with no Python runtime dependency.

DataFusion: SQL Engine You Can Embed Anywhere

DataFusion is an extensible SQL query engine built on Arrow-rs. Think of it as SQLite for analytics — you can embed a full SQL engine in your Rust application. It supports reading from Parquet, CSV, JSON, and custom data sources. The query planner is sophisticated: it handles joins, window functions, CTEs, and subqueries.

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register a Parquet file as a table
    ctx.register_parquet(
        "events",
        "s3://data-lake/events/2026/01/*.parquet",
        ParquetReadOptions::default(),
    )
    .await?;

    // Run SQL queries directly
    let df = ctx
        .sql(
            "SELECT
                date_trunc('hour', event_time) AS hour,
                event_type,
                COUNT(*) AS event_count,
                COUNT(DISTINCT user_id) AS unique_users
             FROM events
             WHERE event_time >= '2026-01-01'
             GROUP BY 1, 2
             ORDER BY 1, 3 DESC",
        )
        .await?;

    // Write results back to Parquet
    df.write_parquet(
        "output/hourly_events.parquet",
        DataFrameWriteOptions::default(),
        None,
    )
    .await?;

    println!("Query complete.");
    Ok(())
}

I've started using DataFusion for data quality checks. Instead of spinning up a Spark cluster to validate a table, I run a Rust binary that reads the Parquet files directly and executes validation SQL. It starts in milliseconds, not minutes.

Delta-rs: Delta Lake Without the JVM

Delta-rs is a native Rust implementation of the Delta Lake protocol. This is huge. Before delta-rs, interacting with Delta tables required Spark or at minimum a JVM. Now you can read, write, optimize, and vacuum Delta tables from a lightweight Rust binary — or from Python via the deltalake package.

I use delta-rs in production for two things: compacting small files (the OPTIMIZE operation) and running VACUUM to clean up old versions. Both previously required a Spark cluster to be running. Now they're cron jobs that run in seconds.

Lance: The Newcomer for ML Data

Lance is a columnar data format optimized for ML workloads — think embeddings, images, and point cloud data. It's built in Rust, and its random access reads are reportedly up to 100x faster than Parquet for ML training data loading. If you work with vector embeddings or multimodal data, keep an eye on Lance.

Ballista: Distributed DataFusion

Ballista extends DataFusion to run across multiple nodes. It's essentially a distributed SQL engine in Rust. It's still maturing, but the architecture is sound — scheduler nodes distribute query fragments to executor nodes, with Arrow Flight for data transfer. This could eventually become a Rust-native alternative to Spark for certain workloads.

When to Write Rust vs. When to Use Python Bindings

This is the practical question that matters most. After six months of mixing Rust and Python in data work, here's my decision framework:

| Scenario | Recommendation | Why |
| --- | --- | --- |
| Ad-hoc analysis, exploration | Python (Polars/pandas) | Iteration speed matters more than runtime |
| Scheduled batch pipeline | Python with Polars | Good enough performance, easier to maintain |
| High-frequency streaming transform | Rust | Latency and memory predictability are critical |
| CLI tool for data ops | Rust | Single binary, instant startup, no dependency hell |
| Custom data format reader | Rust + PyO3 | Write the parser in Rust, expose Python API |
| Orchestration logic (Airflow/Dagster) | Python | Rust adds no value to "call these APIs in order" |
| Hot path in existing Python pipeline | Rust + PyO3 | Rewrite only the bottleneck, keep Python everywhere else |
| One-off migration script | Python | You'll run it once; compile times aren't worth it |

The pattern I've settled on: prototype in Python, profile, and rewrite the hot path in Rust if needed. Most pipelines don't need Rust. The ones that do see dramatic improvements.

PyO3: Building Python Extensions in Rust

PyO3 is the bridge that makes the "best of both worlds" strategy practical. It lets you write Rust functions and expose them as a Python module. Your team keeps writing Python; they just call your fast Rust function instead of the slow Python one.

Here's a real example. I had a pipeline step that parsed and validated 50 million semi-structured log lines. In Python with regex, it took 22 minutes. I rewrote just the parsing function in Rust:

// src/lib.rs
use pyo3::prelude::*;
use pyo3::types::PyDict;

#[pyfunction]
fn parse_log_lines(lines: Vec<String>) -> PyResult<Vec<PyObject>> {
    Python::with_gil(|py| {
        let mut results = Vec::with_capacity(lines.len());

        for line in &lines {
            let dict = PyDict::new(py);

            // Split on the first three pipe characters for our log format:
            // timestamp|level|service|message
            let parts: Vec<&str> = line.splitn(4, '|').collect();
            if parts.len() == 4 {
                dict.set_item("timestamp", parts[0].trim())?;
                dict.set_item("level", parts[1].trim())?;
                dict.set_item("service", parts[2].trim())?;
                dict.set_item("message", parts[3].trim())?;
                dict.set_item("valid", true)?;
            } else {
                dict.set_item("raw", line.as_str())?;
                dict.set_item("valid", false)?;
            }

            results.push(dict.into());
        }

        Ok(results)
    })
}

#[pymodule]
fn log_parser(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(parse_log_lines, m)?)?;
    Ok(())
}

And the Python side stays clean:

# pipeline.py
import log_parser  # our compiled Rust module

def process_logs(file_path: str) -> list[dict]:
    with open(file_path) as f:
        lines = f.readlines()

    # This call executes Rust code — 50M lines in ~90 seconds
    parsed = log_parser.parse_log_lines(lines)

    valid = [r for r in parsed if r["valid"]]
    invalid = [r for r in parsed if not r["valid"]]

    print(f"Parsed {len(valid)} valid, {len(invalid)} invalid lines")
    return valid

Build with maturin develop and the Rust code compiles into a .so that Python imports like any other module. The 22-minute step now runs in 90 seconds. My team didn't need to learn Rust — they just call log_parser.parse_log_lines().
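For completeness, here's the build configuration this setup assumes — a minimal Cargo.toml plus pyproject.toml for a maturin-built PyO3 module (version numbers are illustrative; pin whatever your project actually uses):

```toml
# Cargo.toml
[lib]
name = "log_parser"            # must match the #[pymodule] function name
crate-type = ["cdylib"]        # build a shared library Python can load

[dependencies]
pyo3 = { version = "0.20", features = ["extension-module"] }
```

```toml
# pyproject.toml
[build-system]
requires = ["maturin>=1.0,<2.0"]
build-backend = "maturin"
```

With these two files in place, `maturin develop` installs the module into your active virtualenv, and `maturin build --release` produces a wheel you can publish.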

A Simple ETL Pipeline in Rust

To give you a feel for what a complete (if small) ETL job looks like in Rust, here's a pipeline that reads CSV sales data, transforms it, and writes Parquet output:

use datafusion::prelude::*;
use std::time::Instant;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let start = Instant::now();
    let ctx = SessionContext::new();

    // Extract: read raw CSV
    ctx.register_csv(
        "raw_sales",
        "data/sales_2026_q1.csv",
        CsvReadOptions::new()
            .has_header(true)
            .schema_infer_max_records(10000),
    )
    .await?;

    // Transform: clean, aggregate, derive new columns
    let transformed = ctx
        .sql(
            "SELECT
                product_category,
                region,
                DATE_TRUNC('week', CAST(sale_date AS TIMESTAMP)) AS sale_week,
                COUNT(*) AS num_sales,
                ROUND(SUM(amount), 2) AS total_revenue,
                ROUND(AVG(amount), 2) AS avg_sale,
                ROUND(SUM(amount) / COUNT(DISTINCT customer_id), 2) AS revenue_per_customer
             FROM raw_sales
             WHERE amount > 0
               AND product_category IS NOT NULL
             GROUP BY 1, 2, 3
             ORDER BY sale_week, total_revenue DESC",
        )
        .await?;

    // Load: write optimized Parquet
    transformed
        .write_parquet(
            "output/weekly_sales_summary.parquet",
            DataFrameWriteOptions::default(),
            None,
        )
        .await?;

    let elapsed = start.elapsed();
    println!("ETL complete in {:.2}s", elapsed.as_secs_f64());

    Ok(())
}

This compiles to a single binary. No Python environment, no JVM, no Docker image with 47 dependencies. Copy it to a server, run it. That simplicity matters in production.

Performance: Rust vs. Python for Common Data Engineering Tasks

I ran benchmarks on my actual pipeline workloads (M2 MacBook Pro, 32GB RAM). These are not micro-benchmarks — they're realistic data engineering operations on real-world-sized datasets.

| Task | Python (pandas) | Python (Polars) | Rust (native) | Dataset size |
| --- | --- | --- | --- | --- |
| CSV parse + filter | 34.2s | 3.1s | 1.8s | 5GB, 80M rows |
| GroupBy aggregation | 28.7s | 2.4s | 1.1s | 50M rows, 10K groups |
| Join two tables | 41.3s | 4.8s | 2.9s | 20M x 5M rows |
| Parquet read + transform | 18.6s | 2.0s | 1.2s | 3GB Parquet |
| String parsing (logs) | 22min | 4.1min | 1.5min | 50M log lines |
| Peak memory (50M row agg) | 12.4GB | 3.8GB | 2.1GB | — |

The important column here is Python (Polars), not Rust (native). Polars in Python gets you 80-90% of native Rust performance with zero Rust knowledge. The remaining 10-20% only matters if you're running this hundreds of times per day or operating under strict latency constraints.

The biggest performance win isn't Rust vs. Python — it's moving from pandas to Polars, regardless of language. If you haven't tried Polars yet, start there. You don't need Rust to benefit from the Rust data ecosystem.

The Learning Curve: An Honest Reality Check

I won't sugarcoat it. Learning Rust as a data engineer is harder than learning Go, harder than learning Scala, and significantly harder than picking up another Python framework. Here's what tripped me up:

The Borrow Checker (Weeks 1-4)

The first month was painful. Every other line had a compiler error about lifetimes, borrowing, or moves. The compiler messages are excellent — Rust has the best error messages of any language I've used — but understanding why I couldn't do something took time. The conceptual shift from "variables are labels for objects" (Python) to "variables own data and that ownership is tracked" (Rust) is real.

The Type System (Weeks 2-6)

Rust's type system is more expressive than anything in the Python world. Generics, traits, associated types, lifetime annotations — it's a lot. For data engineering specifically, you'll encounter Arc<dyn Array> and Box<dyn ExecutionPlan> patterns constantly in the Arrow/DataFusion ecosystem. Understanding trait objects and dynamic dispatch takes practice.

The Async Runtime (Week 3+)

DataFusion and most I/O-heavy Rust code uses async/await with the Tokio runtime. If you've used Python's asyncio, the concepts are familiar but the implementation is different. Pinning, Send bounds, and 'static lifetimes in async contexts will confuse you. It confused me for weeks.

When It Clicks (Month 2-3)

Around month two, something shifted. The compiler went from adversary to pair programmer. When my code compiled, it usually worked correctly the first time. Not "works but has edge case bugs" — actually works. That experience is rare in programming and genuinely addictive.

My practical advice: spend your first two weeks on The Rust Book and Rustlings exercises. Don't try to build data pipelines on day one. Get comfortable with ownership, borrowing, and pattern matching first. Then move to the data crates.

When Rust Is Overkill

I want to be direct about this because the Rust community sometimes glosses over it. Rust is overkill when:

  • Your pipeline runs once a day and finishes in 5 minutes. Rewriting it in Rust to make it run in 30 seconds doesn't matter if it's a scheduled batch job.
  • Your bottleneck is I/O, not compute. If you're waiting on API calls, database queries, or S3 downloads, Rust won't help. async Rust is good, but so is Python's asyncio.
  • Your team is all Python and nobody wants to learn Rust. A pipeline your team can maintain beats a faster pipeline nobody can debug.
  • You're doing exploratory data analysis. Jupyter notebooks exist for a reason. The compile-run-inspect loop in Rust is too slow for exploration.
  • You're gluing APIs together. Orchestration code — "call this API, check the result, trigger that job" — doesn't benefit from Rust's strengths.

The sweet spot for Rust in data engineering is compute-heavy transforms on large data, latency-sensitive streaming, and reusable tools/libraries. Everything else is better served by Python with Rust-powered libraries underneath.

What's Coming Next

The Rust data ecosystem is accelerating. A few things I'm watching:

  1. GlueSQL and other embeddable SQL engines are maturing, giving more options beyond DataFusion.
  2. Arrow Flight SQL is enabling Rust-based services to serve as high-performance query endpoints for BI tools.
  3. Iceberg-rs is in active development — a native Rust implementation of Apache Iceberg. Combined with delta-rs, this means you'll be able to interact with all major table formats from Rust without a JVM.
  4. The PyO3 ecosystem is making it easier to publish Rust-powered Python packages. Maturin, the build tool, now handles cross-compilation and wheel publishing seamlessly.
  5. WASM targets mean some of these tools could run in the browser. Imagine running DataFusion queries on Parquet files directly in a data catalog UI.

Getting Started: A Practical Path

If you're a data engineer curious about Rust, here's the path I'd recommend:

  1. Week 1-2: Work through The Rust Book chapters 1-10. Do the Rustlings exercises. Don't skip ownership and borrowing.
  2. Week 3: Install Polars (cargo add polars) and rewrite a simple pandas script. Compare the results and performance.
  3. Week 4: Try DataFusion. Load a Parquet file, run SQL queries. This is where it starts to feel powerful.
  4. Week 5-6: Build a small CLI tool with clap that does a real task from your work — maybe a data validator or a file format converter.
  5. Week 7+: Try PyO3. Take the slowest function in one of your Python pipelines and rewrite it as a Rust extension. Measure the improvement.

You don't need to go all-in. Even knowing enough Rust to read the source code of Polars or delta-rs makes you a better data engineer — you understand what's happening under the hood when you use these tools from Python.

Final Thoughts

Rust isn't replacing Python for data engineering. It's replacing the C/C++/Java layer underneath Python. The future looks like Python on top for expressiveness and Rust underneath for performance — and tools like Polars, DataFusion, and delta-rs are proving that model works exceptionally well.

What surprised me most wasn't the performance (I expected that). It was how much I enjoyed writing Rust once I got past the initial learning curve. The compiler catches entire categories of bugs that would have shown up at 3 AM in production. The type system makes refactoring fearless. And shipping a single binary that just works on any Linux box, with no virtualenv or dependency resolution, feels like a superpower after years of Python deployment headaches.

Start with Polars from Python. If you want more, learn enough Rust to build a small tool. If that hooks you, go deeper. The ecosystem will only get better from here.
