DuckDB for Data Engineers: The In-Process OLAP Engine That Makes Parquet Fly


Why DuckDB matters for mid-level data engineers

You’ve got datasets that are too big for Pandas but too small (or too local) to justify spinning up Spark. You want columnar speed, SQL power, and zero cluster babysitting. That’s DuckDB: an in-process OLAP database you embed in Python/R/Node/Java and point directly at Parquet/CSV/JSON—often without moving data at all. (DuckDB)


What DuckDB is (and how it’s different)

In-process OLAP, columnar & vectorized

DuckDB runs inside your process (no server to manage) and executes analytical SQL using a vectorized engine: operators work on fixed-size columnar batches (“vectors” and “data chunks”) for CPU-efficient scans, filters, joins, and aggregates. That’s why it feels “fast on a laptop.” (DuckDB)

Files as tables, with pushdown

DuckDB reads/writes Parquet efficiently and pushes projections/filters into the Parquet scan so you don’t pay to deserialize the whole file. It also supports CSV/JSON and can query files over HTTP/S or S3 via the httpfs extension. (DuckDB)

Zero-friction setup, wide client support

Use the CLI or import from Python, R, Node, Java, Go, ODBC, and more. Install is a single binary or pip package. (DuckDB)

🔎 Current status: DuckDB publishes frequent releases (e.g., 1.4.2 LTS on Nov 12, 2025). Check the homepage for the latest. (DuckDB)


Quick start: SQL & Python that feels like cheating

Query Parquet locally

-- Project only needed columns and push filters into the Parquet scan
SELECT user_id, sum(amount) AS gross
FROM read_parquet('data/sales/*.parquet')
WHERE order_ts >= DATE '2025-01-01'
GROUP BY ALL
ORDER BY gross DESC
LIMIT 20;

(Projection + filter pushdown occur automatically on Parquet.) (DuckDB)
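To see the pushdown on your own files, EXPLAIN the query and inspect the Parquet scan node: it lists the projected columns and any filters pushed into the scan (exact node names vary across DuckDB versions). A minimal sketch reusing the query above:

-- Inspect the physical plan; the Parquet scan node should show only
-- user_id, amount, order_ts and the order_ts filter
EXPLAIN
SELECT user_id, sum(amount) AS gross
FROM read_parquet('data/sales/*.parquet')
WHERE order_ts >= DATE '2025-01-01'
GROUP BY ALL;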

Query Parquet over HTTPS or S3

INSTALL httpfs;  -- downloads the extension once per DuckDB installation
LOAD httpfs;     -- load it in each session that needs remote access

-- HTTP(S) Parquet (replacement scan lets you query by URL)
SELECT COUNT(*) 
FROM 'https://duckdb.org/data/holdings.parquet';

httpfs enables reading remote data; URLs ending with .parquet can be scanned directly. (DuckDB)
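For private S3 buckets you also need credentials. The snippet below is a minimal sketch using DuckDB's Secrets Manager; the bucket, region, and key values are placeholders, and older setups can use the legacy SET s3_* options instead.

INSTALL httpfs;
LOAD httpfs;

-- Register session-scoped S3 credentials (placeholder values)
CREATE SECRET my_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'eu-west-1'
);

SELECT count(*)
FROM read_parquet('s3://my-bucket/sales/2025/*.parquet');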

Python example: window analytics on Parquet

import duckdb

con = duckdb.connect()  # in-memory
con.execute("PRAGMA threads=8;")
con.execute("SET memory_limit='8GB';")

query = """
WITH s AS (
  SELECT customer_id, amount, order_ts
  FROM read_parquet('s3://my-bucket/sales/2025/*.parquet')
  WHERE order_ts >= DATE '2025-06-01'
)
SELECT
  customer_id,
  sum(amount)                           AS total_spend,
  avg(amount)                           AS avg_order,
  percentile_cont(0.5) WITHIN GROUP (ORDER BY amount) AS p50_order,
  sum(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_ts
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )                                     AS running_spend
FROM s
GROUP BY ALL
ORDER BY total_spend DESC
LIMIT 50;
"""
print(con.execute(query).fetch_df().head())

Use PRAGMAs/config to control threads and memory at runtime. (DuckDB)


Architecture explained (clear & concise)

  • Columnar storage & vectors: DuckDB processes data in fixed-size vectors (2,048 values by default) grouped into DataChunks, enabling cache-friendly execution and SIMD-friendly operators. (DuckDB)
  • Vectorized operators: Scans, filters, joins, and aggregates act on vectors instead of row-at-a-time, lowering CPU overhead. (DuckDB)
  • Pushdown to file formats: For Parquet, DuckDB prunes columns and pushes predicates, minimizing IO and deserialization. (DuckDB)
  • Extensions: httpfs adds HTTP/S + S3 access; other core extensions cover formats and functions. (DuckDB)

Practical tuning for “medium data” (1–100 GB)

  • Parallelism: PRAGMA threads=N; to match CPU cores for scans/aggregates/joins. (DuckDB)
  • Memory guardrails: SET memory_limit='8GB'; to keep queries in check in containers/CI. (DuckDB)
  • File layout: Prefer fewer, larger Parquet files with sensible row-group sizes; lots of tiny files and tiny row groups hurt throughput (see the consolidation sketch after this list). (DuckDB)
  • Column pruning: SELECT only the columns you need—DuckDB will push that down to Parquet. (DuckDB)
  • Remote access: Install & load httpfs when reading from S3/HTTPS; URLs with .parquet can be queried directly (replacement scans). (DuckDB)
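DuckDB itself can do the file-layout compaction mentioned above: read the small shards and rewrite them as one Parquet file with an explicit row-group size. A minimal sketch with hypothetical paths (ROW_GROUP_SIZE is in rows; tune it to your data):

-- Consolidate many small shards into one larger Parquet file
COPY (
    SELECT * FROM read_parquet('data/sales/shards/*.parquet')
) TO 'data/sales/sales_2025.parquet'
  (FORMAT PARQUET, ROW_GROUP_SIZE 122880);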

Common pitfalls (and how to avoid them)

  • “Too many small files” syndrome: Consolidate Parquet shards to reduce metadata and open/seek overhead. (DuckDB)
  • Forgetting to cap resources: Long pipelines inside notebooks/CI? Set threads and memory_limit explicitly. (DuckDB)
  • Pulling whole CSVs over HTTP: CSV is row-based, so partial reads are limited; stage the file locally or convert it to Parquet (a staging sketch follows this list). (DuckDB)
  • Assuming server features: DuckDB is in-process—don’t expect external auth/session pooling like a dedicated DB server. (DuckDB)
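For the CSV-over-HTTP pitfall, a pragmatic pattern is to pull the file once, convert it to Parquet, and point every later query at the columnar copy. A sketch with a placeholder URL (httpfs must be loaded for the remote read):

-- One-time staging: remote CSV -> local Parquet
COPY (
    SELECT * FROM read_csv('https://example.com/exports/events.csv')
) TO 'staging/events.parquet' (FORMAT PARQUET);

-- Later queries hit the local columnar copy with pushdown
SELECT count(*) FROM read_parquet('staging/events.parquet');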

When to choose DuckDB vs. alternatives

| Scenario / Need | DuckDB (in-process OLAP) | Pandas | Spark / Dask (distributed) | SQLite (in-process OLTP) |
| --- | --- | --- | --- | --- |
| Ad-hoc analytics on Parquet/CSV (1–100 GB) | Excellent | OK (memory bound) | Overhead vs. scale | Not a fit |
| SQL window/OLAP functions locally | Excellent | Limited (Pythonic) | Excellent | Limited |
| Reads over HTTP/S3 without ETL | Built-in via httpfs | Plugins, manual | Yes (connector) | No |
| Operational workloads / row-at-a-time | Not ideal | Not ideal | Not ideal | Strong |
| Zero server, easy packaging | Yes | Yes | No | Yes |

(“Excellent/OK” are qualitative; pick based on data volume, team skills, and operational context.) (DuckDB)


Security & enterprise notes

  • Parquet encryption: DuckDB can read and write encrypted Parquet (modular encryption) with session-scoped keys (see the sketch after this list). (DuckDB)
  • Operations guidance: See the Operations Manual for deployment advice in mission-critical setups. (DuckDB)
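A minimal encryption sketch, assuming the Parquet-encryption API in recent versions (a named key registered per session via PRAGMA and referenced when writing and reading; the 32-character key below is a placeholder, and option names may differ by version):

-- Register a 256-bit key for this session (placeholder value)
PRAGMA add_parquet_key('key256', '0123456789abcdef0123456789abcdef');

-- Write an encrypted copy
COPY (SELECT * FROM read_parquet('data/sales/*.parquet'))
TO 'secure/sales_encrypted.parquet'
  (FORMAT PARQUET, ENCRYPTION_CONFIG {footer_key: 'key256'});

-- Read it back with the same key
SELECT count(*)
FROM read_parquet('secure/sales_encrypted.parquet',
                  encryption_config = {footer_key: 'key256'});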

Internal link ideas (for your site)

  • “Columnar vs. Row-based Storage Explained” (tie into DuckDB’s vectorized engine)
  • “Parquet 101: Row Groups, Compression, and Predicate Pushdown”
  • “Designing Local Analytics Workstations: CPU, RAM, and NVMe for DuckDB”
  • “From Notebook to Prod: Packaging DuckDB Queries in CI/CD”

External references (official docs only)

  • DuckDB homepage & latest release notes. (DuckDB)
  • Parquet I/O & tips (pushdown, row groups). (DuckDB)
  • httpfs extension (HTTP/S + S3, replacement scans). (DuckDB)
  • Pragmas (threads, memory limits). (DuckDB)
  • Execution internals (vectors, data chunks). (DuckDB)
  • Operations Manual. (DuckDB)

Summary & call-to-action

DuckDB gives you OLAP-grade SQL and columnar speed right inside your code. For mid-sized analytics on Parquet/CSV—especially ad-hoc exploration, feature engineering, and CI-friendly checks—it’s hard to beat the install-and-go simplicity. Start with a local Parquet dataset, set threads and memory_limit, and measure. If you outgrow your laptop, you can still graduate to distributed engines later—without throwing away your SQL.

Try it today: point DuckDB at a Parquet folder, run two queries (local vs. HTTP/S3), and compare runtime and IO counters. Then codify that workflow in a reusable Python function.


Image prompt (for generation tools)

“A clean, modern data architecture diagram of DuckDB in an analyst’s laptop: local Parquet/CSV on NVMe, optional HTTP/S3 via httpfs, vectorized execution blocks (vectors/data chunks), and SQL queries in a notebook. Minimalistic, high-contrast, isometric 3D style.”

Tags

#DuckDB #DataEngineering #Parquet #OLAP #Analytics #VectorizedExecution #Python #SQL #BigData


Pitch ideas (pick your next article)

  1. “DuckDB vs. Spark vs. Pandas: Choosing the Right Engine for 10–200 GB Analytics” — intent: comparative, decision-making.
  2. “Parquet Performance Tuning with DuckDB: Row Groups, Compression, and Predicate Pushdown” — intent: technical how-to. (DuckDB)
  3. “Streaming-Adjacent: Efficient Incremental Batch with DuckDB on S3 via httpfs” — intent: practical patterns. (DuckDB)
  4. “Inside DuckDB’s Vectorized Engine: What Vectors & DataChunks Mean for Your Queries” — intent: educational internals. (DuckDB)
  5. “Security Primer: Working with Encrypted Parquet in DuckDB” — intent: enterprise readiness. (DuckDB)
  6. “From Notebook to CI: Testing Data Quality with Pure SQL in DuckDB” — intent: DevOps/DataOps.
  7. “Sane Defaults: Threading, Memory Limits, and Resource Control in DuckDB” — intent: ops guidance. (DuckDB)