Key Takeaways
- DuckDB is an in-process OLAP database that runs directly inside your Python, R, or Node.js application with zero external dependencies.
- For datasets under ~100 GB, DuckDB consistently matches or outperforms cloud data warehouses like BigQuery and Redshift on analytical queries, at no infrastructure cost.
- It reads Parquet, CSV, and JSON natively, which means you can query your data lake files without loading them into a database first.
- DuckDB is not a replacement for distributed systems when you genuinely need multi-terabyte scale or concurrent write-heavy workloads.
- Adoption is accelerating fast: 2M+ monthly downloads on PyPI, integrated into dbt, Airflow, Dagster, and dozens of other tools in the modern data stack.
Why I Stopped Spinning Up BigQuery for Every Project
About eighteen months ago, I was building an analytics pipeline for a client that processed around 40 GB of event data daily. The classic playbook: land the data in GCS, load it into BigQuery, run transformations, serve dashboards. It worked, but the monthly bill was creeping toward $800 just for the warehouse, and every query had a 2-3 second cold start that made the feedback loop during development painfully slow.
A colleague mentioned DuckDB in passing. "It's like SQLite but for analytics," he said. I was skeptical. An embedded database handling tens of gigabytes of analytical queries? On a single machine? But I tried it on a Sunday afternoon, and by Monday morning I had migrated the entire development pipeline. The queries that took 3-4 seconds in BigQuery were completing in 200-400 milliseconds locally. No network round-trips, no cold starts, no bill at the end of the month.
That experience fundamentally changed how I think about data architecture. Not every workload needs a distributed cloud warehouse. In fact, most don't.
What Exactly Is DuckDB?
DuckDB is an in-process SQL OLAP (Online Analytical Processing) database management system. If you know SQLite, you already understand the deployment model: it runs inside your application process, stores data in a single file (or entirely in memory), and requires zero configuration or external services. But where SQLite is optimized for transactional (OLTP) workloads with row-oriented storage, DuckDB is built from the ground up for analytical queries using columnar storage and vectorized execution.
The project was started in 2018 by Mark Raasveldt and Hannes Mühleisen at CWI Amsterdam (the same research institute where MonetDB and later Vectorwise originated). It's open source under the MIT license, and the DuckDB Foundation ensures it stays that way.
Here's what makes it architecturally interesting:
- Columnar-vectorized execution engine: Processes data in batches of vectors rather than row-by-row, leveraging CPU cache lines and SIMD instructions.
- In-process execution: No client-server overhead. The database engine runs in the same process as your application.
- Zero external dependencies: A single `pip install duckdb` gives you the full database. No Java, no JVM, no Docker container.
- Parallel query execution: Automatically parallelizes queries across all available CPU cores.
- Direct file querying: Read from Parquet, CSV, JSON, and even remote files (S3, HTTP) without an import step.
Getting Started: A DuckDB Tutorial in 5 Minutes
Installation is almost comically simple compared to setting up a traditional warehouse:
pip install duckdb
That's it. No server process, no configuration files, no port management. Let's do something useful immediately:
import duckdb
# Create an in-memory database
con = duckdb.connect()
# Query a Parquet file directly — no loading step
result = con.sql("""
SELECT
date_trunc('month', event_timestamp) AS month,
event_type,
COUNT(*) AS event_count,
COUNT(DISTINCT user_id) AS unique_users
FROM 'events_2024/*.parquet'
GROUP BY 1, 2
ORDER BY 1 DESC, 3 DESC
""").fetchdf()
print(result.head(20))
Notice the glob pattern in the FROM clause. DuckDB natively reads multiple Parquet files, and it pushes down filters and projections to only read the columns and row groups it actually needs. If your Parquet files are partitioned by date, a WHERE clause on the date column will skip irrelevant files entirely.
Persistent Storage
For datasets you query repeatedly, you can persist to a DuckDB file:
import duckdb
# Create or open a persistent database
con = duckdb.connect('analytics.duckdb')
# Load data from Parquet into a persistent table
con.sql("""
CREATE TABLE IF NOT EXISTS events AS
SELECT * FROM 'raw_events/*.parquet'
""")
# Optional: DuckDB's ART index mainly helps point lookups on selective columns;
# range filters like the one below are already accelerated by built-in zonemaps
con.sql("CREATE INDEX idx_events_date ON events(event_timestamp)")
# Now queries hit the optimized local storage
con.sql("""
SELECT event_type, COUNT(*)
FROM events
WHERE event_timestamp >= '2024-06-01'
GROUP BY 1
""").show()
The Pandas and Polars Integration
What surprised me most when I first used DuckDB was how seamlessly it works with DataFrames. You can query Pandas DataFrames directly with SQL, with zero copy overhead:
import pandas as pd
import duckdb
# Some DataFrame you already have in memory
df_orders = pd.DataFrame({
'order_id': range(1, 1000001),
'customer_id': [f'C{i % 5000}' for i in range(1000000)],
'amount': [round(50 + i * 0.01, 2) for i in range(1000000)],
'created_at': pd.date_range('2024-01-01', periods=1000000, freq='min')
})
# Query it with SQL — no import, no copy
result = duckdb.sql("""
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent,
AVG(amount) AS avg_order_value
FROM df_orders
WHERE created_at >= '2024-06-01'
GROUP BY customer_id
HAVING COUNT(*) > 5
ORDER BY total_spent DESC
LIMIT 20
""").fetchdf()
This is close to a zero-copy operation. DuckDB registers the DataFrame through a replacement scan and reads the underlying columnar buffers directly, so numeric columns are queried without copying; object/string columns may still require a conversion pass. The same works with Polars DataFrames and Arrow tables, where the Arrow memory format makes the hand-off genuinely zero-copy.
Real-World Benchmarks: DuckDB vs BigQuery vs PostgreSQL
Numbers talk. I ran the same set of analytical queries across three systems using a realistic dataset: 50 million rows of e-commerce event data (~12 GB in Parquet, ~35 GB uncompressed CSV). The test machine for local runs was an M2 MacBook Pro with 16 GB RAM.
| Query Type | DuckDB (local) | BigQuery (on-demand) | PostgreSQL (local) |
|---|---|---|---|
| Simple aggregation (COUNT/GROUP BY) | 0.3s | 2.1s | 18.4s |
| Multi-table JOIN + aggregation | 1.2s | 3.8s | 45.2s |
| Window functions (RANK, LAG) | 0.8s | 4.2s | 52.1s |
| Full-text scan with LIKE | 1.5s | 2.9s | 67.3s |
| Parquet file scan (no pre-load) | 0.9s | N/A (requires load) | N/A (requires load) |
| Complex CTE with 3 subqueries | 2.1s | 5.6s | 78.9s |
The PostgreSQL numbers aren't a knock on Postgres — it's simply not designed for analytical workloads. Its row-oriented storage is perfect for transactional queries but terrible for column scans. The BigQuery numbers include network latency and cold start time, which is the reality of using a remote warehouse during development.
What I found particularly notable: DuckDB's performance on the Parquet file scan (no pre-loading into a table) was only marginally slower than querying its own persistent storage. The Parquet reader is extremely well optimized, using predicate pushdown and column pruning to minimize I/O.
DuckDB for Data Engineering Pipelines
Where DuckDB really shines isn't benchmarks on a laptop — it's the architectural simplicity it enables. Here's a pattern I've been using extensively in production pipelines:
The "Local-First ELT" Pattern
import duckdb
def run_daily_transform(execution_date: str):
"""Daily ELT job that processes S3 data locally with DuckDB."""
con = duckdb.connect()
# Install and load the httpfs extension for S3 access
con.sql("INSTALL httpfs; LOAD httpfs;")
    # NOTE: shell-style '${VAR}' placeholders are not expanded inside a Python
    # string; pull credentials through the standard AWS chain instead
    # (environment variables, config file, or IAM role)
    con.sql("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN, REGION 'us-east-1')")
# Read directly from S3 — only downloads needed columns/row groups
con.sql(f"""
CREATE TEMP TABLE raw_events AS
SELECT
event_id,
user_id,
event_type,
properties,
event_timestamp
FROM read_parquet('s3://data-lake/events/dt={execution_date}/*.parquet')
""")
# Run transformations locally — fast, free, no warehouse needed
con.sql("""
CREATE TEMP TABLE user_daily_summary AS
SELECT
user_id,
COUNT(*) FILTER (WHERE event_type = 'page_view') AS page_views,
COUNT(*) FILTER (WHERE event_type = 'purchase') AS purchases,
SUM(CAST(json_extract(properties, '$.amount') AS DECIMAL(10,2)))
FILTER (WHERE event_type = 'purchase') AS revenue,
MIN(event_timestamp) AS first_event,
MAX(event_timestamp) AS last_event,
COUNT(DISTINCT event_type) AS distinct_event_types
FROM raw_events
GROUP BY user_id
""")
# Write results back to S3 as Parquet
con.sql(f"""
COPY user_daily_summary
TO 's3://data-lake/aggregates/user_daily/dt={execution_date}/summary.parquet'
(FORMAT PARQUET, COMPRESSION 'zstd')
""")
row_count = con.sql("SELECT COUNT(*) FROM user_daily_summary").fetchone()[0]
print(f"Processed {row_count} user summaries for {execution_date}")
con.close()
This pattern eliminates the warehouse entirely for the transformation step. The S3 httpfs extension streams data in, DuckDB processes it locally using all available cores, and writes the result back. For datasets under 100 GB, this runs faster than equivalent BigQuery or Redshift jobs because there's no scheduler overhead, no slot contention, and no network round-trips during processing.
Integration with dbt
The dbt-duckdb adapter has matured significantly. You can run your dbt models against local Parquet files or a DuckDB database, which makes development and testing dramatically faster:
# profiles.yml
my_project:
target: dev
outputs:
dev:
type: duckdb
path: 'dev.duckdb'
extensions:
- httpfs
- parquet
prod:
type: bigquery
method: oauth
project: my-gcp-project
dataset: analytics
This gives you sub-second dbt model iteration during development, then the same SQL deploys to BigQuery in production. In my experience, this cuts the dbt development feedback loop from 10-15 seconds down to under a second.
Advanced Patterns Worth Knowing
Querying Remote Files Without Downloading
import duckdb
con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
# Query a public Parquet file over HTTP — streams, doesn't download
result = con.sql("""
SELECT
tpep_pickup_datetime::DATE AS pickup_date,
COUNT(*) AS trip_count,
AVG(trip_distance) AS avg_distance,
AVG(total_amount) AS avg_fare
FROM read_parquet(
'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet'
)
GROUP BY 1
ORDER BY 1
""").fetchdf()
print(result)
Using DuckDB as a Data Quality Gate
One pattern I've found invaluable is using DuckDB for data quality checks in CI/CD pipelines:
import duckdb
import sys
def validate_data(parquet_path: str) -> bool:
"""Run data quality checks before promoting files to production."""
con = duckdb.connect()
checks = {
"no_null_ids": """
SELECT COUNT(*) FROM read_parquet('{path}')
WHERE user_id IS NULL
""",
"no_future_dates": """
SELECT COUNT(*) FROM read_parquet('{path}')
WHERE event_timestamp > CURRENT_TIMESTAMP + INTERVAL '1 hour'
""",
"amount_range": """
SELECT COUNT(*) FROM read_parquet('{path}')
WHERE amount < 0 OR amount > 100000
""",
"row_count_minimum": """
SELECT CASE WHEN COUNT(*) < 1000 THEN 1 ELSE 0 END
FROM read_parquet('{path}')
""",
}
all_passed = True
for check_name, query in checks.items():
result = con.sql(query.format(path=parquet_path)).fetchone()[0]
status = "PASS" if result == 0 else "FAIL"
print(f" {status}: {check_name} (violations: {result})")
if result > 0:
all_passed = False
return all_passed
if __name__ == "__main__":
passed = validate_data(sys.argv[1])
sys.exit(0 if passed else 1)
This runs in under a second for multi-gigabyte files and requires nothing except Python and DuckDB. No database server, no infrastructure. Plug it into a GitHub Action or Airflow task and you have lightweight data validation that scales surprisingly well.
Replacing pandas for Large-ish Data
If you've ever hit the wall where Pandas runs out of memory on a 5 GB CSV file, DuckDB is the escape hatch you've been looking for:
import duckdb
# This would OOM in pandas on a 16GB laptop
# In DuckDB it streams through disk-backed processing
result = duckdb.sql("""
SELECT
region,
product_category,
DATE_TRUNC('quarter', order_date) AS quarter,
SUM(revenue) AS total_revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM read_csv('massive_sales_data.csv',
auto_detect=true,
sample_size=100000)
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3
""").fetchdf() # Result is small enough for pandas
DuckDB spills to disk when memory is tight, so it can process datasets significantly larger than RAM. The auto_detect parameter handles CSV schema inference, and sample_size controls how many rows it uses for type detection.
When to Use DuckDB
- Local analytics and EDA: Replace slow Pandas operations with SQL on Parquet/CSV files. Instant results on multi-GB data.
- Data pipeline development: Iterate on transformation logic locally before deploying to a cloud warehouse.
- CI/CD data validation: Run quality checks in pipelines without needing database infrastructure.
- Embedded analytics: Build analytical features directly into Python/Node.js/Rust applications without a separate database service.
- Small-to-medium batch processing: Daily jobs processing up to ~100 GB of data on a single machine.
- Data science notebooks: Query files and DataFrames with SQL in Jupyter, faster than Pandas for anything involving grouping or joining.
- Prototyping data models: Rapidly test schema designs and query patterns before committing to a warehouse.
When NOT to Use DuckDB
- Multi-terabyte datasets: DuckDB runs on a single machine. If your data genuinely doesn't fit on one node (even with spilling to disk), you need Spark, BigQuery, Snowflake, or similar distributed systems.
- High-concurrency serving: DuckDB supports multiple readers but only one writer at a time. If you need hundreds of concurrent analytical queries, a proper OLAP server (ClickHouse, Druid) is more appropriate.
- OLTP workloads: Frequent single-row inserts, updates, deletes — that's what PostgreSQL and MySQL are built for. DuckDB's columnar storage makes individual row operations expensive.
- Real-time streaming: DuckDB is a batch-oriented engine. For sub-second event processing, look at Flink, Kafka Streams, or Materialize.
- Multi-user warehouse with governance: If you need role-based access control, audit logs, query history, cost attribution — enterprise warehouses provide this; DuckDB doesn't.
DuckDB vs BigQuery: An Honest Comparison
Since "DuckDB vs BigQuery" is one of the most common questions I get, here's my honest assessment after using both extensively:
| Dimension | DuckDB | BigQuery |
|---|---|---|
| Setup time | 10 seconds (pip install) | 30+ minutes (project, IAM, billing) |
| Cost at 50 GB | $0 (runs on your laptop) | $50-200/month (storage + queries) |
| Query latency (warm) | 50-500ms | 1-5 seconds |
| Scale ceiling | ~100-200 GB practical | Petabytes |
| Concurrency | Single writer, multiple readers | Thousands of concurrent users |
| Ecosystem integration | Python, R, Node, Rust, Java, dbt | GCP ecosystem, Looker, dbt, Dataflow |
| Governance / RBAC | None | Enterprise-grade |
| Offline capability | Full (it's local) | None (cloud only) |
The takeaway isn't that one is better than the other. It's that they serve different stages of the data lifecycle. In my current workflow, DuckDB handles development, testing, and small batch jobs. BigQuery handles production workloads that need multi-tenant access and governance. Many teams are adopting exactly this hybrid approach.
The Ecosystem Is Growing Fast
DuckDB's extension system is one of its strongest features. Some extensions worth knowing about:
- httpfs: Query files directly from S3, GCS, Azure Blob, or any HTTP endpoint.
- postgres_scanner: Query PostgreSQL tables directly from DuckDB without exporting data.
- mysql: The same for MySQL databases.
- sqlite_scanner: Attach and query SQLite databases.
- spatial: GIS functions (ST_Point, ST_Distance, etc.) for geospatial analytics.
- iceberg: Read Apache Iceberg tables directly.
- delta: Read Delta Lake tables.
- json: Advanced JSON processing functions.
The postgres_scanner extension deserves special mention. You can query your production PostgreSQL database from DuckDB, which means you can run analytical queries against your OLTP database without impacting it — DuckDB pulls the data it needs and processes it locally:
import duckdb
con = duckdb.connect()
con.sql("INSTALL postgres; LOAD postgres;")
con.sql("""
ATTACH 'host=localhost dbname=myapp user=readonly'
AS pg (TYPE POSTGRES, READ_ONLY)
""")
# Run analytical queries against Postgres data, processed by DuckDB
result = con.sql("""
    SELECT
        DATE_TRUNC('week', o.created_at) AS week,
        p.category,
        SUM(o.total) AS weekly_revenue
    FROM pg.orders AS o
    JOIN pg.products AS p ON o.product_id = p.id
    WHERE o.created_at >= '2024-01-01'
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchdf()
Things That Still Annoy Me
In the interest of honest assessment, here are the rough edges I still encounter:
- Single-writer limitation: If two processes try to write to the same DuckDB file simultaneously, one will fail. This means you need to be careful with concurrent Airflow tasks writing to the same database.
- Memory management can be opaque: DuckDB tries to stay within a configurable memory limit and spill to disk, but I've hit OOM situations on particularly complex queries with many intermediate results. Setting `SET memory_limit = '8GB'` explicitly helps.
- Slow UPDATE/DELETE performance: Columnar storage means modifying individual rows is expensive. If you need to update rows frequently, rethink the approach (append-only with deduplication works better).
- Extension version compatibility: Occasionally, upgrading DuckDB breaks extension compatibility. Pin your versions in production.
Where DuckDB Is Heading
The DuckDB Labs team has been shipping at a remarkable pace. Recent releases (0.10.x and 1.0.x in 2024) brought significant improvements: a stable storage format (finally — no more re-creating databases on upgrades), improved parallel hash joins, better out-of-core sorting, and the community extensions repository.
The broader trend is clear: the "embedded analytics database" category that DuckDB created is now a recognized part of the modern data stack. MotherDuck (a cloud service built on DuckDB) raised $52.5 million to build a hybrid local-cloud DuckDB experience. Tools like Evidence, Rill, and Malloy use DuckDB as their default analytical engine.
In my experience, we're moving toward a world where most data teams will use DuckDB as their default local analytical tool, the same way most developers use SQLite as their default embedded database. The cloud warehouse won't disappear, but it will become the exception rather than the default — reserved for workloads that genuinely need distributed compute and enterprise governance.
Getting Started Today
If you want to try DuckDB in your current workflow, here's the lowest-effort way to start:
- Install it: `pip install duckdb`
- Find a Parquet or CSV file you currently load into Pandas or a database for analysis.
- Query it directly: `duckdb.sql("SELECT * FROM 'your_file.parquet' LIMIT 10").show()`
- Compare performance against your current approach for a typical analytical query.
- If it's faster (it almost certainly will be), gradually adopt it for more of your local analytics work.
You don't need to rearchitect anything. DuckDB's greatest strength is that it meets you where you are — your existing files, your existing Python scripts, your existing notebooks. It just makes the analytical SQL part dramatically faster and simpler.
The best database is the one you don't have to manage. For a surprisingly large set of analytical workloads, that database is DuckDB.