Chapter 1 — Hello Assets: CSV → DuckDB

This is where we start.
You’ll see what Dagster means by an asset, how to run it locally, and why this approach feels cleaner than a pile of ad-hoc scripts.

We’ll use one small CSV file, turn it into a few tidy DataFrames, and save the result into a DuckDB file. Everything runs in Docker so your environment stays reproducible.


What You’ll Learn

  • What Dagster assets are and how they describe data flow.
  • How to run Dagster locally in Docker.
  • How to materialize a simple graph and check the result in DuckDB.
  • How to run a test end-to-end.

Why Start with Assets

In Dagster, an asset is a data product with a name, lineage, and history.
Think of it as a file, table, or dataset that Dagster knows how to build and track.

Instead of a script that just “runs,” an asset says, “Here’s what I produce, here’s what I depend on, and here’s where the result lives.”
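In code, that contract is just a decorated Python function. Here's a minimal sketch (the real version appears later in this chapter; the hardcoded path is illustrative):

import pandas as pd
from dagster import asset

@asset
def raw_iris() -> pd.DataFrame:
    # The function body is the build recipe; the return value is the asset.
    return pd.read_csv("data/raw/iris.csv")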

When you build with assets, you automatically get:

  • Lineage — Dagster knows which asset depends on which.
  • Metadata — every run records rows, columns, and timestamps.
  • Observability — logs, previews, and history appear in the UI.
  • Selective runs — you can rebuild only what changed.
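Selective runs are concrete even at this scale: once assets are defined, you can rebuild a single one from the command line (using the module path introduced later in this chapter):

dagster asset materialize --select iris_summary -m pipelines_the_right_way.ch01.defs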

Our first graph looks like this:

graph TD
  A[raw_iris] --> B[iris_clean] --> C[iris_summary]

Project Layout

chapter-01-hello-assets/
  data/
    raw/iris.csv
    warehouse/
  docker/
    Dockerfile
    docker-compose.yaml
    dagster_home/dagster.yaml
    scripts/start.sh
    scripts/stop.sh
  src/pipelines_the_right_way/ch01/
    assets.py
    defs.py
  tests/test_assets.py
  Makefile
  requirements.txt

Each folder serves a clear purpose:

  • data/ holds the CSV and the DuckDB output.
  • docker/ defines a lightweight container image and turns off telemetry.
  • src/ keeps the Dagster code.
  • tests/ includes a small test that materializes everything once.
  • Makefile wraps common commands.

Run It

From inside the project folder:

bash docker/scripts/start.sh

Then open http://localhost:3000.
Click Materialize all. Dagster will run three assets:

  1. raw_iris reads data/raw/iris.csv and emits metadata.
  2. iris_clean checks column types and drops nulls.
  3. iris_summary groups data by species and writes a DuckDB file in data/warehouse/iris.duckdb.

When you’re done:

bash docker/scripts/stop.sh

Check the result manually (the leading D below is DuckDB's interactive prompt, not part of the query):

duckdb data/warehouse/iris.duckdb
D SELECT * FROM iris_summary;

How the Code Works

assets.py

Each asset is a plain Python function decorated with @asset.

To keep paths flexible, we compute them at runtime. The snippet below adds the imports for context; the exact ROOT definition in the repo may differ from the assumption shown here:

import os
from pathlib import Path

# Assumed: ROOT is the chapter root, three directory levels above this file.
ROOT = Path(__file__).resolve().parents[3]

def _paths():
    data_dir = Path(os.getenv("PTWR_DATA_DIR", ROOT / "data"))
    raw_csv = data_dir / "raw" / "iris.csv"
    warehouse_dir = data_dir / "warehouse"
    duckdb_path = warehouse_dir / "iris.duckdb"
    return data_dir, raw_csv, warehouse_dir, duckdb_path

This avoids the usual “path not found” issues when tests or containers run in different environments.

  • raw_iris loads the CSV and emits metadata (row count, column names, preview).
  • iris_clean checks for missing columns, converts types, and removes nulls.
  • iris_summary groups by species and writes the result to DuckDB:
con = duckdb.connect(str(duckdb_path))    # creates the .duckdb file if it doesn't exist yet
con.register("iris_summary_df", summary)  # exposes the DataFrame to SQL under that name
con.execute("CREATE OR REPLACE TABLE iris_summary AS SELECT * FROM iris_summary_df")
con.close()                               # release the file lock so other tools can open it
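For a fuller picture, here is a hedged sketch of the first two assets. The column names, metadata keys, and exact cleaning rules are assumptions for illustration, not code copied from the repo:

import pandas as pd
from dagster import MetadataValue, Output, asset

# Assumed headers for the iris CSV; the repo's file may name them differently.
EXPECTED_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

@asset
def raw_iris():
    _, raw_csv, _, _ = _paths()
    df = pd.read_csv(raw_csv)
    # Metadata attached to the Output appears on the asset's page in the UI.
    return Output(
        df,
        metadata={
            "rows": len(df),
            "columns": list(df.columns),
            "preview": MetadataValue.md(df.head().to_markdown()),
        },
    )

@asset
def iris_clean(raw_iris: pd.DataFrame) -> pd.DataFrame:
    # Naming the parameter after the upstream asset is what wires the dependency.
    missing = set(EXPECTED_COLUMNS) - set(raw_iris.columns)
    if missing:
        raise ValueError(f"iris.csv is missing columns: {sorted(missing)}")
    return raw_iris.dropna()

Note how lineage falls out of the function signature: iris_clean never imports raw_iris; the parameter name alone tells Dagster to pass the upstream DataFrame in.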

defs.py

Dagster needs to know what assets exist:

from dagster import Definitions
from .assets import raw_iris, iris_clean, iris_summary

defs = Definitions(assets=[raw_iris, iris_clean, iris_summary])

When you run dagster dev -m pipelines_the_right_way.ch01.defs, Dagster loads this list and shows the graph in the UI.


Running in Docker

The Compose file mounts:

  • ../data to /app/data
  • ./dagster_home to /app/.dagster

That keeps your data and instance config outside the container image.

Telemetry is disabled in the mounted dagster.yaml (Dagster enables it by default), so the setup stays clean.
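Concretely, the relevant pieces look roughly like the sketch below. The service name, port mapping, and environment variable are assumptions, not copied from the repo:

# docker-compose.yaml (sketch)
services:
  dagster:
    build: .
    ports:
      - "3000:3000"
    volumes:
      - ../data:/app/data
      - ./dagster_home:/app/.dagster
    environment:
      DAGSTER_HOME: /app/.dagster  # assumed: tells Dagster where the mounted instance config lives

# dagster_home/dagster.yaml
telemetry:
  enabled: false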


Testing

Run all tests with:

make test

The test creates a temporary data folder, sets environment variables before importing assets, copies the sample CSV, and materializes the graph.
It checks that the DuckDB file exists and that iris_summary has rows.
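A hedged sketch of that test, assuming pytest fixtures and the PTWR_DATA_DIR variable read by _paths(); the import paths mirror the layout above, but details may differ in the repo:

import shutil
from pathlib import Path

import duckdb

def test_pipeline_end_to_end(tmp_path, monkeypatch):
    # Point the pipeline at a throwaway data directory before anything runs.
    monkeypatch.setenv("PTWR_DATA_DIR", str(tmp_path))
    (tmp_path / "raw").mkdir()
    (tmp_path / "warehouse").mkdir()
    shutil.copy(Path("data/raw/iris.csv"), tmp_path / "raw" / "iris.csv")

    # Import after the env var is set, matching the pattern described above.
    from dagster import materialize
    from pipelines_the_right_way.ch01.assets import iris_clean, iris_summary, raw_iris

    result = materialize([raw_iris, iris_clean, iris_summary])
    assert result.success

    duckdb_path = tmp_path / "warehouse" / "iris.duckdb"
    assert duckdb_path.exists()
    con = duckdb.connect(str(duckdb_path))
    assert con.execute("SELECT COUNT(*) FROM iris_summary").fetchone()[0] > 0
    con.close()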

This pattern guarantees reproducible results without touching your real data/ directory.


Common Issues and Fixes

  • ImportError: attempted relative import with no known parent package
    → Run with dagster dev -m pipelines_the_right_way.ch01.defs.
  • No instance configuration file warning
    → We already mount docker/dagster_home/dagster.yaml.
  • No output file created
    → Make sure you materialized all assets; check the data/ mount in Docker.
  • Missing columns in CSV
    → iris_clean validates headers and fails clearly if they don’t match.

Observability in the UI

Click each asset to see:

  • metadata you emitted,
  • row counts and previews,
  • compute kind (“pandas” or “duckdb”),
  • the dependency graph connecting all assets.

This visibility is why assets are worth using even for small projects.


Production-Ready Touches

Even this simple setup follows good habits:

  • Pinned dependencies for reproducibility.
  • Dockerized environment — no local Python mess.
  • Tests that cover the full pipeline.
  • Idempotent DuckDB writes (CREATE OR REPLACE TABLE).

Looking Ahead

In the next chapter, we’ll replace a static CSV with a live API and see how assets handle real data and retries.

By the end of this book, you’ll have learned how to scale from this small example to real-world, testable, observable data pipelines.

For now, if you can check off each of these boxes, you’re ready:

  • UI opens at http://localhost:3000.
  • Materialization creates data/warehouse/iris.duckdb.
  • SQL query shows three species.
  • Tests pass.

If everything works, congratulations — you’ve just built your first Dagster pipeline the right way.


(End of Chapter 1)

Official Dagster: https://dagster.io/

GitHub for Chapter 1: https://github.com/alexnews/pipelines-the-right-way/tree/main/chapter-01-hello-assets