Streams, Tasks, and Dynamic Tables: The Modern Snowflake ELT Playbook
Build incremental pipelines with zero cron jobs
Introduction — why this matters
Cron-driven data jobs rot. They overrun, collide, and lie when they say “green.” Snowflake gives you three primitives that kill most cron scripts: Streams for change capture, Tasks for dependency-aware orchestration, and Dynamic Tables for continuously maintained transformations. Put them together and you get incremental ELT that’s event-driven, debuggable, and cost-aware—without a farm of brittle schedulers.
This guide shows the end-to-end pattern: ingest → raw → bronze/silver/gold using Streams+Tasks or Dynamic Tables, with no cron math, only data-driven triggers and target lags.
The mental model
- Streams = lightweight CDC over a table/view (the “delta feed”).
- Tasks = serverless schedulers that chain via AFTER and WHEN (run only when there’s work).
- Dynamic Tables (DTs) = continuously maintained tables with a TARGET_LAG (how fresh you want it) that propagate changes through a DAG.
Think: Streams for ingest deltas, Tasks for imperative merges/ops, Dynamic Tables for declarative transforms.
Architecture at a glance
Landing → Raw
- Files arrive in an external/internal stage (e.g., S3). Use Snowpipe or COPY INTO to load RAW_ tables.
Incremental Transform
Two idioms—pick one per layer:
- Streams + Tasks (explicit control, great for upserts/SCD):
- A stream over RAW captures new/changed rows.
- A task fires only when the stream has data and runs a MERGE.
- Downstream tasks chain with AFTER.
- Dynamic Tables (declarative, graph-managed):
- Define DTs for BRONZE/SILVER/GOLD with TARGET_LAG and WAREHOUSE.
- Snowflake incrementally refreshes them; downstream DTs update automatically.
You can mix them: Streams+Tasks for stateful upserts, DTs for stateless transforms/joins.
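For instance, nothing stops a declarative DT from reading a table that a MERGE task maintains. A minimal sketch of that mix, assuming the bronze_orders table and warehouses built out in the examples below (the DT name here is illustrative):
-- Declarative rollup over the task-maintained bronze table
create or replace dynamic table dt_gold_from_bronze
  target_lag = '5 minutes'
  warehouse = etl_small
as
select
  date_trunc('day', ts) as order_date,
  sum(amount) as total_sales
from bronze_orders
group by 1;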
Quick comparison
| Feature | Streams + Tasks | Dynamic Tables |
|---|---|---|
| Control | Fine-grained, imperative SQL | Declarative; Snowflake manages refresh |
| Latency | Task cadence / event-driven | TARGET_LAG (e.g., 1 minute) |
| Best for | Upserts, SCD, idempotent merges, audit trails | Joins, filters, rollups, materialized views of logic |
| Orchestration | AFTER chain; WHEN guards | Automatic DAG from DT dependencies |
| Cost tuning | Choose warehouse per task; sleep if no work | Choose warehouse per DT; Snowflake prunes work |
| Pitfalls | Stream consumption semantics; multiple readers | Long join chains; “hidden” compute via cascades |
Real example: Ingest → Upsert (Streams + Tasks, zero cron)
1) Raw load (Snowpipe or COPY)
-- Raw landing table
create or replace table raw_orders (
order_id varchar,
customer_id varchar,
ts timestamp_ntz,
amount number(12,2),
_load_file varchar,
_ingested_at timestamp_ntz default current_timestamp()
);
-- Example COPY (Snowpipe auto-ingest recommended in production)
copy into raw_orders
from @ingest_stage/orders/
file_format = (type = csv field_optionally_enclosed_by='"' skip_header=1)
pattern='.*\.csv';
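If you go the auto-ingest route the comment above recommends, the same COPY can live inside a pipe. A minimal sketch, assuming @ingest_stage is an external stage with cloud event notifications already wired up (the pipe name is illustrative):
-- Snowpipe: files landing in the stage trigger the COPY automatically
create or replace pipe pipe_raw_orders
  auto_ingest = true
as
copy into raw_orders
from @ingest_stage/orders/
file_format = (type = csv field_optionally_enclosed_by='"' skip_header=1)
pattern = '.*\.csv';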
2) Create a stream on raw
create or replace stream raw_orders_stream
on table raw_orders
append_only = false; -- capture inserts + updates + deletes
3) Upsert into bronze via a Task that runs only when there’s work
create or replace table bronze_orders like raw_orders;
create or replace task t_upsert_bronze_orders
warehouse = etl_xs
-- No schedule: we’ll “chain-trigger” from a lightweight root,
-- and guard with WHEN to avoid idle runs
when system$stream_has_data('RAW_ORDERS_STREAM')
as
merge into bronze_orders b
using (
  -- updates arrive in the stream as delete + insert pairs; keep the insert side
  -- and de-dup to one row per key so the merge source is deterministic
  select order_id, customer_id, ts, amount
  from raw_orders_stream
  where metadata$action = 'INSERT'
  qualify row_number() over (partition by order_id order by ts desc) = 1
) d
on b.order_id = d.order_id
when matched then update set
  b.customer_id = d.customer_id,
  b.ts = d.ts,
  b.amount = d.amount
when not matched then insert (order_id, customer_id, ts, amount)
values (d.order_id, d.customer_id, d.ts, d.amount);
4) Minimal root “tick” task (can be very infrequent)
We still need some heartbeat to evaluate WHEN. Make it infrequent and cheap—no cron tuning, just a guard.
-- Fires every 5 minutes but does no work itself
create or replace task t_tick
schedule = '5 minute'
warehouse = etl_xs
as
call system$wait(1); -- waits one second and does no real work; running the root lets the downstream WHEN guards be evaluated
-- Chain the upsert after the tick
alter task t_upsert_bronze_orders add after t_tick;
-- Turn on the DAG
alter task t_tick resume;
alter task t_upsert_bronze_orders resume;
Why this is “zero cron jobs” in practice: you’re not hand-scheduling business logic. The tick just wakes the graph; WHEN system$stream_has_data ensures no compute if there’s no data. Make the tick 1, 5, or 15 minutes—whatever your freshness SLO needs.
Real example: Declarative pipeline (Dynamic Tables)
1) Bronze as a DT over raw
create or replace dynamic table dt_bronze_orders
target_lag = '1 minute'
warehouse = etl_xs
as
select
order_id,
customer_id,
ts,
amount
from raw_orders;
2) Silver: joins + cleaning
create or replace dynamic table dt_silver_orders
target_lag = '2 minutes'
warehouse = etl_small
as
select
b.order_id,
b.customer_id,
date_trunc('day', b.ts) as order_date,
coalesce(b.amount, 0) as amount -- treat missing amounts as zero
from dt_bronze_orders b;
3) Gold: daily aggregates
create or replace dynamic table dt_gold_daily_sales
target_lag = '5 minutes'
warehouse = etl_small
as
select
order_date,
sum(amount) as total_sales,
count(*) as orders
from dt_silver_orders
group by order_date;
Turn them on:
alter dynamic table dt_bronze_orders resume;
alter dynamic table dt_silver_orders resume;
alter dynamic table dt_gold_daily_sales resume;
Result: Snowflake maintains the DAG automatically; changes in raw_orders roll forward to gold within your TARGET_LAG windows.
Query patterns & indexing (Snowflake-style)
Snowflake doesn’t have user-managed indexes; performance hinges on:
- Clustering for large tables (a quick way to check it is shown after this list): alter table bronze_orders cluster by (to_date(ts), customer_id);
- Micro-partition pruning: design partition-friendly keys (e.g., dates, tenants).
- Result caching and Warehouse right-sizing for bursty loads.
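To check whether a clustering key is earning its keep, inspect clustering depth before and after reclustering. A quick sketch against the bronze_orders table above (the expressions must match the key you chose):
-- Reports clustering depth and partition overlap for the given expressions
select system$clustering_information('bronze_orders', '(to_date(ts), customer_id)');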
SCD Type 2 with Streams + Tasks (concise pattern)
-- Assume dim_customer_scd2(history table) and a stream on staging_customers
create or replace stream stg_customers_stream on table staging_customers;
create or replace task t_dim_customer_scd2
warehouse = etl_xs
when system$stream_has_data('STG_CUSTOMERS_STREAM')
as
merge into dim_customer_scd2 d
using (
select * from stg_customers_stream
) s
on d.customer_id = s.customer_id
and d.is_current = true
when matched and (hash(d.name, d.segment) <> hash(s.name, s.segment)) then
update set d.is_current = false, d.valid_to = current_timestamp()
when not matched then
insert (customer_id, name, segment, is_current, valid_from, valid_to)
values (s.customer_id, s.name, s.segment, true, current_timestamp(), null);
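Note that this MERGE closes the old current row but does not insert the replacement version for changed customers; a second DML is needed. Because a stream’s offset only advances when the transaction that read it commits, run both statements inside one explicit transaction (for example, from a stored procedure the task calls) so they see the same delta. A minimal sketch of that second step, assuming the same columns and one row per customer in the delta:
-- Step 2: open a new current version for customers whose old row was just closed.
-- New customers already got a current row from the MERGE, and unchanged customers
-- still have one, so NOT EXISTS targets only the changed customers.
insert into dim_customer_scd2 (customer_id, name, segment, is_current, valid_from, valid_to)
select s.customer_id, s.name, s.segment, true, current_timestamp(), null
from stg_customers_stream s
where not exists (
  select 1
  from dim_customer_scd2 d
  where d.customer_id = s.customer_id
    and d.is_current = true
);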
Monitoring & observability
- TASK_HISTORY() & INFORMATION_SCHEMA.TASK_HISTORY for outcomes/duration (example queries follow this list).
- SHOW DYNAMIC TABLES and SYSTEM$EXPLAIN_DYNAMIC_TABLE_REFRESH_PLAN to inspect the DT DAG and its work.
- Add telemetry tables: write a row per run with row counts from SYSTEM$STREAM_HAS_DATA, RESULT_SCAN() diagnostics, and QUERY_HISTORY() links.
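A couple of starting-point queries along those lines; column names reflect the INFORMATION_SCHEMA table functions as currently documented, so adjust if your account differs:
-- Recent task runs: outcome, error, and timing
select name, state, error_message, scheduled_time, completed_time
from table(information_schema.task_history(result_limit => 100))
order by scheduled_time desc;
-- Recent dynamic table refreshes and their timing
select name, state, refresh_start_time, refresh_end_time
from table(information_schema.dynamic_table_refresh_history())
order by refresh_start_time desc;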
Cost control playbook
- Use tiny warehouses (XS) for frequent, quick tasks; scale up only where joins/materializations demand it (a sample definition follows this list).
- Prefer WHEN system$stream_has_data and DT TARGET_LAG over tight schedules.
- Cluster only the big boys, and recluster on demand, not reflexively.
- Keep stream retention aligned to replay windows (governed by the table’s data retention and MAX_DATA_EXTENSION_TIME_IN_DAYS, default 14 days; verify for your account).
- Avoid multiple consumers of the same stream; fan-out with tables or views, not duplicate reads.
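As a concrete starting point, the kind of warehouse the task examples assume looks roughly like this (the name etl_xs matches the examples above; size and timeouts are suggestions, not prescriptions):
-- Small, auto-suspending warehouse for frequent, short-running ELT tasks
create warehouse if not exists etl_xs
  warehouse_size = 'XSMALL'
  auto_suspend = 90          -- seconds idle before suspend; see the pitfalls below
  auto_resume = true
  initially_suspended = true;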
Common pitfalls (and fixes)
- Consuming a stream twice → the second consumer sees nothing. Fix: a single consumer reads the stream and writes a staging table for others (see the sketch after this list).
- Orphaned streams with retention expired → deltas lost. Fix: monitor STALE_AFTER metrics; don’t pause upstream for long.
- DT surprise compute from long chains → Fix: set per-DT warehouses, widen TARGET_LAG where freshness is overkill, break up expensive joins.
- Non-idempotent merges → duplicates on retries. Fix: ensure deterministic MERGE keys and guard with de-dup staging CTEs.
- Warehouse thrash (auto-suspend/auto-resume flapping) → Fix: tune auto-suspend to 60–120s on chatty workloads.
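A minimal sketch of the single-consumer fan-out fix from the first pitfall, reusing raw_orders_stream and the tick task from earlier (the orders_changes table and task name are illustrative):
-- One task owns the stream and materializes each cycle's delta for everyone else
create or replace table orders_changes like raw_orders;
create or replace task t_fanout_orders_changes
  warehouse = etl_xs
  after t_tick
  when system$stream_has_data('RAW_ORDERS_STREAM')
as
insert overwrite into orders_changes
select order_id, customer_id, ts, amount, _load_file, _ingested_at
from raw_orders_stream;
-- Downstream tasks (bronze merge, audits, metrics) chain AFTER this task and read
-- orders_changes instead of the stream, so the stream is consumed exactly once per cycle.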
Putting it together (reference blueprint)
Option A: Event-driven ELT with Streams + Tasks
- Snowpipe loads → RAW_* tables.
- STREAM on raw → WHEN system$stream_has_data tasks.
- Chain tasks with AFTER into bronze/silver/gold merges.
- One tiny tick task wakes the graph; business logic only runs if there’s work.
Option B: Declarative ELT with Dynamic Tables
- Define DTs for bronze/silver/gold with explicit TARGET_LAG.
- Snowflake maintains the DAG and incremental refresh.
- Use Streams+Tasks selectively where stateful upserts/SCD are required.
Internal link ideas (for your site)
- “External Tables & Snowpipe: Designing the Landing Zone”
- “SCD Type 2 Patterns in Snowflake (Streams vs. Dynamic Tables)”
- “Cost Tuning Snowflake Warehouses for ELT”
- “Idempotent SQL: Writing Safe MERGE Statements”
- “Observability: Query History, Task History, and DT Refresh Plans”
Conclusion + Takeaways
Ditch cron tinkering. Anchor your Snowflake pipelines to data and freshness targets:
- Streams expose precisely what changed.
- Tasks run only when work exists and chain cleanly.
- Dynamic Tables maintain transformation graphs to an SLO you pick.
You’ll ship faster, debug less, and keep compute proportional to actual changes.
Image prompt
“A clean, modern data architecture diagram showing Snowflake stages ingesting to RAW, a Stream feeding a Task that MERGEs to BRONZE, and a parallel path of Dynamic Tables (BRONZE → SILVER → GOLD) with target lags. Minimalistic, high contrast, 3D isometric style, labeled arrows, Snowflake logo accents.”
Tags
#Snowflake #DataEngineering #ELT #DynamicTables #Streams #Tasks #CDC #Scalability #Architecture #BestPractices