Snowflake Procedures & Tasks: The Reliable Automation Backbone for Your Data Platform

Hook: You’ve got hourly ingestions, daily dimensional upserts, and a weekly backfill that only runs if yesterday’s load succeeded. If you’re still wiring all of this together only in an external orchestrator, you’re leaving reliability (and visibility) on the table. In Snowflake, procedures do the work and tasks schedule and chain it, right where the data lives.


Why this matters (fast context)

  • Closer to the data = fewer moving parts. No flaky network hops, no external runners just to execute DML.
  • Built-in DAGs and lineage in Snowflake. You can schedule, chain, and observe with native primitives.
  • Polyglot execution. Use SQL (Snowflake Scripting), JavaScript, or Python (Snowpark) for the best-fit job.
  • Idempotent, transactional ELT. Wrap units of work with atomicity and retries.

Core concepts at a glance

  • Stored Procedures: Reusable routines that can contain control flow, error handling, and DML/DDL.
    • Languages: SQL (Snowflake Scripting), JavaScript, Python (Snowpark).
    • Execution rights: EXECUTE AS CALLER vs. EXECUTE AS OWNER determines whose privileges and object access apply (see the sketch after this list).
  • Tasks: Schedules (CRON or time interval) and DAG chaining via AFTER. Each task runs a single SQL statement (often CALL my_proc(...)) using a warehouse or serverless compute.
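
A minimal sketch of the two rights modes (procedure and schema names here are illustrative, not from the pipeline above):

-- Platform-owned pipeline: always executes with the privileges of the role that owns it
CREATE OR REPLACE PROCEDURE DW.PLATFORM_JOB()
RETURNS VARCHAR
LANGUAGE SQL
EXECUTE AS OWNER
AS
$$
BEGIN
  RETURN 'Executed with the privileges of the owning role.';
END;
$$;

-- Self-service utility: executes with the privileges of whoever calls it
CREATE OR REPLACE PROCEDURE UTIL.SELF_SERVICE_JOB()
RETURNS VARCHAR
LANGUAGE SQL
EXECUTE AS CALLER
AS
$$
BEGIN
  RETURN 'Executed with the privileges of the calling role.';
END;
$$;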

Architecture: How Procedures and Tasks fit together

  1. Ingestion lands raw data (stages/files → external/internal tables).
  2. Stream tracks changes (insert-only or CDC).
  3. Procedure does a targeted merge/upsert into curated tables, with validation and metrics.
  4. Task schedules that procedure and chains dependencies: load_raw → transform_dim → transform_fact → quality_gate → publish.

Visually, think of a Snowflake-native DAG: each node is a TASK calling a PROCEDURE. Success/failure is recorded for audit and re-runs.
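
For step 2, a stream on the raw table is what the procedures below consume. A minimal sketch (names match the later examples; adjust to your schema):

-- Change tracking on the raw customer table; the dimension MERGE reads from this stream
CREATE OR REPLACE STREAM STG.CUSTOMER_STREAM
  ON TABLE RAW.CUSTOMER;

-- Insert-only variant when upstream never updates or deletes (append-only pattern)
CREATE OR REPLACE STREAM STG.SALES_STREAM
  ON TABLE RAW.SALES
  APPEND_ONLY = TRUE;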


When to choose which procedure language

| Criterion | SQL (Snowflake Scripting) | JavaScript | Python (Snowpark) |
|---|---|---|---|
| Primary workload | DDL/DML-heavy, set-based ops | Mixed control flow + SQL | DataFrames, UDFs, Python libs |
| Learning curve | Easiest for SQL folks | Moderate | Moderate (package mgmt) |
| Perf profile | Great for set-based SQL | Great for orchestrating SQL | Great for data transforms in DF API |
| External deps | None | None (usually) | Requires PACKAGES, version mgmt |
| Use cases | MERGE, loops, error handling in SQL | Multi-step jobs, conditional SQL | Complex transforms, ML prep, feature eng |

Rule of thumb: If it’s mostly set-based SQL, use SQL procedures. If you need flexible control flow around SQL, JavaScript is fine. If you want DataFrame APIs or Python libs inside Snowflake, use Snowpark Python.


Concrete examples

1) Snowflake Scripting Procedure (SQL) — Idempotent upsert with quality checks

CREATE OR REPLACE PROCEDURE DW.LOAD_DIM_CUSTOMER()
RETURNS VARCHAR
LANGUAGE SQL
EXECUTE AS OWNER
AS
$$
DECLARE
  v_rows_loaded NUMBER := 0;
BEGIN
  -- Upsert from stream into dimension
  MERGE INTO DW.DIM_CUSTOMER d
  USING (
    SELECT * FROM STG.CUSTOMER_STREAM  -- stream on raw/customer
  ) s
  ON d.CUSTOMER_ID = s.CUSTOMER_ID
  WHEN MATCHED THEN UPDATE SET
    EMAIL = s.EMAIL,
    UPDATED_AT = CURRENT_TIMESTAMP()
  WHEN NOT MATCHED THEN INSERT (
    CUSTOMER_ID, EMAIL, CREATED_AT, UPDATED_AT
  ) VALUES (
    s.CUSTOMER_ID, s.EMAIL, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP()
  );

  v_rows_loaded := SQLROWCOUNT;

  -- Simple quality gate
  IF v_rows_loaded = 0 THEN
    RETURN 'No changes found — nothing to load.';
  END IF;

  -- Record a metric for observability
  INSERT INTO DW.METRICS_LOAD_AUDIT(JOB_NAME, ROWS_AFFECTED, RUNTIME_TS)
  VALUES ('LOAD_DIM_CUSTOMER', v_rows_loaded, CURRENT_TIMESTAMP());

  RETURN 'Loaded rows: ' || v_rows_loaded;
EXCEPTION
  WHEN OTHER THEN
    INSERT INTO DW.METRICS_LOAD_AUDIT(JOB_NAME, ERROR, RUNTIME_TS)
    VALUES ('LOAD_DIM_CUSTOMER', :SQLERRM, CURRENT_TIMESTAMP());
    RAISE;
END;
$$;
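
The procedure assumes an audit table already exists; one possible shape, plus a quick smoke test (column names match the INSERTs above):

-- One row per job execution; ERROR stays NULL on success
CREATE TABLE IF NOT EXISTS DW.METRICS_LOAD_AUDIT (
  JOB_NAME      VARCHAR,
  ROWS_AFFECTED NUMBER,
  ERROR         VARCHAR,
  RUNTIME_TS    TIMESTAMP_NTZ
);

-- Smoke test: run the proc once and inspect the latest audit rows
CALL DW.LOAD_DIM_CUSTOMER();
SELECT * FROM DW.METRICS_LOAD_AUDIT ORDER BY RUNTIME_TS DESC LIMIT 5;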

2) JavaScript Procedure — Orchestrate conditional logic

CREATE OR REPLACE PROCEDURE DW.RUN_FACT_LOAD()
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS OWNER
AS
$$
var sqlCmd;

sqlCmd = "CALL DW.LOAD_DIM_CUSTOMER()";
snowflake.execute({sqlText: sqlCmd});

sqlCmd = "SELECT COUNT(*) AS C FROM STG.SALES_STREAM";
var rs = snowflake.execute({sqlText: sqlCmd});
rs.next();
var pending = rs.getColumnValue('C');

if (pending > 0) {
  snowflake.execute({sqlText: "CALL DW.LOAD_FACT_SALES()"});
  return "FACT load executed. Pending rows: " + pending;
} else {
  return "No sales changes; FACT load skipped.";
}
$$;
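
DW.LOAD_FACT_SALES is called above but not defined in this post; a minimal sketch of what it might look like, assuming STG.SALES_STREAM and a fact table keyed on SALE_ID (column names are hypothetical):

CREATE OR REPLACE PROCEDURE DW.LOAD_FACT_SALES()
RETURNS VARCHAR
LANGUAGE SQL
EXECUTE AS OWNER
AS
$$
BEGIN
  -- Idempotent upsert keyed on the natural key, so retries don't double-count
  MERGE INTO DW.FACT_SALES f
  USING STG.SALES_STREAM s
  ON f.SALE_ID = s.SALE_ID
  WHEN MATCHED THEN UPDATE SET
    AMOUNT = s.AMOUNT,
    UPDATED_AT = CURRENT_TIMESTAMP()
  WHEN NOT MATCHED THEN INSERT (SALE_ID, CUSTOMER_ID, AMOUNT, SALE_DATE, UPDATED_AT)
  VALUES (s.SALE_ID, s.CUSTOMER_ID, s.AMOUNT, s.SALE_DATE, CURRENT_TIMESTAMP());

  RETURN 'Fact rows merged: ' || SQLROWCOUNT;
END;
$$;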

3) Snowpark Python Procedure — DataFrame transform with package pinning

CREATE OR REPLACE PROCEDURE DW.AGG_SESSIONS_DAILY()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('snowflake-snowpark-python')
HANDLER = 'main'
EXECUTE AS OWNER
AS
$$
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as ssum

def main(session: Session) -> str:
    df = session.table("RAW.SESSIONS")
    agg = (df.group_by(col("event_date"))
             .agg(ssum(col("duration")).alias("total_duration")))
    agg.write.save_as_table("DW.SESSIONS_DAILY", mode="overwrite")
    return "Aggregated daily sessions."
$$;
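
A quick way to exercise it and check the result (table names as in the procedure; the packages check assumes the INFORMATION_SCHEMA.PACKAGES view is available in your account):

-- Confirm the pinned package exists for your runtime before deploying
SELECT PACKAGE_NAME, VERSION
FROM INFORMATION_SCHEMA.PACKAGES
WHERE PACKAGE_NAME = 'snowflake-snowpark-python'
LIMIT 5;

CALL DW.AGG_SESSIONS_DAILY();
SELECT * FROM DW.SESSIONS_DAILY LIMIT 10;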

Scheduling & chaining with Tasks

Basic time-based schedule (CRON)

CREATE OR REPLACE TASK DW.TASK_LOAD_DIM_CUSTOMER
  WAREHOUSE = ETL_WH
  SCHEDULE = 'USING CRON 0 * * * * UTC'  -- top of each hour
AS
  CALL DW.LOAD_DIM_CUSTOMER();

ALTER TASK DW.TASK_LOAD_DIM_CUSTOMER RESUME;
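
After creating and resuming a task, it's worth confirming its state and definition, for example:

-- Confirm the task exists, is started (resumed), and shows the expected schedule
SHOW TASKS LIKE 'TASK_LOAD_DIM_CUSTOMER' IN SCHEMA DW;

-- Inspect definition, warehouse, and predecessors
DESCRIBE TASK DW.TASK_LOAD_DIM_CUSTOMER;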

Build a DAG with AFTER

-- A predecessor must be suspended before child tasks are attached to it
ALTER TASK DW.TASK_LOAD_DIM_CUSTOMER SUSPEND;

CREATE OR REPLACE TASK DW.TASK_RUN_FACT_LOAD
  WAREHOUSE = ETL_WH
  AFTER DW.TASK_LOAD_DIM_CUSTOMER
AS
  CALL DW.RUN_FACT_LOAD();

-- Enable the chain (children first, then the root)
ALTER TASK DW.TASK_RUN_FACT_LOAD RESUME;
ALTER TASK DW.TASK_LOAD_DIM_CUSTOMER RESUME;

Serverless Tasks (no warehouse selection)

CREATE OR REPLACE TASK DW.TASK_AGG_SESSIONS_DAILY
  SCHEDULE = 'USING CRON 5 2 * * * UTC'
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
AS
  CALL DW.AGG_SESSIONS_DAILY();

Notes

  • Use serverless when you don’t want to manage a warehouse; Snowflake bills per-second for compute used.
  • For DAG enable/disable, resume/suspend the root and its children, or use SYSTEM$TASK_DEPENDENTS_ENABLE to resume a root task and all of its dependents in one call (example below).
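
For example, enabling a whole DAG from its root:

-- Resumes the named root task and every downstream task in the DAG
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE('DW.TASK_LOAD_DIM_CUSTOMER');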

Observability & operations

  • Run history
    • SNOWFLAKE.ACCOUNT_USAGE.TASK_HISTORY (or the INFORMATION_SCHEMA.TASK_HISTORY table function) for executions, states, and error messages (queried in the sketch after this list).
    • Procedure calls appear in QUERY_HISTORY with text like CALL DW.PROC(...).
  • Manual re-run
    • EXECUTE TASK for an immediate, out-of-schedule run, or CALL the procedure directly (shown after this list).
  • Pausing the DAG
    • ALTER TASK ... SUSPEND; (and RESUME;). Suspend the root to halt downstream tasks.
  • Least surprise
    • Configure failure notifications with the task's ERROR_INTEGRATION parameter (a notification integration) or Snowflake Alerts, or integrate with external observability (webhooks, Slack, etc.) through your orchestrator.
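
Two of the checks above as concrete statements (task name from the earlier examples):

-- Recent runs of a specific task: state, errors, timing
SELECT NAME, STATE, ERROR_MESSAGE, SCHEDULED_TIME, COMPLETED_TIME
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'TASK_LOAD_DIM_CUSTOMER'))
ORDER BY SCHEDULED_TIME DESC
LIMIT 20;

-- Trigger an immediate, out-of-schedule run
EXECUTE TASK DW.TASK_LOAD_DIM_CUSTOMER;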

Best practices (what actually works in production)

Architecture

  • Thin tasks, fat procedures. Keep tasks as simple callers; put logic in procedures where you can test and reuse.
  • Small transactional units. Prefer multiple targeted procedures over one giant “do everything” job; easier retries and lower blast radius.
  • Idempotency first. Use Streams or natural keys to ensure re-runs don’t double-count. Write MERGEs, not blind INSERTs.
  • Use OWNER'S RIGHTS for platform-owned pipelines to avoid per-caller grant headaches. For self-service utilities, use CALLER'S RIGHTS.

Performance & cost

  • Choose warehouse sizes intentionally; many short tasks favor a smaller warehouse with aggressive auto-suspend (see the sketch after this list).
  • For spiky schedules, serverless tasks simplify ops and can be cheaper.
  • Batch small steps into one procedure call to amortize startup overhead.
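
For instance, a small ETL warehouse tuned for many short task runs (warehouse name from the examples above; the thresholds are illustrative):

ALTER WAREHOUSE ETL_WH SET
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60      -- seconds of idle time before suspending
  AUTO_RESUME    = TRUE;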

Reliability

  • Add row-count assertions and write to an audit table per job (input rows, affected rows, duration, watermark).
  • Put quality gates in place (e.g., “fail if < X rows on a weekday”) to catch silent upstream issues (see the sketch after this list).
  • Avoid circular dependencies in AFTER chains; keep DAGs shallow and legible.
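
A minimal quality-gate sketch in Snowflake Scripting (the table name and the 1,000-row threshold are hypothetical):

CREATE OR REPLACE PROCEDURE DW.CHECK_FACT_SALES_VOLUME()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
  low_volume EXCEPTION (-20001, 'Suspiciously low row count for FACT_SALES today.');
  v_rows NUMBER;
BEGIN
  SELECT COUNT(*) INTO :v_rows
  FROM DW.FACT_SALES
  WHERE SALE_DATE = CURRENT_DATE();

  -- Weekdays (ISO 1-5) are expected to land at least 1,000 rows
  IF (v_rows < 1000 AND DAYOFWEEKISO(CURRENT_DATE()) <= 5) THEN
    RAISE low_volume;
  END IF;

  RETURN 'Volume OK: ' || v_rows;
END;
$$;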

Security

  • Separate execution roles (tasks/procs) from developer roles.
  • Use EXECUTE AS OWNER plus a dedicated execution role to centralize privileges (see the grant sketch after this list).
  • Keep secrets in Snowflake secrets management; don’t embed credentials in code.
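
A sketch of a dedicated execution role (role, database, and warehouse names are illustrative):

CREATE ROLE IF NOT EXISTS ETL_EXECUTOR;

-- Compute and schema access for the pipeline objects
GRANT USAGE ON WAREHOUSE ETL_WH TO ROLE ETL_EXECUTOR;
GRANT USAGE ON DATABASE ANALYTICS TO ROLE ETL_EXECUTOR;
GRANT USAGE ON SCHEMA ANALYTICS.DW TO ROLE ETL_EXECUTOR;

-- Account-level privilege required to run tasks owned by this role
GRANT EXECUTE TASK ON ACCOUNT TO ROLE ETL_EXECUTOR;
-- Additionally required for serverless (user-managed compute) tasks
GRANT EXECUTE MANAGED TASK ON ACCOUNT TO ROLE ETL_EXECUTOR;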

Deployability

  • Treat procedures as code: version them, review PRs, run integration tests that CALL procs in a dev schema.
  • Pin Snowpark RUNTIME_VERSION and PACKAGES for reproducibility.
  • Use CREATE OR REPLACE with care in prod; a two-step deploy (create under a new name, then cut the task over) avoids breaking in-flight calls (see the sketch after this list).
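
A sketch of the two-step cutover (the _V2 suffix is just a convention):

-- Step 1: deploy and test the new version under a new name
-- CREATE OR REPLACE PROCEDURE DW.LOAD_DIM_CUSTOMER_V2() ... ;

-- Step 2: once V2 is validated, point the task at it (the task must be suspended to alter it)
ALTER TASK DW.TASK_LOAD_DIM_CUSTOMER SUSPEND;
ALTER TASK DW.TASK_LOAD_DIM_CUSTOMER MODIFY AS CALL DW.LOAD_DIM_CUSTOMER_V2();
ALTER TASK DW.TASK_LOAD_DIM_CUSTOMER RESUME;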

Common pitfalls

  • Warehouse misconfiguration (or none for non-serverless tasks) → task never runs.
  • Wrong rights mode → proc works in dev but fails in prod for other roles.
  • Non-idempotent loads → duplicates on retries.
  • Long transactions → lock contention and timeouts; chunk large merges.
  • Hidden failures when tasks are suspended after errors; always check history or alerting.

A simple, production-ready DAG you can copy

-- 1) Root task: hourly dimension load
CREATE OR REPLACE TASK DW.T_DIM_CUSTOMER
  WAREHOUSE = ETL_WH
  SCHEDULE = 'USING CRON 0 * * * * UTC'
AS
  CALL DW.LOAD_DIM_CUSTOMER();

-- 2) Downstream fact load only if dim succeeded
CREATE OR REPLACE TASK DW.T_FACT_SALES
  WAREHOUSE = ETL_WH
  AFTER DW.T_DIM_CUSTOMER
AS
  CALL DW.RUN_FACT_LOAD();

-- 3) Session aggregates on serverless compute, run after each fact load
CREATE OR REPLACE TASK DW.T_AGG_SESSIONS_DAILY
  AFTER DW.T_FACT_SALES
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
AS
  CALL DW.AGG_SESSIONS_DAILY();

-- Resume children first, then the root, so nothing fires before the chain is fully enabled
ALTER TASK DW.T_AGG_SESSIONS_DAILY RESUME;
ALTER TASK DW.T_FACT_SALES RESUME;
ALTER TASK DW.T_DIM_CUSTOMER RESUME;

Internal link ideas (for your site)

  • Streams & Change Data Capture in Snowflake (how to build append-only patterns)
  • Snowpark Python: Packaging, UDFs, and Stored Procedures
  • Data Quality Gates with SQL: Assertions & Audit Tables
  • Serverless vs Warehouse-Backed Tasks: Cost & Performance
  • Designing Idempotent MERGE Patterns for ELT

Summary & call to action

Bottom line: Put your routine ELT where your data lives. Use procedures to encapsulate idempotent, testable logic and tasks to schedule and chain it into DAGs. You’ll get fewer gremlins, faster recovery, and better observability—without duct-taping external runners.

Want a quick review of your current Snowflake DAG or help converting Python-only jobs into Snowpark procedures with serverless tasks? Share your schema and run cadence—I’ll critique it ruthlessly and propose a hardened design.


Image prompt

“A clean, modern data architecture diagram showing a Snowflake-native DAG: tasks (nodes) triggering stored procedures across raw → stage → dim → fact layers, with Streams feeding changes and audit tables capturing metrics — minimalistic, high contrast, isometric style.”


Tags

#Snowflake #DataEngineering #ELT #Snowpark #StoredProcedures #Tasks #DataPipelines #SQL #Scalability #Reliability