Snowpark for Python in Production: UDFs/UDAFs, Secure ML, and When Not to Use It

Your data’s already in Snowflake, your team writes Python, and product wants “a quick ML score in the dashboard.” You’re tempted to stuff the logic into a Python UDF and call it a day. Sometimes that’s exactly right. Sometimes that’s how you inherit an un-monitorable, slow black box. This guide shows where Snowpark for Python shines in production, and where external compute is the better call.


Why this matters (Executive TL;DR)

  • Snowpark for Python lets you run Python close to the data: UDFs, UDTFs, UDAFs, stored procedures, and batch/ML pipelines without egress.
  • It’s ideal for data-in-place transformation, batch inference, and governed extensions of SQL.
  • It’s the wrong tool for heavy GPU training, latency-critical online inference, or bleeding-edge libraries not supported in your Snowflake runtime.
  • Production success = package discipline, governance (SECURE, tags, policies), observability, and a hard bias toward built-ins before Python.

Architecture & Concepts (clear mental model)

Where code runs: Snowpark Python executes inside Snowflake compute (warehouses). Your Python is sandboxed. Data never leaves unless you explicitly allow external access.

Core building blocks:

  • Scalar UDFs – row → single value (good for custom scoring, parsing).
  • UDTFs – row → set of rows (tokenize, explode).
  • UDAFs – many rows → single value (custom aggregates).
  • Procedures – orchestration and side-effecting tasks (DDL, job control).
  • Vectorized (pandas) UDFs – batch input → batch output; faster than per-row Python when supported.
  • Security & Governance – SECURE objects, masking/row-access policies, object tags, external access integrations, secrets.

Golden rule: If Snowflake has a native function or SQL expression that solves it, use that first. UDFs restrict some optimizations (predicate pushdown, join reordering) and add a Python runtime boundary.
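For instance, a lot of “custom parsing” never needs Python at all. A minimal sketch using only native expressions (table and column names are illustrative):

SELECT
    LOWER(TRIM(email))                              AS email_norm,   -- native string functions
    TRY_TO_TIMESTAMP(raw_ts, 'YYYY-MM-DD HH24:MI')  AS event_ts,     -- native, null-safe parsing
    raw_payload:customer.id::STRING                 AS customer_id   -- VARIANT path access + cast
FROM raw_events;

Everything above stays inside the SQL engine, so pruning and pushdown keep working; reach for a UDF only when an expression like this becomes unreadable or impossible.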


When to use Snowpark vs External Compute

Decision axes (Snowpark vs. external compute such as Databricks/SageMaker/Ray):

  • Data Gravity
    • Snowpark: data is large and already in Snowflake; egress is costly/slow.
    • External: data spans multiple lakes/systems; you need cross-platform DSP/ETL.
  • Workload Type
    • Snowpark: batch feature engineering; batch inference inside SQL pipelines; lightweight classical ML.
    • External: distributed training, GPUs/TPUs, online endpoints, streaming beyond Snowflake scope.
  • Latency
    • Snowpark: minutes → hours SLAs; dashboard batch refresh; “run in query”.
    • External: sub-100ms online inference; autoscaling microservices.
  • Libraries
    • Snowpark: supported packages via Snowflake Anaconda or staged wheels; deterministic deps.
    • External: you need CUDA, custom C++ ops, or niche libs not supported in the Snowpark runtime.
  • Ops/Governance
    • Snowpark: you want a single RBAC plane, tags, data policies, access history.
    • External: you need K8s-level control, custom images, sidecars, VPC peering patterns.
  • Cost Model
    • Snowpark: compute near data, no egress; Snowpark-optimized WH for vectorized ops.
    • External: long-running clusters are cheaper for massive, sustained jobs or GPU training.

Heuristic:

  • Batch scoring for a BI table? Snowpark UDF.
  • ETL w/ Python transforms on VARIANT/JSON? Snowpark UDF/UDTF.
  • Train XGBoost on 2B rows with GPUs? External.
  • Real-time REST scoring at 30ms p95? External endpoint, write results back to Snowflake.

Production Patterns (with minimal, correct code)

1) Scalar Python UDF (deterministic, light deps)

CREATE OR REPLACE SECURE FUNCTION util.normalize_email(s STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
HANDLER = 'normalize'
AS
$$
def normalize(s: str) -> str:
    if s is None:
        return None
    s = s.strip().lower()
    # cheap guardrails — keep UDF deterministic & pure
    return s
$$;

Why this pattern works:

  • SECURE prevents result leakage in caches/shares.
  • No external calls or heavy imports → predictable, scalable.
  • Pure function → easy to test, easy to reason about.
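Calling it is plain SQL; a minimal usage sketch (table and column names are assumptions):

-- Illustrative usage of the UDF above
SELECT util.normalize_email(email) AS email_norm
FROM crm.contacts
LIMIT 100;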

2) Vectorized (pandas) UDF for speed

CREATE OR REPLACE SECURE FUNCTION feature.tokenize(text STRING)
RETURNS ARRAY
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('pandas')
HANDLER = 'tokenize'
AS
$$
import pandas as pd
from _snowflake import vectorized

@vectorized(input=pd.DataFrame)
def tokenize(df: pd.DataFrame) -> pd.Series:
    # batches arrive as a DataFrame with one column per argument; column 0 is `text`
    text = df[0]
    # vectorized clean + split; returns a list of tokens per element
    return text.fillna("").str.lower().str.replace(r'[^a-z0-9 ]', '', regex=True).str.split()
$$;

Notes:

  • The handler receives whole batches as pandas objects (DataFrame in, Series out) → far fewer Python boundary crossings than per-row UDFs.
  • Keep package set small; pin versions if you depend on behavior.
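Usage is the same as any scalar function; a sketch that also explodes the returned array (table and column names are assumptions):

-- Illustrative usage: one row per token via LATERAL FLATTEN
SELECT d.doc_id, t.value::STRING AS token
FROM docs.articles d,
     LATERAL FLATTEN(input => feature.tokenize(d.body)) t;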

3) Custom UDAF (aggregate) via Snowpark Python

# register from a Snowpark session (Python client)
# assumes an existing session, e.g. session = Session.builder.configs(conn_params).create()
from snowflake.snowpark import Session
from snowflake.snowpark.types import FloatType

class RobustAvg:
    def __init__(self):
        self.n = 0
        self.s = 0.0

    @property
    def aggregate_state(self):
        # intermediate state exchanged between parallel aggregation partitions
        return (self.n, self.s)

    def accumulate(self, x):
        if x is not None:
            self.n += 1
            self.s += float(x)

    def merge(self, other_state):
        # merge() receives another partition's aggregate_state, not another instance
        n, s = other_state
        self.n += n
        self.s += s

    def finish(self):
        return None if self.n == 0 else self.s / self.n

session.udaf.register(
    RobustAvg,
    return_type=FloatType(),
    input_types=[FloatType()],
    name="metrics.robust_avg",
    replace=True,
    is_permanent=True,
    stage_location="@prod_code_stage",
)

Guidance: Only implement UDAFs when there’s no equivalent in SQL (e.g., robust stats, sketch structures). Prefer native AVG, APPROX_*, HLL_* where possible.
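Once registered, it behaves like any other aggregate; a usage sketch (table and column names are assumptions):

-- Illustrative usage of the custom aggregate
SELECT customer_id, metrics.robust_avg(order_amount) AS avg_amount
FROM sales.orders
GROUP BY customer_id;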


4) Packaging & Dependencies (the part that bites teams)

Rules of thumb:

  • Prefer Anaconda-available packages via PACKAGES = ('pandas==2.2.2','numpy==1.26.*'). Pin versions.
  • For custom libs, upload wheels/zip to an internal stage and reference via IMPORTS/session.add_import() or CREATE FUNCTION … IMPORTS=('@stage/my_pkg.whl'). Keep them small.
  • Single source of truth: store environment.yml (or a constraints .txt) in Git; generate DDL from it in CI (see the sketch after this list).
  • No network in UDFs by default. If you must call out, use External Access Integration + Secret and justify it in review.
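A minimal sketch of that CI step, assuming a conventional environment.yml whose dependencies are pinned strings like 'pandas==2.2.2' (file name and layout are assumptions):

# Minimal CI sketch: render a PACKAGES clause from environment.yml
# (file name, layout, and "name==version" pin style are assumptions)
import yaml  # PyYAML

def packages_clause(env_file: str = "environment.yml") -> str:
    with open(env_file) as f:
        env = yaml.safe_load(f)
    # keep plain string deps such as "pandas==2.2.2"; skip nested pip blocks
    deps = [d for d in env.get("dependencies", []) if isinstance(d, str)]
    return "PACKAGES = (" + ", ".join(f"'{d}'" for d in deps) + ")"

if __name__ == "__main__":
    # emit the clause so a CI template can splice it into CREATE FUNCTION DDL
    print(packages_clause())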

DDL example with custom import:

CREATE OR REPLACE SECURE FUNCTION nlp.score(text STRING)
RETURNS FLOAT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('numpy==1.26.4')
IMPORTS = ('@prod_code_stage/nlp_model-1.0.0-py3-none-any.whl')
HANDLER = 'score'
AS
$$
from nlp_model import predictor
def score(text):
    return float(predictor.predict(text or ""))
$$;
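The wheel referenced in IMPORTS has to be on the stage first; a hedged example using SnowSQL's PUT (the local path is an assumption):

-- Upload the built wheel to the code stage (run from SnowSQL; local path is illustrative)
PUT file://dist/nlp_model-1.0.0-py3-none-any.whl @prod_code_stage
    AUTO_COMPRESS = FALSE
    OVERWRITE = TRUE;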

Secure ML in Snowflake (governed patterns)

Recommended flows:

  1. Feature engineering in-warehouse (Snowpark/SQL). Persist features as governed tables with tags.
  2. Training
    • Small/medium classical: use Snowpark Python in a dedicated warehouse; write artifacts to a stage.
    • GPU/Deep learning: train externally (SageMaker/Databricks).
    • Register models (if you use Snowflake’s model registry or a central registry); persist metadata & version tags.
  3. Batch inference with a SECURE scalar/pandas UDF that loads a frozen artifact (no network).
  4. Scheduling with TASK + PROCEDURE to write scored tables for BI (don’t score on the fly for every dashboard query).
  5. Governance:
    • Tag model/version/dataset lineage (TAG model_name, model_version, training_data_scn).
    • Row-access/masking on inputs to avoid unintended exposure (see the masking sketch after this list).
    • Access History for audit.
    • Keep UDFs deterministic—nondeterminism complicates audits and caching behavior.
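A minimal masking-policy sketch, assuming a role model with an ANALYST_FULL role and an email column on the input table (role, schema, and column names are assumptions):

-- Minimal masking sketch; role, schema, and column names are assumptions
CREATE OR REPLACE MASKING POLICY gov.email_mask AS (val STRING) RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('ANALYST_FULL') THEN val
        ELSE '***MASKED***'
    END;

ALTER TABLE RAW_APPS MODIFY COLUMN email SET MASKING POLICY gov.email_mask;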

Anti-pattern: online, per-row UDF that downloads a model from the internet on every call. Use preloaded imports; for online, use an external service.


Best Practices (hard-won)

  • Bias to SQL first. Only reach for Python when the expression is unreadable or impossible in SQL.
  • Keep UDFs tiny and pure. No I/O, no global state, no randomness.
  • Vectorize or don’t. If performance matters and your logic fits, prefer pandas UDFs.
  • Pin everything. Runtime version, package versions, and model artifact checksum/version.
  • Make it SECURE. Default new prod functions to SECURE.
  • Permanent + staged code. Use is_permanent=True and a prod stage; never register from a laptop ad-hoc for prod.
  • Observability:
    • Query/Task history for runtimes & errors.
    • Emit lightweight metrics to a control table (proc_metrics) at step boundaries in procedures (minimal DDL after this list).
    • Tag every deployed object with app, owner, version, risk_level.
  • Warehouse choice:
    • For pandas UDFs or heavy Python: Snowpark-optimized warehouses.
    • Right-size and autosuspend aggressively.
  • Review gates:
    • CI builds the artifact wheel/zip, validates imports, runs a canary in a dev Snowflake account, then promotes.
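A minimal control-table sketch matching the PROC_METRICS inserts used later (the column set is an assumption):

-- Minimal metrics control table; column set is an assumption
CREATE TABLE IF NOT EXISTS PROC_METRICS (
    EVENT   STRING,                                     -- step/event name
    DETAILS STRING,                                     -- free-form detail (the example proc logs a timestamp here)
    TS      TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()   -- insert time
);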

Common Pitfalls (and fixes)

  • Pitfall: “It’s slow.”
    Fix: Remove Python; express as SQL; if not possible, switch to pandas UDF; shrink imports; avoid per-row heavy init; ensure warehouse is Snowpark-optimized and sized correctly.
  • Pitfall: “Works in dev, explodes in prod.”
    Fix: Pin versions; promote immutable wheels; never rely on implicit latest packages.
  • Pitfall: “Security won’t approve.”
    Fix: Use SECURE objects, no outbound calls, attach data-classification tags, show Access History lineage. If external calls required, define External Access Integration + Secret with least privilege.
  • Pitfall: “BI queries call the UDF 10M times.”
    Fix: Shift to batch scoring into a table; BI reads the materialized column.
  • Pitfall: “We need real-time.”
    Fix: Build an external inference endpoint. Use Snowflake for features and persistence; write back results or join via CDC.

End-to-End Example: Batch ML Scoring with a SECURE UDF

1) Register inference UDF with frozen model import

CREATE OR REPLACE SECURE FUNCTION ml.credit_score(v VARIANT)
RETURNS FLOAT
LANGUAGE PYTHON
RUNTIME_VERSION='3.10'
PACKAGES=('pandas==2.2.2','numpy==1.26.4')
IMPORTS=('@prod_code_stage/credit_model-2.1.0.whl')
HANDLER='serve.predict'
AS
$$
import pandas as pd
from credit_model import model

def predict(v):
    df = pd.DataFrame([v])  # expect v as object/VARIANT with fields
    return float(model.predict(df)[0])
$$;

2) Batch inference into a governed table

CREATE OR REPLACE TABLE SCORED_APPS AS
SELECT a.APP_ID,
       ml.credit_score(OBJECT_CONSTRUCT(*)) AS SCORE,
       CURRENT_TIMESTAMP() AS SCORED_AT,
       '2.1.0' AS MODEL_VERSION
FROM RAW_APPS a;

-- add tags for lineage & ownership (tags must exist before they can be set)
CREATE TAG IF NOT EXISTS model_name;
CREATE TAG IF NOT EXISTS model_version;
ALTER TABLE SCORED_APPS SET TAG model_name = 'credit', model_version = '2.1.0';

3) Schedule with a task calling a procedure (retry, metrics)

CREATE OR REPLACE PROCEDURE jobs.score_apps()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION='3.10'
PACKAGES=('snowflake-snowpark-python')
HANDLER='run'
AS
$$
from snowflake.snowpark import Session
def run(session: Session):
    session.sql("""
        MERGE INTO SCORED_APPS t
        USING (SELECT APP_ID, ml.credit_score(OBJECT_CONSTRUCT(*)) AS SCORE FROM RAW_APPS) s
        ON t.APP_ID = s.APP_ID
        WHEN MATCHED THEN UPDATE SET SCORE = s.SCORE, SCORED_AT = CURRENT_TIMESTAMP()
        WHEN NOT MATCHED THEN INSERT (APP_ID, SCORE, SCORED_AT, MODEL_VERSION)
             VALUES (s.APP_ID, s.SCORE, CURRENT_TIMESTAMP(), '2.1.0')
    """).collect()
    session.sql("INSERT INTO PROC_METRICS(EVENT,DETAILS) VALUES('score_apps_ok',CURRENT_TIMESTAMP())").collect()
    return 'ok'
$$;

CREATE OR REPLACE TASK task_score_apps
  WAREHOUSE = SPARKY_M
  SCHEDULE = 'USING CRON 0 * * * * America/New_York'
AS CALL jobs.score_apps();
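Newly created tasks start suspended; resume after review:

-- Tasks are created in a suspended state
ALTER TASK task_score_apps RESUME;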

Governance Checklist (ship this with your PR)

  • SECURE on UDF/UDTF/UDAF/procedure
  • Runtime + packages pinned
  • Imports from versioned stage path (@prod_code_stage/.../v2.1.0/...)
  • No network calls, or approved External Access Integration + Secret
  • Object TAGS: owner, app, model_name, model_version, pii_level
  • Access policies on input tables (masking/row-access)
  • Observability: query/task history, control-table metrics
  • CI: build wheel, dependency audit, canary run in dev account

Internal link ideas (add these to your site)

  • “Snowflake Masking & Row-Access Policies: A Practical Guide”
  • “Pandas UDFs vs SQL: When Does Vectorization Pay Off in Snowflake?”
  • “External Access Integrations & Secrets: Safe Patterns”
  • “Batch vs Online Inference: An Architecture Decision Record Template”
  • “Pinning Python Environments in Snowflake: environment.yml to DDL”

Conclusion & Takeaways

  • Use Snowpark where data lives for governed, batch-oriented Python and ML scoring.
  • Stay boring: keep functions pure, packages pinned, and rely on SQL where possible.
  • Know when to say no: for GPUs, real-time SLAs, or exotic deps, move to external compute and integrate via pipelines.
  • Governance isn’t optional: SECURE, tags, policies, and auditable deployments turn Python-in-warehouse from a hack into a product.

CTA: Want me to tailor this into your team’s deployment checklist and CI templates (DDL + Python scaffolds)? Say the word and I’ll generate a ready-to-use repo layout.


Image prompt

“A clean, modern architecture diagram of Snowflake with Snowpark Python UDFs/UDAFs and a batch ML scoring flow: secure UDF, staged model artifact, task + procedure scheduling, and governance layers (tags, masking, access history). Minimalist, high-contrast, isometric style.”

Tags

#Snowpark #Snowflake #Python #UDF #UDAF #MLOps #DataEngineering #Governance #BatchInference #Pandas
