SQL for Data Engineers: The Language Behind Snowflake, BigQuery, Postgres, and Every Reliable Pipeline

A practical guide to SQL for data engineers: core patterns, window functions, incremental models, performance tips, and where SQL runs—Snowflake to DuckDB.


Why this matters

Every broken dashboard I’ve ever seen boils down to one of two things: bad data, or logic scattered across five code paths. SQL solves the second problem—by putting business logic where the data lives. If you build data models, materialize features, or run ELT in a modern warehouse, SQL is non-negotiable.

This article gives you the essentials data engineers actually use: clear concepts, copy-pasteable patterns, and which engines (Snowflake, BigQuery, Redshift, Postgres, MySQL, SQL Server, Oracle, Databricks/Spark SQL, Trino/Presto, DuckDB, SQLite, MariaDB, Teradata) fit which jobs.


Where SQL runs (and why you’d pick each)

Cloud warehouses (OLAP)

  • Snowflake — Separation of storage/compute, automatic micro-partitioning, great concurrency. Strong for ELT, secure data sharing, Snowpark UDFs.
  • BigQuery — Serverless MPP, partition/cluster at table level, BI Engine for speed. Excellent for huge analytic scans.
  • Amazon Redshift — Classic MPP with sort/dist keys; good when you’re already deep in AWS and want predictable clusters.

Relational databases (OLTP + analytics on smaller data)

  • PostgreSQL — Rich SQL + extensions (PostGIS, Timescale), great indexes and CTEs. Workhorse for application data and modest analytics.
  • MySQL / MariaDB — Massive adoption for apps; fine for light reporting.
  • SQL Server (T-SQL) — Enterprise features, SSIS/SSRS ecosystem.
  • Oracle — Deep enterprise features, very mature optimizer.
  • SQLite — Zero-install file DB; perfect for local dev and tiny jobs.

Query engines & lakehouses

  • Databricks / Spark SQL — SQL on Delta Lake; great with medallion architecture; pairs with notebooks and jobs.
  • Trino/Presto — Federated queries across many sources; popular for interactive analytics at scale.
  • DuckDB — In-process columnar OLAP; amazing for local analytics, CI checks, fast prototyping.

Rule of thumb: Use warehouses/lakehouses for heavy analytics and ELT; Postgres/MySQL for app data and quick reporting; DuckDB for local dev/validation.


Core SQL concepts (no fluff)

Tables, rows, and relations

SQL is a declarative language: you describe what you want, and the engine decides how. The power comes from relations (tables/views) and set operations.

SELECT the minimum viable data

-- Only select the columns you need; avoid SELECT *
SELECT order_id, customer_id, order_total, order_ts
FROM analytics.orders
WHERE order_ts >= DATEADD(day, -7, CURRENT_DATE);

Joins you actually use

-- Left join keeps all orders even if there’s no customer match
SELECT o.order_id, o.order_total, c.segment
FROM analytics.orders AS o
LEFT JOIN analytics.customers AS c
  ON o.customer_id = c.customer_id;
  • INNER = intersection
  • LEFT = “keep the fact table; enrich if possible”
  • ANTI (NOT EXISTS) = find missing keys (great for QA)
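
A quick sketch of the anti-join, reusing the orders and customers tables from the join above: it surfaces orders whose customer key never landed in the dimension, which is exactly the QA question NOT EXISTS answers.

-- Orders with no matching customer (ideally returns zero rows)
SELECT o.order_id, o.customer_id
FROM analytics.orders AS o
WHERE NOT EXISTS (
  SELECT 1
  FROM analytics.customers AS c
  WHERE c.customer_id = o.customer_id
);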

Aggregations and grouping

SELECT DATE_TRUNC('day', order_ts) AS day, SUM(order_total) AS revenue
FROM analytics.orders
GROUP BY 1
ORDER BY 1;

Window functions (your superpower)

-- Per-customer trailing 7-day revenue and per-day customer ranking by spend
-- (ROWS BETWEEN 6 PRECEDING counts rows, so it equals 7 days only when the
--  customer has one row per day with no gaps)
WITH daily AS (
  SELECT DATE_TRUNC('day', order_ts) AS day, customer_id, SUM(order_total) AS rev
  FROM analytics.orders
  GROUP BY 1, 2
)
SELECT
  day,
  customer_id,
  rev,
  SUM(rev) OVER (
    PARTITION BY customer_id
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rev_7d,
  DENSE_RANK() OVER (PARTITION BY day ORDER BY rev DESC) AS rank_by_day
FROM daily;

CTEs (structure > spaghetti)

WITH filtered AS (
  SELECT * FROM staging.events
  WHERE event_ts >= DATEADD(day, -30, CURRENT_DATE)
),
deduped AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_ts DESC) AS rn
  FROM filtered
)
SELECT * FROM deduped WHERE rn = 1;

Conditional logic and nulls

SELECT
  order_id,
  COALESCE(discount_amount, 0) AS discount_amount,
  CASE
    WHEN order_total > 500 THEN 'VIP'
    WHEN order_total > 100 THEN 'PLUS'
    ELSE 'BASE'
  END AS tier
FROM analytics.orders;
  • Remember: NULL is “unknown.” NULL = NULL is not true; use IS NULL and COALESCE.
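
To see why this bites, here is a small sketch against the same orders table: rows with a NULL discount silently fall out of a plain comparison, and you have to decide whether “unknown” should count as zero or be surfaced on its own.

-- Rows with a NULL discount vanish here: NULL <> 0 evaluates to NULL, not TRUE
SELECT COUNT(*) AS discounted_orders
FROM analytics.orders
WHERE discount_amount <> 0;

-- Treat unknown as zero...
SELECT COUNT(*) AS discounted_orders
FROM analytics.orders
WHERE COALESCE(discount_amount, 0) <> 0;

-- ...or surface the unknowns explicitly
SELECT COUNT(*) AS missing_discounts
FROM analytics.orders
WHERE discount_amount IS NULL;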

DDL/DML you’ll use weekly

Create tables with enforceable contracts

CREATE TABLE analytics.orders (
  order_id       VARCHAR PRIMARY KEY,
  customer_id    VARCHAR NOT NULL,
  order_total    DECIMAL(12,2) NOT NULL CHECK (order_total >= 0),
  currency       VARCHAR(3) NOT NULL,
  order_ts       TIMESTAMP NOT NULL
);
  • Constraints encode business rules. Even warehouses that don’t fully enforce them can still use them for query optimization, and they double as documentation.

Upserts (incremental ELT)

  • Snowflake/BigQuery/SQL Server/Oracle: MERGE
  • Postgres: INSERT … ON CONFLICT DO UPDATE
  • MySQL/MariaDB: INSERT … ON DUPLICATE KEY UPDATE

Example (warehouse-friendly MERGE):

MERGE INTO analytics.orders AS tgt
USING staging.orders_incremental AS src
  ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
  customer_id  = src.customer_id,
  order_total  = src.order_total,
  currency     = src.currency,
  order_ts     = src.order_ts
WHEN NOT MATCHED THEN INSERT (
  order_id, customer_id, order_total, currency, order_ts
) VALUES (
  src.order_id, src.customer_id, src.order_total, src.currency, src.order_ts
);

Example (Postgres):

INSERT INTO analytics.orders AS tgt (order_id, customer_id, order_total, currency, order_ts)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (order_id) DO UPDATE
SET customer_id = EXCLUDED.customer_id,
    order_total = EXCLUDED.order_total,
    currency    = EXCLUDED.currency,
    order_ts    = EXCLUDED.order_ts;
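
Example (MySQL/MariaDB), a sketch assuming order_id is the table’s primary (or a unique) key:

-- ? placeholders are driver-level bind parameters, like $1..$5 above
-- VALUES() is the classic form; MySQL 8.0.20+ also offers row aliases
INSERT INTO analytics.orders (order_id, customer_id, order_total, currency, order_ts)
VALUES (?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
  customer_id = VALUES(customer_id),
  order_total = VALUES(order_total),
  currency    = VALUES(currency),
  order_ts    = VALUES(order_ts);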

Performance & cost: what actually moves the needle

Filter and prune early

  • Always filter on partition/cluster keys when available (see the sketch after this list):
    • BigQuery: WHERE _PARTITIONDATE >= '2025-01-01'
    • Snowflake: filter on natural partition columns (e.g., order_ts); clustering keys help on very large tables.
    • Redshift: choose sort/dist keys aligned with common joins/filters.
    • Delta Lake (Databricks/Spark SQL): partition by date or high-level tenant, not by ultra-high cardinality columns.
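
As a sketch of the same 30-day filter with pruning in mind (order_date as the BigQuery partition column is an assumption; adapt to however the table is actually partitioned):

-- BigQuery: filter directly on the partitioning column so only recent partitions are scanned
SELECT order_id, order_total
FROM analytics.orders
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- Snowflake: filter on the column the data is naturally ordered/clustered by
SELECT order_id, order_total
FROM analytics.orders
WHERE order_ts >= DATEADD(day, -30, CURRENT_DATE);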

Avoid SELECT * in production

Scan cost and network I/O go up; columnar engines can skip reading a column only if you don’t select it.

Use the right aggregate shape

  • Replace distinct counts on huge tables with pre-aggregations or sketches (HLL in BigQuery, approximate functions in Trino/Presto).
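
For example, an approximate distinct-customer count; APPROX_COUNT_DISTINCT exists in BigQuery and Snowflake, and Trino/Presto spells it approx_distinct:

-- HLL-based approximation instead of an exact COUNT(DISTINCT ...)
SELECT APPROX_COUNT_DISTINCT(customer_id) AS approx_customers
FROM analytics.orders;

-- Trino/Presto equivalent
SELECT approx_distinct(customer_id) AS approx_customers
FROM analytics.orders;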

Window functions: be explicit

  • Define PARTITION BY and ORDER BY carefully—these trigger heavy sorts/shuffles.

Indexes vs layout

  • OLTP (Postgres/MySQL/SQL Server/Oracle): create indexes for frequent filters/joins.
  • OLAP (Snowflake/BigQuery/Redshift/Spark): you don’t create B-Tree indexes; you design partitions/clustering/sort keys and file sizes.
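
On the OLTP side, a minimal Postgres sketch; the columns mirror the filters and joins used throughout this article:

-- Support the customer join and the recent-orders filter on the application database
CREATE INDEX IF NOT EXISTS idx_orders_customer_id ON orders (customer_id);
CREATE INDEX IF NOT EXISTS idx_orders_order_ts    ON orders (order_ts);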

Concurrency & transactions

  • Warehouses are great at read concurrency; OLTP engines handle tight transactions. Don’t run nightly ELT on your app database unless you enjoy being paged.

Real-world pattern: building a reliable orders fact

Goal: Materialize fact_orders daily with incremental updates, deduplication, and schema guarantees.

  1. Land raw to staging.orders_raw (append-only).
  2. Deduplicate to one row per order_id, keeping the latest updated_ts, using a window:
-- Temp views are session-scoped; most engines do not allow a schema qualifier here
CREATE OR REPLACE TEMP VIEW orders_dedup AS
SELECT *
FROM (
  SELECT r.*,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_ts DESC) AS rn
  FROM staging.orders_raw r
) AS ranked
WHERE rn = 1;
  3. Validate with SQL constraints (and/or Great Expectations for richer checks); see the check sketch after this list.
  4. Upsert into analytics.orders with MERGE (shown earlier), using the deduplicated view as the source.
  5. Publish a consumption view:
CREATE OR REPLACE VIEW marts.fact_orders AS
SELECT
  o.order_id,
  o.customer_id,
  o.order_total,
  o.currency,
  o.order_ts,
  DATE_TRUNC('day', o.order_ts) AS order_day
FROM analytics.orders o;
  6. Downstream models now read from a stable, documented view—no one touches the staging tables.
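
For step 3, a minimal plain-SQL check sketch against the deduplicated view; each branch returns a row only when the corresponding rule is violated, so an empty result means the batch passes:

-- Any returned row is a contract violation
SELECT 'null_order_id' AS failed_check, COUNT(*) AS bad_rows
FROM orders_dedup
WHERE order_id IS NULL
HAVING COUNT(*) > 0

UNION ALL

SELECT 'negative_order_total', COUNT(*)
FROM orders_dedup
WHERE order_total < 0
HAVING COUNT(*) > 0;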

Common pitfalls (seen in the wild)

  • Duplicate rows after joins — you joined a one-to-many without aggregating the many side. Fix with GROUP BY before the join, or dedupe with ROW_NUMBER().
  • Implicit type casts — WHERE id = '123' against a numeric id may bypass indexes or force full scans. Match types.
  • Null logic mistakes — remember NULL propagates; wrap with COALESCE.
  • COUNT(*) on massive tables — prefer metadata tables or approximations in warehouses (example below).
  • CTEs that materialize — some engines materialize each CTE; others inline them. Profile, and refactor to subqueries if needed.
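
For the COUNT(*) point, warehouses usually expose row counts as metadata; in Snowflake, for example:

-- Read the metadata-maintained row count instead of scanning the table
SELECT row_count
FROM information_schema.tables
WHERE table_schema = 'ANALYTICS'
  AND table_name   = 'ORDERS';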

Choosing the right engine (cheat sheet)

| Engine | Best For | Avoid When | Notes |
|---|---|---|---|
| Snowflake | ELT, shared data, mixed workloads | Ultra-low latency streaming | Clustering keys help very large tables; Snowpark for UDFs |
| BigQuery | Huge scans, serverless, GCP-native | High-frequency small updates | Partition & cluster; beware per-query costs |
| Redshift | AWS-aligned MPP | Bursty workloads without RA3 or autoscaling | Tune sort/dist keys |
| Postgres | App data, small/medium analytics | Petabyte-scale scans | Extensions are gold (PostGIS, Timescale) |
| MySQL/MariaDB | App OLTP | Complex analytics | Great for reads with replicas |
| SQL Server | Enterprise OLTP+BI | Cloud-agnostic needs | T-SQL is powerful; SSIS/SSRS ecosystem |
| Oracle | Enterprise OLTP | Cost sensitivity | Very mature optimizer |
| Databricks/Spark SQL | Lakehouse, streaming + batch | Low-latency OLTP | Delta Lake + medallion works well |
| Trino/Presto | Federated analytics | Heavy DML | Great for interactive BI over many sources |
| DuckDB | Local analytics, CI tests | Multi-user concurrency | Perfect for notebook exploration |
| SQLite | Local apps, prototyping | Multi-writer workloads | Zero-ops, single file |

Best practices that pay off immediately

  • Model in layers: staging → cleaned → marts (a.k.a. bronze/silver/gold).
  • Version your models: use dbt or SQL files in Git; code review SQL like Python.
  • Document contracts: column types, nullability, allowed values; enforce with constraints where possible.
  • Test with data: not just unit tests—use row-count checks, not-null checks, referential integrity tests.
  • Keep business logic in views/models: not sprinkled across apps and dashboards.
  • Profile queries: learn your engine’s EXPLAIN and query profile UI.
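
As one cheap “test with data” example, a row-count reconciliation between the published mart and its source; run it after each build and alert if the numbers drift:

-- fact_orders is a straight view over analytics.orders, so the counts should match
SELECT
  (SELECT COUNT(*) FROM analytics.orders)  AS source_rows,
  (SELECT COUNT(*) FROM marts.fact_orders) AS mart_rows;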

Wrap-up

SQL is the lingua franca of data engineering because it’s declarative, portable, and runs at the source of truth. Master a warehouse dialect (Snowflake or BigQuery), stay fluent in Postgres, and know your way around lakehouse SQL (Databricks/Spark SQL or Trino). With that, you can build reliable pipelines, faster.