SQL for Data Engineers: The Language Behind Snowflake, BigQuery, Postgres, and Every Reliable Pipeline
A practical guide to SQL for data engineers: core patterns, window functions, incremental models, performance tips, and where SQL runs—Snowflake to DuckDB.
Why this matters
Every broken dashboard I’ve ever seen boils down to one of two things: bad data, or logic scattered across five code paths. SQL solves the second problem—by putting business logic where the data lives. If you build data models, materialize features, or run ELT in a modern warehouse, SQL is non-negotiable.
This article gives you the essentials data engineers actually use: clear concepts, copy-pasteable patterns, and which engines (Snowflake, BigQuery, Redshift, Postgres, MySQL, SQL Server, Oracle, Databricks/Spark SQL, Trino/Presto, DuckDB, SQLite, MariaDB, Teradata) fit which jobs.
Where SQL runs (and why you’d pick each)
Cloud warehouses (OLAP)
- Snowflake — Separation of storage/compute, automatic micro-partitioning, great concurrency. Strong for ELT, secure data sharing, Snowpark UDFs.
- BigQuery — Serverless MPP, partition/cluster at table level, BI Engine for speed. Excellent for huge analytic scans.
- Amazon Redshift — Classic MPP with sort/dist keys; good when you’re already deep in AWS and want predictable clusters.
Relational databases (OLTP + analytics on smaller data)
- PostgreSQL — Rich SQL + extensions (PostGIS, Timescale), great indexes and CTEs. Workhorse for application data and modest analytics.
- MySQL / MariaDB — Massive adoption for apps; fine for light reporting.
- SQL Server (T-SQL) — Enterprise features, SSIS/SSRS ecosystem.
- Oracle — Deep enterprise features, very mature optimizer.
- SQLite — Zero-install file DB; perfect for local dev and tiny jobs.
Query engines & lakehouses
- Databricks / Spark SQL — SQL on Delta Lake; great with medallion architecture; pairs with notebooks and jobs.
- Trino/Presto — Federated queries across many sources; popular for interactive analytics at scale.
- DuckDB — In-process columnar OLAP; amazing for local analytics, CI checks, fast prototyping.
Rule of thumb: Use warehouses/lakehouses for heavy analytics and ELT; Postgres/MySQL for app data and quick reporting; DuckDB for local dev/validation.
Core SQL concepts (no fluff)
Tables, rows, and relations
SQL is a declarative language: you describe what you want, and the engine decides how. The power comes from relations (tables/views) and set operations.
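Set operators make reconciliation checks one-liners. A minimal sketch, comparing the staging table used later in the upsert example against the warehouse table (BigQuery spells the operator EXCEPT DISTINCT):

-- Orders that landed in staging but are missing from the warehouse table
SELECT order_id FROM staging.orders_incremental
EXCEPT
SELECT order_id FROM analytics.orders;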
SELECT the minimum viable data
-- Only select the columns you need; avoid SELECT *
SELECT order_id, customer_id, order_total, order_ts
FROM analytics.orders
WHERE order_ts >= DATEADD(day, -7, CURRENT_DATE);
Joins you actually use
-- Left join keeps all orders even if there’s no customer match
SELECT o.order_id, o.order_total, c.segment
FROM analytics.orders AS o
LEFT JOIN analytics.customers AS c
ON o.customer_id = c.customer_id;
- INNER = intersection
- LEFT = “keep the fact table; enrich if possible”
- ANTI (NOT EXISTS) = find missing keys (great for QA)
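A minimal anti-join sketch, using the same orders and customers tables as above, that surfaces orders whose customer is missing:

-- Orders with no matching customer row (handy as a QA check)
SELECT o.order_id, o.customer_id
FROM analytics.orders AS o
WHERE NOT EXISTS (
    SELECT 1
    FROM analytics.customers AS c
    WHERE c.customer_id = o.customer_id
);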
Aggregations and grouping
SELECT DATE_TRUNC('day', order_ts) AS day, SUM(order_total) AS revenue
FROM analytics.orders
GROUP BY 1
ORDER BY 1;
Window functions (your superpower)
-- Running 7-day revenue per customer and per-day customer ranking by spend
WITH daily AS (
    SELECT DATE_TRUNC('day', order_ts) AS day, customer_id, SUM(order_total) AS rev
    FROM analytics.orders
    GROUP BY 1, 2
)
SELECT
    day,
    customer_id,
    rev,
    -- ROWS counts rows, not calendar days: with gaps, this is the customer's last 7 active days
    SUM(rev) OVER (PARTITION BY customer_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rev_7d,
    DENSE_RANK() OVER (PARTITION BY day ORDER BY rev DESC) AS rank_by_day
FROM daily;
CTEs (structure > spaghetti)
WITH filtered AS (
    SELECT * FROM staging.events
    WHERE event_ts >= DATEADD(day, -30, CURRENT_DATE)
),
deduped AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_ts DESC) AS rn
    FROM filtered
)
SELECT * FROM deduped WHERE rn = 1;
Conditional logic and nulls
SELECT
order_id,
COALESCE(discount_amount, 0) AS discount_amount,
CASE
WHEN order_total > 500 THEN 'VIP'
WHEN order_total > 100 THEN 'PLUS'
ELSE 'BASE'
END AS tier
FROM analytics.orders;
- Remember: NULL is “unknown.” NULL = NULL is not true; use IS NULL and COALESCE.
DDL/DML you’ll use weekly
Create tables with enforceable contracts
CREATE TABLE analytics.orders (
    order_id     VARCHAR PRIMARY KEY,
    customer_id  VARCHAR NOT NULL,
    order_total  DECIMAL(12,2) NOT NULL CHECK (order_total >= 0),
    currency     VARCHAR(3) NOT NULL,
    order_ts     TIMESTAMP NOT NULL
);
- Constraints encode business rules. Even warehouses that don’t fully enforce them can use them for optimizers and documentation.
Upserts (incremental ELT)
- Snowflake/BigQuery/SQL Server/Oracle: MERGE
- Postgres: INSERT … ON CONFLICT DO UPDATE
- MySQL/MariaDB: INSERT … ON DUPLICATE KEY UPDATE
Example (warehouse-friendly MERGE):
MERGE INTO analytics.orders AS tgt
USING staging.orders_incremental AS src
    ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
    customer_id = src.customer_id,
    order_total = src.order_total,
    currency    = src.currency,
    order_ts    = src.order_ts
WHEN NOT MATCHED THEN INSERT (
    order_id, customer_id, order_total, currency, order_ts
) VALUES (
    src.order_id, src.customer_id, src.order_total, src.currency, src.order_ts
);
Example (Postgres):
INSERT INTO analytics.orders AS tgt (order_id, customer_id, order_total, currency, order_ts)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (order_id) DO UPDATE
SET customer_id = EXCLUDED.customer_id,
order_total = EXCLUDED.order_total,
currency = EXCLUDED.currency,
order_ts = EXCLUDED.order_ts;
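And a MySQL/MariaDB sketch of the same upsert, assuming order_id is the table's primary (or unique) key as in the DDL above; the VALUES() form shown here still works widely, though newer MySQL releases prefer row aliases:

INSERT INTO analytics.orders (order_id, customer_id, order_total, currency, order_ts)
VALUES (?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
    customer_id = VALUES(customer_id),
    order_total = VALUES(order_total),
    currency    = VALUES(currency),
    order_ts    = VALUES(order_ts);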
Performance & cost: what actually moves the needle
Filter and prune early
- Always filter on partition/cluster keys when available:
  - BigQuery: WHERE _PARTITIONDATE >= '2025-01-01'
  - Snowflake: filter on natural partition columns (e.g., order_ts); clustering keys help long, wide tables.
  - Redshift: choose sort/dist keys aligned with common joins/filters.
  - Delta Lake (Databricks/Spark SQL): partition by date or a high-level tenant, not by ultra-high-cardinality columns.
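As a sketch of setting pruning up in the first place (BigQuery-flavored; the orders_partitioned name is just illustrative), partition on the date column you filter by and cluster on a common join key:

CREATE TABLE analytics.orders_partitioned
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id AS
SELECT * FROM analytics.orders;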
Avoid SELECT * in production
Scan cost and network I/O go up; columnar engines can only skip reading the columns your query doesn’t ask for.
Use the right aggregate shape
- Replace distinct counts on huge tables with pre-aggregations or sketches (HLL in BigQuery, approximate functions in Trino/Presto).
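For instance, most warehouse dialects ship an approximate distinct count (APPROX_COUNT_DISTINCT in BigQuery and Snowflake, approx_distinct in Trino/Presto); a sketch in the Snowflake-style syntax used elsewhere in this article:

-- Approximate unique customers per day; trades exactness for a much cheaper scan
SELECT DATE_TRUNC('day', order_ts) AS day,
       APPROX_COUNT_DISTINCT(customer_id) AS approx_customers
FROM analytics.orders
GROUP BY 1;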
Window functions: be explicit
- Define PARTITION BY and ORDER BY carefully—these trigger heavy sorts/shuffles.
Indexes vs layout
- OLTP (Postgres/MySQL/SQL Server/Oracle): create indexes for frequent filters/joins.
- OLAP (Snowflake/BigQuery/Redshift/Spark): you don’t create B-Tree indexes; you design partitions/clustering/sort keys and file sizes.
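A sketch of both sides, reusing the orders table from earlier (the index name is illustrative):

-- OLTP (Postgres): a B-tree index to speed up frequent customer lookups/joins
CREATE INDEX idx_orders_customer_id ON analytics.orders (customer_id);

-- OLAP (Snowflake): no B-tree indexes; declare a clustering key on the common filter column instead
ALTER TABLE analytics.orders CLUSTER BY (order_ts);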
Concurrency & transactions
- Warehouses are great at read concurrency; OLTP engines handle tight transactions. Don’t run nightly ELT on your app database unless you enjoy paging.
Real-world pattern: building a reliable orders fact
Goal: Materialize fact_orders daily with incremental updates, deduplication, and schema guarantees.
- Land raw data in staging.orders_raw (append-only).
- Deduplicate on order_id, keeping the latest updated_ts, using a window:
CREATE OR REPLACE TEMP VIEW staging.orders_dedup AS
SELECT *
FROM (
    SELECT r.*,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_ts DESC) AS rn
    FROM staging.orders_raw r
) ranked  -- derived-table alias; required by Postgres, harmless elsewhere
WHERE rn = 1;
- Validate with SQL constraints (and/or Great Expectations for richer checks); see the sample checks after this list.
- Upsert into analytics.orders with MERGE (shown earlier).
- Publish a consumption view:
CREATE OR REPLACE VIEW marts.fact_orders AS
SELECT
    o.order_id,
    o.customer_id,
    o.order_total,
    o.currency,
    o.order_ts,
    DATE_TRUNC('day', o.order_ts) AS order_day
FROM analytics.orders o;
- Downstream models now read from a stable, documented view—no one touches the staging tables.
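The validation step can be plain SQL assertions; a minimal sketch (wire the results into whatever your orchestrator treats as a failure):

-- Any non-zero result here should fail the load
SELECT COUNT(*) AS null_customer_ids
FROM analytics.orders
WHERE customer_id IS NULL;

-- Duplicate keys that survived deduplication
SELECT order_id, COUNT(*) AS dup_count
FROM analytics.orders
GROUP BY order_id
HAVING COUNT(*) > 1;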
Common pitfalls (seen in the wild)
- Duplicate rows after joins — you joined a one-to-many without aggregating the many side. Fix with GROUP BY before the join, or dedupe with ROW_NUMBER().
- Implicit type casts — WHERE id = '123' may bypass indexes or cause scans. Match types.
- Null logic mistakes — remember NULL propagates; wrap with COALESCE.
- COUNT(*) on massive tables — prefer metadata tables or approximations in warehouses (see the metadata query below).
- CTEs that materialize — some engines (not all) materialize each CTE; in others they inline. Profile and refactor to subqueries if needed.
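For the COUNT(*) pitfall, a Snowflake-flavored sketch that reads row counts from metadata instead of scanning the table (other engines expose similar catalogs):

-- Metadata-only row counts; no table scan
SELECT table_name, row_count
FROM information_schema.tables
WHERE table_schema = 'ANALYTICS'
  AND table_type = 'BASE TABLE';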
Choosing the right engine (cheat sheet)
| Engine | Best For | Avoid When | Notes |
|---|---|---|---|
| Snowflake | ELT, shared data, mixed workloads | Ultra-low latency streaming | Clustering keys help very large tables; Snowpark for UDFs |
| BigQuery | Huge scans, serverless, GCP-native | High-frequency small updates | Partition & cluster; beware per-query costs |
| Redshift | AWS-aligned MPP | Bursty workloads without RA3 or autoscaling | Tune sort/dist keys |
| Postgres | App data, small/medium analytics | Petabyte-scale scans | Extensions are gold (PostGIS, Timescale) |
| MySQL/MariaDB | App OLTP | Complex analytics | Great for reads with replicas |
| SQL Server | Enterprise OLTP+BI | Cloud-agnostic needs | T-SQL is powerful; SSIS/SSRS ecosystem |
| Oracle | Enterprise OLTP | Cost sensitivity | Very mature optimizer |
| Databricks/Spark SQL | Lakehouse, streaming + batch | Low-latency OLTP | Delta Lake + medallion works well |
| Trino/Presto | Federated analytics | Heavy DML | Great for interactive BI over many sources |
| DuckDB | Local analytics, CI tests | Multi-user concurrency | Perfect for notebook exploration |
| SQLite | Local apps, prototyping | Multi-writer workloads | Zero-ops, single file |
Best practices that pay off immediately
- Model in layers: staging → cleaned → marts (a.k.a. bronze/silver/gold).
- Version your models: use dbt or SQL files in Git; code review SQL like Python.
- Document contracts: column types, nullability, allowed values; enforce with constraints where possible.
- Test with data: not just unit tests—use row-count checks, not-null checks, referential integrity tests.
- Keep business logic in views/models: not sprinkled across apps and dashboards.
- Profile queries: learn your engine’s EXPLAIN and query profile UI.
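A Postgres-flavored sketch (Snowflake and BigQuery expose the same idea through their EXPLAIN variants and query profile UIs):

-- Show the chosen plan plus actual timings and row counts
EXPLAIN ANALYZE
SELECT customer_id, SUM(order_total) AS revenue
FROM analytics.orders
WHERE order_ts >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY customer_id;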
Wrap-up
SQL is the lingua franca of data engineering because it’s declarative, portable, and runs at the source of truth. Master a warehouse dialect (Snowflake or BigQuery), stay fluent in Postgres, and know your way around lakehouse SQL (Databricks/Spark SQL or Trino). With that, you can build reliable pipelines, faster.




