PostgreSQL at Scale for Data Engineers: Partitioning, Indexes, and Query Plans That Actually Work
Why this matters
You ship pipelines, not whitepapers. When a fact table jumps from 50M to 5B rows, “we’ll add an index later” turns into 2 a.m. firefighting. PostgreSQL can absolutely power serious analytics and mixed workloads—if you respect its planner, choose the right indexes, and partition with intent. This guide is the pragmatic, mid-level path from “it runs” to “it scales.”
PostgreSQL mental model (for builders)
Think of Postgres as:
- Row store with rich types, transactions, MVCC, and a very smart cost-based planner.
- Disk-first: I/O patterns, statistics, and visibility maps dominate performance.
- Modular: extensions (e.g., pg_stat_statements), FDWs, logical replication.
- Honest: if your stats or indexes are wrong, it will do exactly what you told it to, even if that's a table scan from hell.
Query planning you can trust (and verify)
If you don’t read plans, you’re guessing.
-- Always check actual timing, buffers, and parallelism decisions
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT user_id, sum(amount)
FROM payments
WHERE created_at >= now() - interval '7 days'
GROUP BY 1;
What to look for:
- Seq Scan vs Index Scan: did Postgres use the right path?
- Rows: estimated vs actual. If they differ by 10× or more, your stats need work (ANALYZE, default_statistics_target); see the sketch after this list.
- Buffers: how much I/O? If it's all shared reads, improve your cache hit ratio or narrow the scans.
- Parallel workers: did it go parallel? If not, is max_parallel_workers_per_gather too low, or is the plan not parallel-safe?
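When estimates are badly off, richer statistics are usually the fix. A minimal sketch against the payments table from the example above; the target of 500 is illustrative, not a recommendation:
ALTER TABLE payments ALTER COLUMN created_at SET STATISTICS 500;  -- finer-grained stats for the skewed column
ANALYZE payments;                                                  -- refresh the stats immediately
-- Re-run EXPLAIN (ANALYZE) and compare estimated vs actual rows again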
The index types that matter (and when)
| Index | Best for | Avoid when | Notes |
|---|---|---|---|
| B-tree (default) | Equality/range on sortable columns | Full-text, arrays, JSONB containment | The workhorse for most OLTP/OLAP filters and joins |
| Hash | Pure equality | Anything else | Rarely needed; B-tree often wins |
| GIN | Full-text (tsvector), arrays, JSONB keys | Heavy writes | Great for @>, ?, text search |
| GiST | Spatial, ranges, nearest-neighbor | High-churn tables | Used by PostGIS; flexible tree |
| BRIN | Massive, naturally ordered data (timestamps, IDs) | Small or randomly ordered tables | Tiny indexes; skip irrelevant block ranges |
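The tips below cover B-tree and GIN on JSONB; for the full-text row in the table, a GIN index over a tsvector expression is the usual pattern. A minimal sketch; the articles table and its body column are hypothetical:
-- Expression index: queries must repeat the same expression to use it
CREATE INDEX articles_fts_idx ON articles USING GIN (to_tsvector('english', body));
SELECT id
FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'partition & pruning');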
Practical tips
- Composite indexes follow leftmost-prefix rules:
  CREATE INDEX ON events (tenant_id, created_at DESC);
  -- Helps: WHERE tenant_id = ? AND created_at > ?
  -- Won't help: WHERE created_at > ? alone
- Add covering columns to avoid heap lookups:
  CREATE INDEX ON orders (customer_id, status) INCLUDE (total_amount);
- For JSONB, index the paths you filter on:
  CREATE INDEX ON sessions USING GIN (payload jsonb_path_ops);
  -- Speeds up containment: WHERE payload @> '{"country":"US"}'
  -- Note: jsonb_path_ops doesn't support the ? (key-exists) operator; use the default jsonb_ops opclass for that.
Partitioning that pays for itself
Partition when one or more are true:
- Table > ~200–300 GB or > ~1–2 B rows.
- Predicates are time-bounded or tenant-bounded.
- You routinely drop old data.
Range (time) partitioning
CREATE TABLE fact_events (
event_ts timestamptz NOT NULL,
tenant_id int NOT NULL,
payload jsonb
) PARTITION BY RANGE (event_ts);
CREATE TABLE fact_events_2025_11 PARTITION OF fact_events
FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
-- Add next month ahead of time
CREATE TABLE fact_events_2025_12 PARTITION OF fact_events
FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');
Why it works
- Pruning: planner touches only relevant partitions.
- Maintenance: VACUUM/ANALYZE stats stay accurate per partition.
- Lifecycle: drop old partitions fast (a metadata-only operation).
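To confirm pruning actually happens, read the plan: only the matching partition should appear. A quick sketch against fact_events as defined above:
-- The plan should scan only fact_events_2025_11
EXPLAIN (COSTS OFF)
SELECT count(*)
FROM fact_events
WHERE event_ts >= '2025-11-10' AND event_ts < '2025-11-12';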
Gotchas
- Keep a default partition to catch stragglers (see the sketch below).
- Ensure the partition key appears in your common filters, or pruning won't trigger.
- Index each partition, or create the index once on the partitioned parent (PostgreSQL 11+) and let it cascade to every partition; Postgres has no true global indexes.
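A minimal sketch of both fixes, reusing the fact_events table from above:
-- Catch rows that fall outside every defined range
CREATE TABLE fact_events_default PARTITION OF fact_events DEFAULT;
-- Defined once on the parent, materialized on each partition (PostgreSQL 11+)
CREATE INDEX ON fact_events (tenant_id, event_ts);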
Ingestion that doesn’t melt disks
Use COPY, not a million inserts
COPY staging_payments FROM '/data/payments_2025_11.csv' CSV HEADER;
-- Or from app host:
\copy staging_payments FROM 'payments.csv' CSV HEADER
Upsert into the target (ON CONFLICT requires a unique constraint on id)
INSERT INTO payments (id, user_id, amount, created_at)
SELECT id, user_id, amount, created_at
FROM staging_payments
ON CONFLICT (id) DO UPDATE
SET amount = EXCLUDED.amount;
Throughput boosters
- Disable synchronous commit for bulk loads (SET LOCAL only takes effect inside a transaction; see the sketch after this list): SET LOCAL synchronous_commit = OFF;
- Pre-sort input by partition key to improve locality.
- Keep indexes minimal on hot write tables; add read-only indexes post-load or off-peak.
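Putting the first two boosters together: a minimal sketch that reuses the staging_payments table and CSV path from the COPY example above.
BEGIN;
SET LOCAL synchronous_commit = OFF;  -- reverts automatically at COMMIT
COPY staging_payments FROM '/data/payments_2025_11.csv' CSV HEADER;
COMMIT;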
Replication & HA you won’t regret
- Physical (streaming) replication: byte-level, great for HA and read-only replicas.
- Logical replication: table-level, schema-aware, supports selective replication and version-skewed upgrades.
-- On publisher
CREATE PUBLICATION pub_payments FOR TABLE payments;
-- On subscriber
CREATE SUBSCRIPTION sub_payments
CONNECTION 'host=primary dbname=app user=replicator password=***'
PUBLICATION pub_payments;
Use logical for: blue/green cutovers, feeding downstream services, multi-tenant splits.
Use physical for: simple HA, read scaling, disaster recovery.
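Whichever flavor you run, watch replication lag. A quick check from the primary for physical streaming replicas (logical subscriptions report via pg_stat_subscription on the subscriber):
-- Bytes of WAL each replica has yet to replay
SELECT client_addr, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;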
Observability and tuning that move the needle
Always install these extensions
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- for LIKE/ILIKE on text
Daily habits
- Track top normalized queries (pg_stat_statements).
- Pin a slow-query budget (e.g., p95 < 200 ms) and tune to it.
- Vacuum debt is real: monitor pg_stat_all_tables.vacuum_count, bloat estimates, and autovacuum lag.
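A starting query for the first habit; these column names are PostgreSQL 13+ (older releases call them total_time and mean_time):
-- Ten most expensive normalized queries by cumulative time
SELECT left(query, 60) AS query,
       calls,
       round(mean_exec_time::numeric, 1) AS mean_ms,
       shared_blks_read
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;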
Sensible starting points (not dogma)
| Setting | Why it matters | Starting heuristic |
|---|---|---|
| work_mem | Sort/hash memory per plan node | 32–128 MB; size to avoid disk spills in typical plans |
| shared_buffers | Page cache | 25% of RAM (capped sensibly) |
| maintenance_work_mem | VACUUM/CREATE INDEX | 512 MB–2 GB during maintenance windows |
| effective_cache_size | Planner hint about available cache | ~50–75% of RAM |
| max_parallel_workers_per_gather | Parallelism | 2–4 for analytics nodes |
| autovacuum_vacuum_scale_factor | Bloat control | Lower for big tables (e.g., 0.05) |
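To apply a heuristic cluster-wide without hand-editing postgresql.conf, ALTER SYSTEM plus a reload works for these settings; the 64MB and 256MB values below are illustrative, not recommendations:
-- Persists to postgresql.auto.conf; work_mem needs only a reload, not a restart
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();
-- Or scope a bigger value to one session for a single heavy query
SET work_mem = '256MB';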
Brutal truth: if you never read EXPLAIN (ANALYZE, BUFFERS) and never tune autovacuum, your “scaling problem” is self-inflicted.
Real example: tenant-sharded + time-partitioned fact table
-- 1) Parent table
CREATE TABLE fact_pageviews (
ts timestamptz NOT NULL,
tenant_id int NOT NULL,
user_id bigint,
page text,
attrib jsonb,
PRIMARY KEY (tenant_id, ts, user_id)
) PARTITION BY RANGE (ts);
-- 2) Monthly partitions + per-partition indexes
-- (repeat per month; automate this; see the sketch below)
CREATE TABLE fact_pageviews_2025_11 PARTITION OF fact_pageviews
FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
CREATE INDEX ON fact_pageviews_2025_11 (tenant_id, ts DESC);
CREATE INDEX ON fact_pageviews_2025_11 USING BRIN (ts);
-- 3) Query pattern that prunes + uses index
EXPLAIN (ANALYZE, BUFFERS)
SELECT tenant_id, date_trunc('day', ts) AS d, count(*) AS views
FROM fact_pageviews
WHERE tenant_id = 42
AND ts >= '2025-11-15' AND ts < '2025-11-22'
GROUP BY 1, 2;
Why this scales
- Tenant filter hits the leading index column.
- Time filter prunes to one partition, and BRIN skips irrelevant block ranges within it.
- Daily rollups remain index-friendly without full table scans.
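The comment in step 2 says to automate partition creation; a minimal sketch that creates next month's partition for fact_pageviews ahead of time (run it from cron or your scheduler):
-- Create next month's partition if it doesn't exist yet
DO $$
DECLARE
  next_month date := (date_trunc('month', now()) + interval '1 month')::date;
  part_name  text := format('fact_pageviews_%s', to_char(next_month, 'YYYY_MM'));
BEGIN
  EXECUTE format(
    'CREATE TABLE IF NOT EXISTS %I PARTITION OF fact_pageviews
       FOR VALUES FROM (%L) TO (%L)',
    part_name, next_month, (next_month + interval '1 month')::date);
  -- Remember the per-partition indexes from step 2 (or define them on the parent instead)
END $$;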
Common pitfalls (seen in the wild)
- Singleton inserts for bulk loads → use COPY or batched inserts.
- JSONB everywhere → model hot filters as proper columns; index only what you query.
- “One giant index to rule them all” → composite indexes tuned to query shapes beat kitchen-sink indexes.
- No partition lifecycle → forgetting to create/drop partitions turns pruning into pain.
- Vacuum starvation → long-running transactions block cleanup and cause bloat; watch for sessions stuck in “idle in transaction” (see the query after this list).
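A quick way to spot those offenders, straight from pg_stat_activity:
-- Idle transactions, oldest first; these hold back vacuum cleanup
SELECT pid, usename,
       now() - xact_start AS xact_age,
       left(query, 80) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_age DESC;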
Quick checklist for new tables
- Define primary access patterns (filters, joins, sort).
- Choose minimal composite indexes to satisfy them.
- Decide partitioning (usually monthly by time for event data).
- Wire EXPLAIN into CI for heavy queries.
- Install and review pg_stat_statements weekly.
- Automate partition creation and retention.
- Confirm autovacuum thresholds fit table churn.
Internal link ideas (official docs only)
- Table Partitioning (official docs)
- Indexes and Index Types (official docs)
- Query Planning & EXPLAIN (official docs)
- Parallel Query (official docs)
- COPY From/To (official docs)
- Logical Replication (official docs)
- pg_stat_statements (official docs)
- Routine Vacuuming & Autovacuum (official docs)
(When publishing, link each bullet to the corresponding page on postgresql.org.)
Summary & call to action
PostgreSQL isn’t “just OLTP.” With sane indexes, partitioning, and a habit of reading plans, it comfortably powers multi-billion-row warehouses, event pipelines, and mixed workloads. Start with your query shapes, keep your stats honest, and automate the boring lifecycle knobs.
CTA: Pick one painful query today. Run EXPLAIN (ANALYZE, BUFFERS), add the one index it truly needs, and set up pg_stat_statements to keep yourself honest next week.
Image prompt (for DALL·E / Midjourney)
“A clean, modern data architecture diagram illustrating a PostgreSQL cluster with partitioned tables, read replicas, and query plans flowing through indexes — minimalistic, high contrast, 3D isometric style.”
Tags
#PostgreSQL #DataEngineering #DatabasePerformance #Indexes #Partitioning #EXPLAIN #Replication #Analytics #Scalability
Pitch ideas (choose one next)
- “Parallel Query and Work Mem: Making Postgres Use All Those Cores” — practical knobs and before/after plans.
- “Designing Composite Indexes in PostgreSQL: A Data Engineer’s Playbook” — deep on leftmost rules, INCLUDE columns, and plan reading.
- “From Heap to Healthy: Vacuum, Bloat, and MVCC Explained for Humans” — practical autovacuum tuning with measurable thresholds.
- “PostgreSQL Partitioning in Practice: Monthly, Hash, and Hybrid Patterns” — blueprints + automation snippets.
- “Logical vs Physical Replication: Blue/Green, Zero-Downtime Upgrades, and Read Scaling” — real cutover runbooks.
- “Ingesting a Billion Rows: COPY, Staging Tables, and Idempotent Upserts” — reproducible ingestion pipelines.
- “Query Plans that Don’t Lie: Building a Performance Review with pg_stat_statements” — set targets, track regressions.
- “JSONB Without Regret: When to Model vs Index Paths” — decision tree + costed examples.




