Query Plans that Don’t Lie: Building a Performance Review with pg_stat_statements
Hook: Your app got slower after the last release, and everyone’s got a theory: “it’s the network,” “it’s JIT,” “it’s that new index.” Opinions don’t fix latency. Plans and measurements do. In PostgreSQL, the combo of pg_stat_statements + disciplined plan reviews gives you a repeatable way to set performance targets, spot regressions, and prove the fix—without guesswork.
Why this matters
- Speed is a feature. Users feel p95, not averages.
- Regressions hide in plain sight. One small query change can dominate total database time.
- Query plans never lie—estimates do. Use EXPLAIN (ANALYZE, BUFFERS) to validate reality.
- pg_stat_statements is your X-ray. It fingerprints queries and aggregates time, I/O, and call counts across the cluster.
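For a first look, a sketch of a "top offenders" query (PG 13+ column names; pre-13 clusters use total_time/mean_time instead):
-- Top 10 statements by total DB time
SELECT queryid,
       left(query, 60) AS sample,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2) AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;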
Architecture: a lightweight “Performance Review” loop
- Collect: Enable pg_stat_statements. Snapshot its metrics every N minutes into a dedicated schema (perf, below).
- Target: Define SLOs per query family (p95 latency, total DB time budget, rows/second).
- Compare: Daily/weekly rollups. Flag WoW deltas > X% in p95 or total_exec_time.
- Inspect: For flagged queries, run EXPLAIN (ANALYZE, BUFFERS) on a representative input—plans don’t lie.
- Fix: Indexing, rewrite, planner nudges, stats refresh, or config changes (work_mem, jit, parallel).
- Prove: Re-run the same plan; watch the metric roll back below target.
- Prevent: Gate PRs with a small load-test plus a SQL baseline comparison.
This loop is simple enough for a single engineer, robust enough for an SRE team.
Setup: turn on pg_stat_statements
-- postgresql.conf (or ALTER SYSTEM SET)
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all -- include nested
pg_stat_statements.track_utility = off -- focus on DML/SELECT
pg_stat_statements.save = on
-- shared_preload_libraries (and pg_stat_statements.max) require a restart;
-- the rest reload. Then:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
Tip: pg_stat_statements normalizes queries automatically, replacing constants with $1-style placeholders, so parameterized variants share the same queryid fingerprint.
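After the restart, a quick sanity check that the extension is loaded and configured as intended:
-- All pg_stat_statements GUCs appear once the library is preloaded
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'pg_stat_statements%';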
What to watch (cheat sheet)
| Field (view) | What it tells you | Use it to… |
|---|---|---|
| queryid, query | Fingerprint + normalized SQL | Group and label “query families” |
| calls | Frequency | Separate “hot” from “rare” |
| total_exec_time | Time sink in ms | Find top offenders by total DB time |
| mean_exec_time, min/max | Averages & spread | Track shifts vs. baseline |
| rows | Rows returned | Spot row explosion |
| shared_blks_read/hit | Buffer activity | Distinguish I/O-bound vs. cache-hot |
| temp_blks_read/written | Temp files | Identify sort/hash spilling |
| wal_bytes | Write amplification | Evaluate write-heavy patterns |
Versions differ slightly; use your cluster’s columns accordingly.
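One way to see exactly what your cluster exposes:
-- List the columns of this cluster's pg_stat_statements view
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'pg_stat_statements'
ORDER BY ordinal_position;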
Create a durable metrics pipeline
1) Snapshot table + view
CREATE SCHEMA IF NOT EXISTS perf;
CREATE TABLE IF NOT EXISTS perf.pgss_snapshot (
captured_at timestamptz NOT NULL DEFAULT now(),
userid oid,
dbid oid,
queryid bigint,
query text,
calls bigint,
total_exec_time double precision,
mean_exec_time double precision,
rows bigint,
shared_blks_hit bigint,
shared_blks_read bigint,
temp_blks_read bigint,
temp_blks_written bigint,
wal_bytes numeric
);
-- Periodic snapshot (e.g., every 5–10 min via cron or pg_cron):
INSERT INTO perf.pgss_snapshot
SELECT now(), userid, dbid, queryid, query, calls,
total_exec_time, mean_exec_time, rows,
shared_blks_hit, shared_blks_read,
temp_blks_read, temp_blks_written, wal_bytes
FROM pg_stat_statements;
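If pg_cron is available, the snapshot can be scheduled in-database; a sketch (the job name and interval are arbitrary):
-- Snapshot every 10 minutes via pg_cron
SELECT cron.schedule('pgss-snapshot', '*/10 * * * *', $$
  INSERT INTO perf.pgss_snapshot
  SELECT now(), userid, dbid, queryid, query, calls,
         total_exec_time, mean_exec_time, rows,
         shared_blks_hit, shared_blks_read,
         temp_blks_read, temp_blks_written, wal_bytes
  FROM pg_stat_statements
$$);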
2) Daily rollup for trend analysis
CREATE MATERIALIZED VIEW IF NOT EXISTS perf.pgss_daily AS
WITH deltas AS (
  -- pg_stat_statements counters are cumulative since the last reset,
  -- so diff consecutive snapshots per query and clamp reset dips to zero
  SELECT captured_at, queryid, query,
         greatest(calls - lag(calls) OVER w, 0) AS calls,
         greatest(total_exec_time - lag(total_exec_time) OVER w, 0) AS exec_ms,
         greatest(shared_blks_read - lag(shared_blks_read) OVER w, 0) AS reads,
         greatest(shared_blks_hit - lag(shared_blks_hit) OVER w, 0) AS hits,
         greatest(temp_blks_written - lag(temp_blks_written) OVER w, 0) AS temp_writes
  FROM perf.pgss_snapshot
  WINDOW w AS (PARTITION BY userid, dbid, queryid ORDER BY captured_at)
)
SELECT date_trunc('day', captured_at) AS day,
       queryid,
       max(query) AS sample_query,
       sum(calls) AS calls,
       sum(exec_ms) AS total_exec_ms,
       -- p95 of per-interval mean latency (pg_stat_statements keeps no per-call histogram)
       percentile_disc(0.95) WITHIN GROUP (ORDER BY exec_ms / NULLIF(calls, 0)) AS p95_ms,
       sum(reads) AS reads,
       sum(hits) AS hits,
       sum(temp_writes) AS temp_writes
FROM deltas
GROUP BY 1, 2;
-- Refresh nightly:
REFRESH MATERIALIZED VIEW perf.pgss_daily;
3) Regression detector (week-over-week)
WITH curr AS (
  SELECT queryid,
         percentile_disc(0.95) WITHIN GROUP (ORDER BY p95_ms) AS curr_p95,
         sum(total_exec_ms) AS curr_total_ms
  FROM perf.pgss_daily
  WHERE day >= date_trunc('week', now())
  GROUP BY queryid
),
prev AS (
  SELECT queryid,
         percentile_disc(0.95) WITHIN GROUP (ORDER BY p95_ms) AS prev_p95,
         sum(total_exec_ms) AS prev_total_ms
  FROM perf.pgss_daily
  WHERE day >= date_trunc('week', now()) - interval '1 week'
    AND day < date_trunc('week', now())
  GROUP BY queryid
)
SELECT queryid,
       round(((curr_p95 - prev_p95) / NULLIF(prev_p95, 0) * 100)::numeric, 1) AS p95_delta_pct,
       round(((curr_total_ms - prev_total_ms) / NULLIF(prev_total_ms, 0) * 100)::numeric, 1) AS total_ms_delta_pct
FROM curr
JOIN prev USING (queryid)
WHERE curr_p95 > prev_p95 * 1.25
   OR curr_total_ms > prev_total_ms * 1.25
ORDER BY p95_delta_pct DESC NULLS LAST;
Start with a 25% threshold; tune to your noise level.
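For triage, it helps to save the detector as a view (perf.wow_regressions here, a name of our choosing) and join it to the curated label map described under Best practices below:
-- Human-readable regression report
-- (assumes perf.wow_regressions and perf.query_labels exist)
SELECT l.label, l.owner, r.p95_delta_pct, r.total_ms_delta_pct
FROM perf.wow_regressions r
JOIN perf.query_labels l USING (queryid)
ORDER BY r.total_ms_delta_pct DESC NULLS LAST;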
Make plans tell the truth
When a query flags as regressed, capture a real plan:
EXPLAIN (ANALYZE, BUFFERS, WAL, VERBOSE, TIMING)  -- WAL option requires PG 13+
SELECT ...
WHERE ...
ORDER BY ...
LIMIT ...;
Read it like a pro:
- Node that dominates time: the node with the largest actual total time, net of its children.
- Rows vs. Rows Removed: misestimates show up as estimated vs. actual row counts drifting by orders of magnitude.
- Loops: a Nested Loop with a high loop count and a seq scan inside = pain.
- Buffers: shared read spikes = I/O-bound; temp written = spills.
- Filters/Join conditions: look for non-sargable predicates and implicit casts.
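The shape of trouble, in an illustrative excerpt (not from a real cluster); note the planner's 42-row estimate against 183,452 actual rows:
Seq Scan on orders  (cost=0.00..35811.00 rows=42 width=80) (actual time=0.412..912.301 rows=183452 loops=1)
  Filter: ((status = 'PAID'::text) AND (created_at >= (now() - '7 days'::interval)))
  Rows Removed by Filter: 1274511
  Buffers: shared hit=102 read=23199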
Real example: fixing a silent scan
Symptom: WoW detector flags queryid=… with +60% p95 and +90% total_exec_ms.
Plan: Seq Scan on orders with filter created_at >= now() - interval '7 days' and status = 'PAID'::text.
Root cause: Missing composite index + poor selectivity estimate after a data skew.
Fix:
CREATE INDEX CONCURRENTLY idx_orders_paid_recent
ON orders (status, created_at DESC)
INCLUDE (customer_id, total);
Proof: Rerun EXPLAIN (ANALYZE, BUFFERS). Look for an Index Only Scan (or at least an Index Scan) and a large drop in actual time and shared reads, often 10–100x.
Guardrail: Add a test that asserts p95 under your target for a 7-day hot window workload.
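A minimal guardrail sketch in plain SQL, using a hypothetical queryid and the 40 ms target from the SLO section below:
DO $$
DECLARE v_p95 double precision;
BEGIN
  SELECT p95_ms INTO v_p95
  FROM perf.pgss_daily
  WHERE queryid = 1234567890123  -- hypothetical fingerprint under test
    AND day = date_trunc('day', now());
  IF v_p95 > 40 THEN  -- a missing row leaves v_p95 NULL and passes vacuously
    RAISE EXCEPTION 'p95 % ms exceeds the 40 ms target', v_p95;
  END IF;
END $$;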
SLOs that don’t wiggle
- Per query family (by queryid or a curated label):
  - p95 latency (e.g., < 40 ms)
  - Total DB time per day (e.g., < 8% of cluster time)
  - No temp spills on hot paths
  - Cache efficiency: hit/read ratio above X
- Release gate: Reject if any hot query exceeds +25% p95 vs. baseline on a 10–15 min synthetic load (see the sketch below).
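A sketch of that gate, assuming a hypothetical perf.baseline table captured from a known-good build (same shape as perf.pgss_daily):
-- The gate fails if this returns any rows
SELECT d.queryid, d.p95_ms, b.p95_ms AS baseline_p95
FROM perf.pgss_daily d
JOIN perf.baseline b USING (queryid)
WHERE d.day = date_trunc('day', now())
  AND d.p95_ms > b.p95_ms * 1.25;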
Best practices
- Budget by total time, not just p95. One “fast” but ultra-hot query can eat half your CPU.
- Label query families. Store a curated mapping table perf.query_labels(queryid, label, owner); see the sketch after this list.
- Reset wisely. Use SELECT pg_stat_statements_reset(); after a deploy to isolate effects—but only after snapshotting.
- Vacuum/Analyze discipline. Stale stats = wrong plans. Analyze big tables after bulk changes.
- Guard work_mem. Unbounded sorts/hash joins spill; prefer narrower SELECTs and right-sized work_mem.
- Parameter stability. Avoid plan flip-flops: stabilize with plan_cache_mode = force_custom_plan only for outliers, or rewrite to keep cardinalities predictable.
- JIT & parallelism. Test toggles per query. Some OLTP paths hate JIT; some analytics love parallel.
- Composite indexes > Swiss-army knives. Build for access patterns, not for “maybe someday.”
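The label map is just a small table; one possible shape:
CREATE TABLE IF NOT EXISTS perf.query_labels (
  queryid bigint PRIMARY KEY,
  label   text NOT NULL,  -- e.g., 'checkout.load_cart'
  owner   text            -- team or engineer on the hook
);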
Common pitfalls
- Chasing estimates. The planner’s estimates are hints. Only ANALYZE-measured time is truth.
- Averages hide pain. Track p95/p99 and total_exec_time.
- Ignoring temp I/O. Spills show up as temp blocks and crush latency under load.
- One-shot fire drills. If you don’t snapshot, you can’t prove the fix—or the regression.
- Mystery parameters. Implicit casts (text vs. varchar, timestamp vs. timestamptz) can nuke index usage; see the demo below.
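To see the cast effect, compare plans with and without a cast on the column side (illustrative; assumes an index on orders.customer_id):
-- Casting the column defeats the index: expect a Seq Scan
EXPLAIN SELECT * FROM orders WHERE customer_id::text = '42';
-- Matching types lets the planner use the index: expect an Index Scan
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;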
Minimal Python collector (optional)
import psycopg2  # third-party driver: pip install psycopg2-binary

# Copy the current pg_stat_statements counters into the snapshot table.
INSERT_SQL = """
INSERT INTO perf.pgss_snapshot
(captured_at, userid, dbid, queryid, query, calls, total_exec_time,
mean_exec_time, rows, shared_blks_hit, shared_blks_read, temp_blks_read,
temp_blks_written, wal_bytes)
SELECT now(), userid, dbid, queryid, query, calls, total_exec_time,
mean_exec_time, rows, shared_blks_hit, shared_blks_read,
temp_blks_read, temp_blks_written, wal_bytes
FROM pg_stat_statements;
"""
# psycopg2's connection context manager commits on success, rolls back on error.
with psycopg2.connect(dsn="dbname=app user=collector sslmode=require") as conn:
with conn.cursor() as cur:
cur.execute(INSERT_SQL)
Run every 5–10 minutes via cron/systemd/pg_cron.
Internal link ideas (add on your site)
- “Indexing Strategies that Actually Move the Needle”
- “Vacuum, Autovacuum, and Bloat: A Practical Guide”
- “Connection Pooling with pgbouncer for OLTP at Scale”
- “Write-Path Tuning: Fsync, WAL, and Checkpointing”
Conclusion & Takeaways
- Plan truth > estimate guesses. Always validate with EXPLAIN (ANALYZE, BUFFERS).
- pg_stat_statements turns noise into numbers. Track p95 and total DB time by query family.
- Automate snapshots and weekly diffs. Regressions surface themselves.
- Prove every fix. Same query, same input, better plan, lower time—or it didn’t happen.
Call to action: Start by enabling pg_stat_statements today. Ship the snapshot table, add the WoW regression query, and pick one hot query to fix by end of week. Plans don’t lie—make them work for you.
Image prompt
“A clean, modern data architecture dashboard showing a PostgreSQL cluster with pg_stat_statements metrics, and a side-by-side EXPLAIN ANALYZE plan—minimalistic, high contrast, isometric 3D, subtle blue/green palette, crisp labels for p95, total time, buffers.”
Tags
#PostgreSQL #pg_stat_statements #QueryOptimization #EXPLAINANALYZE #DatabasePerformance #SRE #DataEngineering #Latency #Indexes #Observability