Query Plans that Don’t Lie: Building a Performance Review with pg_stat_statements
Your app got slower after the last release, and everyone's got a theory: "it's the network," "it's JIT," "it's that new index." Opinions don't fix latency. Plans and measurements do. In PostgreSQL, the combo of pg_stat_statements plus disciplined plan reviews gives you a repeatable way to set performance targets, spot regressions, and prove the fix, without guesswork.
Why this matters
- Speed is a feature. Users feel p95, not averages.
- Regressions hide in plain sight. One small query change can dominate total database time.
- Query plans never lie; estimates do. Use `EXPLAIN (ANALYZE, BUFFERS)` to validate reality.
- `pg_stat_statements` is your X-ray. It fingerprints queries and aggregates time, I/O, and call counts across the cluster.
Architecture: a lightweight “Performance Review” loop
- Collect: Enable `pg_stat_statements`. Snapshot its metrics every N minutes into a "metrics" schema.
- Target: Define SLOs per query family (p95 latency, total DB time budget, rows/second).
- Compare: Daily/weekly rollups. Flag WoW deltas > X% in p95 or total_exec_time.
- Inspect: For flagged queries, run
EXPLAIN (ANALYZE, BUFFERS)on a representative input—plans don’t lie. - Fix: Indexing, rewrite, planner nudges, stats refresh, or config changes (work_mem, jit, parallel).
- Prove: Re-run the same query with the same input; watch the metric drop back below target.
- Prevent: Gate PRs with a small load-test plus a SQL baseline comparison.
This loop is simple enough for a single engineer, robust enough for an SRE team.
Setup: turn on pg_stat_statements
-- postgresql.conf (or ALTER SYSTEM SET)
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all -- include nested
pg_stat_statements.track_utility = off -- focus on DML/SELECT
pg_stat_statements.save = on
-- restart or reload as required, then:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
Tip: pg_stat_statements normalizes queries automatically, replacing constants with placeholders, so calls that differ only in parameter values share the same `queryid` fingerprint. (On PostgreSQL 14+, leave `compute_query_id` at its default of `auto`.)
What to watch (cheat sheet)
| Field (view) | What it tells you | Use it to… |
|---|---|---|
| `queryid`, `query` | Fingerprint + normalized SQL | Group and label "query families" |
| `calls` | Frequency | Separate "hot" from "rare" |
| `total_exec_time` | Time sink in ms | Find top offenders by total DB time |
| `mean_exec_time`, min/max | Averages & spread | Track shifts vs. baseline |
| `rows` | Rows returned | Spot row explosion |
| `shared_blks_read/hit` | Buffer activity | Distinguish I/O-bound vs cache-hot |
| `temp_blks_read/written` | Temp files | Identify sort/hash spilling |
| `wal_bytes` | Write amplification | Evaluate write-heavy patterns |
Column names differ across versions (e.g., `total_exec_time` was `total_time` before PostgreSQL 13); use your cluster's columns accordingly.
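To get a feel for these columns, a quick triage query might look like this (a sketch assuming PostgreSQL 13+ column names):

```sql
-- Top 10 query families by total execution time, with cache hit ratio
SELECT queryid,
       left(query, 60)                    AS sample_query,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       rows,
       round(100.0 * shared_blks_hit /
             NULLIF(shared_blks_hit + shared_blks_read, 0), 1) AS hit_pct
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```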
Create a durable metrics pipeline
1) Snapshot table + view
CREATE SCHEMA IF NOT EXISTS perf;
CREATE TABLE IF NOT EXISTS perf.pgss_snapshot (
captured_at timestamptz NOT NULL DEFAULT now(),
userid oid,
dbid oid,
queryid bigint,
query text,
calls bigint,
total_exec_time double precision,
mean_exec_time double precision,
rows bigint,
shared_blks_hit bigint,
shared_blks_read bigint,
temp_blks_read bigint,
temp_blks_written bigint,
wal_bytes numeric
);
-- Periodic snapshot (e.g., every 5–10 min via cron or pg_cron):
INSERT INTO perf.pgss_snapshot
SELECT now(), userid, dbid, queryid, query, calls,
total_exec_time, mean_exec_time, rows,
shared_blks_hit, shared_blks_read,
temp_blks_read, temp_blks_written, wal_bytes
FROM pg_stat_statements;
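One caveat: pg_stat_statements counters are cumulative since the last reset, so summing raw snapshots double-counts. A sketch of a per-interval delta view over the snapshot table above, using window functions:

```sql
-- Deltas between consecutive snapshots of the same query fingerprint
CREATE OR REPLACE VIEW perf.pgss_delta AS
SELECT captured_at,
       queryid,
       query,
       calls             - lag(calls)             OVER w AS d_calls,
       total_exec_time   - lag(total_exec_time)   OVER w AS d_total_ms,
       shared_blks_read  - lag(shared_blks_read)  OVER w AS d_reads,
       temp_blks_written - lag(temp_blks_written) OVER w AS d_temp_writes
FROM perf.pgss_snapshot
WINDOW w AS (PARTITION BY queryid, userid, dbid ORDER BY captured_at);
```

Negative deltas indicate a counter reset in between; filter those rows out before aggregating.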
2) Daily rollup for trend analysis
CREATE MATERIALIZED VIEW IF NOT EXISTS perf.pgss_daily AS
SELECT date_trunc('day', captured_at) AS day,
queryid,
max(query) FILTER (WHERE query IS NOT NULL) AS sample_query,
sum(calls) AS calls,
sum(total_exec_time) AS total_exec_ms,
percentile_disc(0.95) WITHIN GROUP (ORDER BY mean_exec_time) AS p95_ms,
sum(shared_blks_read) AS reads,
sum(shared_blks_hit) AS hits,
sum(temp_blks_written) AS temp_writes
FROM perf.pgss_snapshot
GROUP BY 1,2;
-- Refresh nightly
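The nightly refresh can be scheduled with pg_cron (an assumption; any external scheduler works just as well):

```sql
-- Requires the pg_cron extension; refresh the rollup daily at 02:00
SELECT cron.schedule('pgss-daily-refresh', '0 2 * * *',
                     $$REFRESH MATERIALIZED VIEW perf.pgss_daily$$);
```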
3) Regression detector (week-over-week)
WITH curr AS (
SELECT day, queryid, p95_ms, total_exec_ms
FROM perf.pgss_daily
  WHERE day >= date_trunc('week', now())
),
prev AS (
SELECT day, queryid, p95_ms, total_exec_ms
FROM perf.pgss_daily
WHERE day >= date_trunc('week', now()) - interval '1 week'
AND day < date_trunc('week', now())
),
agg AS (
SELECT c.queryid,
percentile_disc(0.95) WITHIN GROUP (ORDER BY c.p95_ms) AS curr_p95,
percentile_disc(0.95) WITHIN GROUP (ORDER BY p.p95_ms) AS prev_p95,
sum(c.total_exec_ms) AS curr_total_ms,
sum(p.total_exec_ms) AS prev_total_ms
FROM curr c JOIN prev p USING (queryid)
GROUP BY 1
)
SELECT queryid,
round((curr_p95 - prev_p95) / NULLIF(prev_p95,0) * 100, 1) AS p95_delta_pct,
round((curr_total_ms - prev_total_ms)/NULLIF(prev_total_ms,0)*100, 1) AS total_ms_delta_pct
FROM agg
WHERE (curr_p95 > prev_p95 * 1.25) OR (curr_total_ms > prev_total_ms * 1.25)
ORDER BY p95_delta_pct DESC NULLS LAST;
Start with a 25% threshold; tune to your noise level.
Make plans tell the truth
When a query flags as regressed, capture a real plan:
EXPLAIN (ANALYZE, BUFFERS, WAL, VERBOSE, TIMING)
SELECT ...
WHERE ...
ORDER BY ...
LIMIT ...;
Read it like a pro:
- Node that dominates time: the node with the largest actual (inclusive) time in the plan tree.
- Rows vs. Rows Removed: Misestimates show up as “expected vs actual” row counts drifting orders of magnitude.
- Loops: Nested Loop with high loop count + seq scan inside = pain.
- Buffers: `shared read` spikes = I/O-bound; `temp written` = spills.
- Filters/Join conditions: Look for non-sargable predicates and implicit casts.
Real example: fixing a silent scan
Symptom: WoW detector flags queryid=… with +60% p95 and +90% total_exec_ms.
Plan: Seq Scan on orders with filter created_at >= now() - interval '7 days' and status = 'PAID'::text.
Root cause: Missing composite index + poor selectivity estimate after a data skew.
Fix:
CREATE INDEX CONCURRENTLY idx_orders_paid_recent
ON orders (status, created_at DESC)
INCLUDE (customer_id, total);
Proof: Rerun EXPLAIN (ANALYZE, BUFFERS). Look for Index Only Scan and a 10–100x drop in actual time and shared reads.
Guardrail: Add a test that asserts p95 under your target for a 7-day hot window workload.
SLOs that don’t wiggle
- Per query family (by `queryid` or a curated label):
  - p95 latency (e.g., < 40 ms)
  - Total DB time per day (e.g., < 8% of cluster time)
  - No temp spills on hot paths
  - Cache efficiency: hit/read ratio above X
- Release gate: Reject if any hot query exceeds +25% p95 vs baseline on a 10–15 min synthetic load.
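An SLO check against the daily rollup can be a one-liner; this sketch assumes a hypothetical `perf.slo_targets(queryid, p95_target_ms)` table you maintain by hand:

```sql
-- Flag query families whose p95 exceeded their target yesterday
SELECT d.queryid, d.p95_ms, t.p95_target_ms
FROM perf.pgss_daily d
JOIN perf.slo_targets t USING (queryid)
WHERE d.day = date_trunc('day', now() - interval '1 day')
  AND d.p95_ms > t.p95_target_ms;
```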
Best practices
- Budget by total time, not just p95. One “fast” but ultra-hot query can eat half your CPU.
- Label query families. Store a curated mapping table `perf.query_labels(queryid, label, owner)`.
- Reset wisely. Use `SELECT pg_stat_statements_reset();` after a deploy to isolate effects, but only after snapshotting.
- Vacuum/Analyze discipline. Stale stats = wrong plans. Analyze big tables after bulk changes.
- Guard work_mem. Unbounded sorts/hash joins spill; prefer narrower SELECTs and right-sized `work_mem`.
- Parameter stability. Avoid plan flip-flops: set `plan_cache_mode = force_custom_plan` only for outliers, or rewrite to keep cardinalities predictable.
- JIT & parallelism. Test toggles per query. Some OLTP paths hate JIT; some analytics love parallel.
- Composite indexes > Swiss-army knives. Build for access patterns, not for “maybe someday.”
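The labeling table mentioned above might be sketched like this (column names are suggestions, not a standard):

```sql
CREATE TABLE IF NOT EXISTS perf.query_labels (
    queryid bigint PRIMARY KEY,
    label   text NOT NULL,   -- e.g. 'checkout:recent-paid-orders'
    owner   text             -- team or engineer responsible
);

-- Join labels into the daily rollup for readable reports
SELECT coalesce(l.label, d.queryid::text) AS family,
       d.day, d.p95_ms, d.total_exec_ms
FROM perf.pgss_daily d
LEFT JOIN perf.query_labels l USING (queryid);
```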
Common pitfalls
- Chasing estimates. The planner's estimates are hints. Only `EXPLAIN ANALYZE` timings are truth.
- Averages hide pain. Track p95/p99 and total_exec_time.
- Ignoring temp IO. Spills show up as temp blocks and crush latency under load.
- One-shot fire drills. If you don’t snapshot, you can’t prove the fix—or the regression.
- Mystery parameters. Implicit casts (`text` vs `varchar`, `timestamp` vs `timestamptz`) can nuke indexes.
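As an illustration of the cast pitfall, using the `orders` table from the earlier example (a sketch; adapt to your schema):

```sql
-- Non-sargable: casting the indexed column defeats any index on created_at
SELECT * FROM orders WHERE created_at::date = '2024-06-01';

-- Sargable rewrite: a range predicate on the raw timestamptz column
SELECT * FROM orders
WHERE created_at >= '2024-06-01'::timestamptz
  AND created_at <  '2024-06-02'::timestamptz;
```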
Minimal Python collector (optional)
import psycopg2

INSERT_SQL = """
INSERT INTO perf.pgss_snapshot
    (captured_at, userid, dbid, queryid, query, calls, total_exec_time,
     mean_exec_time, rows, shared_blks_hit, shared_blks_read, temp_blks_read,
     temp_blks_written, wal_bytes)
SELECT now(), userid, dbid, queryid, query, calls, total_exec_time,
       mean_exec_time, rows, shared_blks_hit, shared_blks_read,
       temp_blks_read, temp_blks_written, wal_bytes
FROM pg_stat_statements;
"""

conn = psycopg2.connect("dbname=app user=collector sslmode=require")
try:
    with conn:  # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(INSERT_SQL)
finally:
    conn.close()  # the transaction context manager does not close the connection
Run every 5–10 minutes via cron/systemd/pg_cron.
Conclusion & Takeaways
- Plan truth > estimate guesses. Always validate with `EXPLAIN (ANALYZE, BUFFERS)`.
- `pg_stat_statements` turns noise into numbers. Track p95 and total DB time by query family.
- Automate snapshots and weekly diffs. Regressions surface themselves.
- Prove every fix. Same query, same input, better plan, lower time—or it didn’t happen.
Call to action: Start by enabling pg_stat_statements today. Ship the snapshot table, add the WoW regression query, and pick one hot query to fix by end of week. Plans don’t lie—make them work for you.