Query Plans that Don’t Lie: Building a Performance Review with pg_stat_statements

Your app got slower after the last release, and everyone’s got a theory: “it’s the network,” “it’s JIT,” “it’s that new index.” Opinions don’t fix latency; plans and measurements do. In PostgreSQL, the combination of pg_stat_statements and disciplined plan reviews gives you a repeatable way to set performance targets, spot regressions, and prove the fix, without guesswork.


Why this matters

  • Speed is a feature. Users feel p95, not averages.
  • Regressions hide in plain sight. One small query change can dominate total database time.
  • Query plans never lie—estimates do. Use EXPLAIN (ANALYZE, BUFFERS) to validate reality.
  • pg_stat_statements is your X-ray. It fingerprints queries, aggregates time, I/O, and call counts across the cluster.

Architecture: a lightweight “Performance Review” loop

  1. Collect: Enable pg_stat_statements. Snapshot its metrics every N minutes into a “metrics” schema.
  2. Target: Define SLOs per query family (p95 latency, total DB time budget, rows/second).
  3. Compare: Daily/weekly rollups. Flag WoW deltas > X% in p95 or total_exec_time.
  4. Inspect: For flagged queries, run EXPLAIN (ANALYZE, BUFFERS) on a representative input—plans don’t lie.
  5. Fix: Indexing, rewrite, planner nudges, stats refresh, or config changes (work_mem, jit, parallel).
  6. Prove: Re-run the same plan; watch the metric roll back below target.
  7. Prevent: Gate PRs with a small load-test plus a SQL baseline comparison.

This loop is simple enough for a single engineer, robust enough for an SRE team.


Setup: turn on pg_stat_statements

-- postgresql.conf (or ALTER SYSTEM SET)
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all          -- include nested
pg_stat_statements.track_utility = off  -- focus on DML/SELECT
pg_stat_statements.save = on

-- restart or reload as required, then:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Tip: pg_stat_statements normalizes queries automatically—constants are replaced with placeholders ($1, $2, …)—so parameterized variants of the same statement share one queryid and one row in the view.
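Before building anything on top of the extension, a quick sanity check that it is actually loaded and collecting is worthwhile:

```sql
-- Is the library preloaded? (requires a restart to change)
SHOW shared_preload_libraries;

-- Is the extension created in this database, and at what version?
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_stat_statements';

-- Is it tracking anything yet?
SELECT count(*) AS tracked_statements FROM pg_stat_statements;
```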


What to watch (cheat sheet)

| Field (view)            | What it tells you            | Use it to…                          |
|-------------------------|------------------------------|-------------------------------------|
| queryid, query          | Fingerprint + normalized SQL | Group and label “query families”    |
| calls                   | Frequency                    | Separate “hot” from “rare”          |
| total_exec_time         | Time sink in ms              | Find top offenders by total DB time |
| mean_exec_time, min/max | Averages & spread            | Track shifts vs. baseline           |
| rows                    | Rows returned                | Spot row explosion                  |
| shared_blks_read/hit    | Buffer activity              | Distinguish I/O-bound vs cache-hot  |
| temp_blks_read/written  | Temp files                   | Identify sort/hash spilling         |
| wal_bytes               | Write amplification          | Evaluate write-heavy patterns       |

Column names vary by PostgreSQL version (e.g., total_time before v13 became total_exec_time in v13+); use your cluster’s columns accordingly.
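A practical first query against these columns (PostgreSQL 13+ names assumed) ranks query families by total database time:

```sql
SELECT queryid,
       left(query, 60)                    AS sample,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       rows,
       shared_blks_read,
       temp_blks_written
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```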


Create a durable metrics pipeline

1) Snapshot table + view

CREATE SCHEMA IF NOT EXISTS perf;

CREATE TABLE IF NOT EXISTS perf.pgss_snapshot (
  captured_at      timestamptz NOT NULL DEFAULT now(),
  userid           oid,
  dbid             oid,
  queryid          bigint,
  query            text,
  calls            bigint,
  total_exec_time  double precision,
  mean_exec_time   double precision,
  rows             bigint,
  shared_blks_hit  bigint,
  shared_blks_read bigint,
  temp_blks_read   bigint,
  temp_blks_written bigint,
  wal_bytes        numeric
);

-- Periodic snapshot (e.g., every 5–10 min via cron or pg_cron):
INSERT INTO perf.pgss_snapshot
SELECT now(), userid, dbid, queryid, query, calls,
       total_exec_time, mean_exec_time, rows,
       shared_blks_hit, shared_blks_read,
       temp_blks_read, temp_blks_written, wal_bytes
FROM pg_stat_statements;
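If pg_cron is available, the snapshot can be scheduled in-database rather than from cron or systemd (this assumes the pg_cron extension is installed and its jobs run in the target database):

```sql
SELECT cron.schedule(
  'pgss-snapshot',   -- job name
  '*/5 * * * *',     -- every 5 minutes
  $$INSERT INTO perf.pgss_snapshot
    SELECT now(), userid, dbid, queryid, query, calls,
           total_exec_time, mean_exec_time, rows,
           shared_blks_hit, shared_blks_read,
           temp_blks_read, temp_blks_written, wal_bytes
    FROM pg_stat_statements$$
);
```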

2) Daily rollup for trend analysis

CREATE MATERIALIZED VIEW IF NOT EXISTS perf.pgss_daily AS
SELECT date_trunc('day', captured_at) AS day,
       queryid,
       max(query) FILTER (WHERE query IS NOT NULL) AS sample_query,
       sum(calls) AS calls,
       sum(total_exec_time) AS total_exec_ms,
       -- p95 over snapshot-level means: a coarse proxy, since pg_stat_statements
       -- keeps no per-call histogram
       percentile_disc(0.95) WITHIN GROUP (ORDER BY mean_exec_time) AS p95_ms,
       sum(shared_blks_read) AS reads,
       sum(shared_blks_hit)  AS hits,
       sum(temp_blks_written) AS temp_writes
FROM perf.pgss_snapshot
GROUP BY 1,2;

-- Refresh nightly:
REFRESH MATERIALIZED VIEW perf.pgss_daily;

3) Regression detector (week-over-week)

WITH curr AS (
  SELECT day, queryid, p95_ms, total_exec_ms
  FROM perf.pgss_daily
  WHERE day >= date_trunc('week', now())
),
prev AS (
  SELECT day, queryid, p95_ms, total_exec_ms
  FROM perf.pgss_daily
  WHERE day >= date_trunc('week', now()) - interval '1 week'
    AND day <  date_trunc('week', now())
),
agg AS (
  SELECT c.queryid,
         percentile_disc(0.95) WITHIN GROUP (ORDER BY c.p95_ms) AS curr_p95,
         percentile_disc(0.95) WITHIN GROUP (ORDER BY p.p95_ms) AS prev_p95,
         sum(c.total_exec_ms) AS curr_total_ms,
         sum(p.total_exec_ms) AS prev_total_ms
  FROM curr c JOIN prev p USING (queryid)
  GROUP BY 1
)
SELECT queryid,
       round((curr_p95 - prev_p95) / NULLIF(prev_p95,0) * 100, 1) AS p95_delta_pct,
       round((curr_total_ms - prev_total_ms)/NULLIF(prev_total_ms,0)*100, 1) AS total_ms_delta_pct
FROM agg
WHERE (curr_p95 > prev_p95 * 1.25) OR (curr_total_ms > prev_total_ms * 1.25)
ORDER BY p95_delta_pct DESC NULLS LAST;

Start with a 25% threshold; tune to your noise level.


Make plans tell the truth

When a query flags as regressed, capture a real plan:

EXPLAIN (ANALYZE, BUFFERS, WAL, VERBOSE, TIMING)
SELECT ...
WHERE ...
ORDER BY ...
LIMIT ...;

Read it like a pro:

  • Node that dominates time: compare each node’s actual total time (actual time × loops), subtracting its children, to find where the time really goes.
  • Rows vs. Rows Removed: Misestimates show up as “expected vs actual” row counts drifting orders of magnitude.
  • Loops: Nested Loop with high loop count + seq scan inside = pain.
  • Buffers: shared read spikes = I/O-bound; temp written = spills.
  • Filters/Join conditions: Look for non-sargable predicates and implicit casts.

Real example: fixing a silent scan

Symptom: WoW detector flags queryid=… with +60% p95 and +90% total_exec_ms.
Plan: Seq Scan on orders with filter created_at >= now() - interval '7 days' and status = 'PAID'::text.

Root cause: Missing composite index + poor selectivity estimate after a data skew.

Fix:

CREATE INDEX CONCURRENTLY idx_orders_paid_recent
ON orders (status, created_at DESC)
INCLUDE (customer_id, total);

Proof: Rerun EXPLAIN (ANALYZE, BUFFERS). Look for Index Only Scan and a 10–100x drop in actual time and shared reads.
Guardrail: Add a test that asserts p95 under your target for a 7-day hot window workload.
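Since the root cause included a stale selectivity estimate, refreshing statistics belongs in the fix as well. A sketch of the proof step (the column and index names mirror the hypothetical example above):

```sql
ANALYZE orders;  -- refresh planner statistics after the data skew

EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, total
FROM orders
WHERE status = 'PAID'
  AND created_at >= now() - interval '7 days';
-- Expect an Index Only Scan using idx_orders_paid_recent and far fewer shared reads
```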


SLOs that don’t wiggle

  • Per query family (by queryid or a curated label):
    • p95 latency (e.g., < 40 ms)
    • Total DB time per day (e.g., < 8% of cluster time)
    • No temp spills on hot paths
    • Cache efficiency: hit/read ratio above X
  • Release gate: Reject if any hot query exceeds +25% p95 vs baseline on a 10–15 min synthetic load.
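The cache-efficiency and temp-spill SLOs can be checked straight from the snapshot table; a sketch, per query family over the last day:

```sql
SELECT queryid,
       round(100.0 * sum(shared_blks_hit)
             / NULLIF(sum(shared_blks_hit) + sum(shared_blks_read), 0), 1) AS hit_pct,
       sum(temp_blks_written) AS temp_writes
FROM perf.pgss_snapshot
WHERE captured_at >= now() - interval '1 day'
GROUP BY queryid
ORDER BY hit_pct ASC NULLS LAST;
```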

Best practices

  • Budget by total time, not just p95. One “fast” but ultra-hot query can eat half your CPU.
  • Label query families. Store a curated mapping table perf.query_labels(queryid, label, owner).
  • Reset wisely. Use SELECT pg_stat_statements_reset(); after a deploy to isolate effects—but only after snapshotting.
  • Vacuum/Analyze discipline. Stale stats = wrong plans. Analyze big tables after bulk changes.
  • Guard work_mem. Unbounded sorts/hash joins spill; prefer narrower SELECTs and right-sized work_mem.
  • Parameter stability. Avoid plan flip-flops: stabilize with plan_cache_mode = force_custom_plan only for outliers, or rewrite to keep cardinalities predictable.
  • JIT & parallelism. Test toggles per query. Some OLTP paths hate JIT; some analytics love parallel.
  • Composite indexes > Swiss-army knives. Build for access patterns, not for “maybe someday.”
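The labeling table mentioned above can be as simple as the following (label and owner values are yours to define):

```sql
CREATE TABLE IF NOT EXISTS perf.query_labels (
  queryid bigint PRIMARY KEY,
  label   text NOT NULL,   -- e.g. 'checkout.load_cart'
  owner   text             -- team or engineer responsible
);

-- Join labels into the daily rollup:
SELECT coalesce(l.label, d.queryid::text) AS family,
       sum(d.total_exec_ms) AS total_ms
FROM perf.pgss_daily d
LEFT JOIN perf.query_labels l USING (queryid)
GROUP BY 1
ORDER BY total_ms DESC;
```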

Common pitfalls

  • Chasing estimates. The planner’s estimates are hints. Only ANALYZE time is truth.
  • Averages hide pain. Track p95/p99 and total_exec_time.
  • Ignoring temp IO. Spills show up as temp blocks and crush latency under load.
  • One-shot fire drills. If you don’t snapshot, you can’t prove the fix—or the regression.
  • Mystery parameters. Implicit casts (text vs varchar, timestamp vs timestamptz) can nuke indexes.
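A concrete instance of the last pitfall, assuming a timestamptz column created_at with a plain btree index:

```sql
-- Non-sargable: casting the indexed column defeats the index
SELECT * FROM orders WHERE created_at::date = '2024-06-01';

-- Sargable rewrite: a range predicate on the raw column can use the index
SELECT * FROM orders
WHERE created_at >= '2024-06-01'
  AND created_at <  '2024-06-02';
```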

Minimal Python collector (optional)

import psycopg2

INSERT_SQL = """
INSERT INTO perf.pgss_snapshot
(captured_at, userid, dbid, queryid, query, calls, total_exec_time,
 mean_exec_time, rows, shared_blks_hit, shared_blks_read, temp_blks_read,
 temp_blks_written, wal_bytes)
SELECT now(), userid, dbid, queryid, query, calls, total_exec_time,
       mean_exec_time, rows, shared_blks_hit, shared_blks_read,
       temp_blks_read, temp_blks_written, wal_bytes
FROM pg_stat_statements;
"""

conn = psycopg2.connect("dbname=app user=collector sslmode=require")
try:
    with conn:  # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(INSERT_SQL)
finally:
    conn.close()  # the connection context manager does not close the connection

Run every 5–10 minutes via cron/systemd/pg_cron.



Conclusion & Takeaways

  • Plan truth > estimate guesses. Always validate with EXPLAIN (ANALYZE, BUFFERS).
  • pg_stat_statements turns noise into numbers. Track p95 and total DB time by query family.
  • Automate snapshots and weekly diffs. Regressions surface themselves.
  • Prove every fix. Same query, same input, better plan, lower time—or it didn’t happen.

Call to action: Start by enabling pg_stat_statements today. Ship the snapshot table, add the WoW regression query, and pick one hot query to fix by end of week. Plans don’t lie—make them work for you.
