From Heap to Healthy: Vacuum, Bloat, and MVCC Explained for Humans

Practical autovacuum tuning with measurable thresholds

Introduction — why you should care

If your Postgres is “mysteriously” slowing down, growing without bound, or choking on simple queries, it’s probably not cursed. It’s bloat. Under MVCC, deleted/updated rows hang around as dead tuples until VACUUM reclaims them. Ignore this and your heap swells, indexes rot, and cache hit ratios tank. The fix isn’t magic—it’s disciplined autovacuum tuning with hard numbers.

This guide cuts the fluff. You’ll learn how MVCC creates bloat, why VACUUM isn’t optional, and exactly when to vacuum—using thresholds you can measure and justify.


MVCC in 90 seconds (for real)

  • MVCC: Every update creates a new row version; old versions stick around to preserve readers’ snapshots.
  • Dead tuples: Old row versions that no active transaction can see. They waste heap space and mislead indexes.
  • Visibility map (VM): Tracks pages that contain only tuples visible to all. Helps index-only scans and VACUUM skip-clean.
  • Free space map (FSM): Where Postgres remembers reusable space.

Key truth: VACUUM doesn’t compact files—it marks space reusable. File shrinkage requires VACUUM FULL (a rewrite) or CLUSTER.
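
Want to see it? A minimal sketch you can run on a scratch database (the demo table name is made up):

CREATE TABLE mvcc_demo (id int PRIMARY KEY, payload text);
INSERT INTO mvcc_demo SELECT g, 'v1' FROM generate_series(1, 100000) g;

UPDATE mvcc_demo SET payload = 'v2';        -- every row now leaves behind a dead version

SELECT n_live_tup, n_dead_tup               -- expect roughly 100k dead tuples (stats can lag a moment)
FROM pg_stat_user_tables
WHERE relname = 'mvcc_demo';

VACUUM mvcc_demo;                           -- space becomes reusable; the file does not shrink

SELECT n_dead_tup, pg_size_pretty(pg_table_size('mvcc_demo')) AS tbl_size
FROM pg_stat_user_tables
WHERE relname = 'mvcc_demo';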


The three tools: which one, when?

| Action                    | What it does                        | Pros                | Cons                 | Use when                       |
|---------------------------|-------------------------------------|---------------------|----------------------|--------------------------------|
| VACUUM                    | Cleans dead tuples, updates VM/FSM  | Online, fast        | Doesn't shrink file  | Routine maintenance            |
| VACUUM (VERBOSE, ANALYZE) | Cleans + refreshes stats            | Keeps planner smart | Slightly more work   | After large churn              |
| VACUUM FULL               | Rewrites table, compacts file       | Frees disk          | Exclusive lock; slow | Once-off after massive deletes |
| REINDEX                   | Rebuilds indexes, removing bloat    | Restores index size | Locks each index     | High-churn, hot-update tables  |

How autovacuum actually triggers (the math)

Autovacuum fires when dead tuples exceed:

vacuum threshold = autovacuum_vacuum_threshold
                   + autovacuum_vacuum_scale_factor * reltuples

Analyze fires when row churn exceeds:

analyze threshold = autovacuum_analyze_threshold
                    + autovacuum_analyze_scale_factor * reltuples

Defaults (often too lax for busy OLTP):

  • autovacuum_vacuum_threshold = 50
  • autovacuum_vacuum_scale_factor = 0.2 (20% of table)
  • autovacuum_analyze_threshold = 50
  • autovacuum_analyze_scale_factor = 0.1 (10% of table)

Blunt read: Waiting for 20% of a 50M-row table to die means 10M dead tuples. That’s negligence.
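
To see the values your server actually runs with (globals; per-table reloptions override them):

SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_threshold',
               'autovacuum_vacuum_scale_factor',
               'autovacuum_analyze_threshold',
               'autovacuum_analyze_scale_factor')
ORDER BY name;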


Production-grade thresholds (copy/paste + explain)

Use smaller scale factors for hot tables and larger ones for cold, append-only sets.

Rule of thumb by table size (OLTP):

  • Small (<5M rows): scale_factor = 0.05, threshold = 5,000
  • Medium (5–50M): scale_factor = 0.02, threshold = 20,000
  • Large (>50M): scale_factor = 0.005, threshold = 50,000

Per-table settings (storage parameters):

ALTER TABLE public.orders
  SET (autovacuum_vacuum_scale_factor = 0.02,   -- 2% of rows
       autovacuum_vacuum_threshold     = 20000, -- + 20k floor
       autovacuum_analyze_scale_factor = 0.01,  -- 1% churn
       autovacuum_analyze_threshold    = 10000, -- + 10k floor
       autovacuum_vacuum_cost_limit    = 2000,  -- be less throttled
       autovacuum_vacuum_cost_delay    = 2);    -- ms; tune to reduce IO spikes
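
To confirm the overrides landed (and to audit which tables already have them), check pg_class.reloptions:

SELECT relname, reloptions
FROM pg_class
WHERE relkind = 'r'
  AND reloptions IS NOT NULL;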

When to go even lower:

  • Tables updated >10% daily
  • Heavy index-only scans (protect the VM)
  • High delete bursts (TTL/soft-deletes)

Quick health checks you can run today

1) Where autovacuum is falling behind

SELECT
  now() - last_autovacuum  AS since_vacuum,
  n_dead_tup,
  relname,
  n_live_tup,              -- live-row estimate (pg_stat_user_tables has no reltuples column)
  pg_size_pretty(pg_table_size(relid)) AS tbl_size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

Interpretation: If n_dead_tup is already above your computed threshold (formula above), it’s late.

2) Estimated table bloat (no extensions)

SELECT
  c.relname,
  pg_table_size(c.oid)                       AS table_bytes,
  pg_relation_size(c.oid)                    AS main_fork_bytes,
  pg_table_size(c.oid) - pg_relation_size(c.oid) AS approx_bloat_bytes
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog','information_schema')
  AND c.relkind = 'r'
ORDER BY approx_bloat_bytes DESC
LIMIT 20;

Not perfect: the gap here mostly reflects TOAST data plus the FSM/VM forks rather than true dead space, so treat it only as a rough signal. For real numbers, use pgstattuple.
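
A sketch with the pgstattuple extension (note: it scans the whole table, so run it off-peak on big relations):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT tuple_percent,          -- live data as % of the table
       dead_tuple_percent,     -- dead tuples as % of the table
       free_percent            -- reusable free space
FROM pgstattuple('public.orders');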

3) Index bloat suspects

SELECT
  schemaname, relname AS table_name, indexrelname AS index_name,
  idx_scan, idx_tup_read, idx_tup_fetch,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;

If an index is huge and idx_scan is near zero, rethink it; if it’s hot with many updates, schedule REINDEX CONCURRENTLY.
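
For B-tree indexes, the same pgstattuple extension ships pgstatindex, a more direct bloat signal than raw size (sketch; the index name is the example one used later in this post):

SELECT avg_leaf_density,       -- far below the ~90% of a freshly built index suggests bloat
       leaf_fragmentation
FROM pgstatindex('public.orders_created_at_idx');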


Worked example (numbers you can sanity-check)

Say orders has 40M rows (reltuples=40,000,000). With defaults, vacuum waits for
50 + 0.2 * 40M = 8,000,050 dead tuples. Bad.

Apply medium policy:

  • scale = 0.02, threshold = 20,000
  • Vacuum when 20,000 + 0.02 * 40M = 820,000 dead tuples

That’s a ~10× earlier cleanup, which:

  • Shrinks heap churn, keeps VM accurate
  • Preserves cache locality
  • Prevents runaway index bloat

You’ll see steadier latency and fewer autovacuum panics.
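
Sanity-check the arithmetic yourself:

SELECT (50    + 0.2  * 40000000)::bigint AS default_trigger,  -- 8,000,050
       (20000 + 0.02 * 40000000)::bigint AS tuned_trigger;    -- 820,000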


Architecture & scheduling that actually works

Global knobs (don’t starve the cleaner)

# postgresql.conf
autovacuum = on
autovacuum_vacuum_cost_limit = 2000     # raise if IO budget allows
autovacuum_vacuum_cost_delay = 2ms      # reduce if falling behind
autovacuum_naptime = 10s                # scan more often (default 60s)
maintenance_work_mem = 2GB              # fast vacuum & reindex (per worker)
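
Everything in this block takes effect on a reload (autovacuum_max_workers, below, is the one that needs a restart):

SELECT pg_reload_conf();       -- re-read postgresql.conf without restarting
SHOW autovacuum_naptime;       -- confirm the new value is live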

Concurrency

  • Increase autovacuum_max_workers to match CPU/IO headroom (e.g., 6–10 on beefy boxes).
  • If workers fight with OLTP, schedule heavier work off-peak and keep routine vacuum cheap/frequent.

Avoiding wraparound disasters

Always leave vacuum_freeze_min_age and vacuum_freeze_table_age at sane defaults unless you understand the trade-offs. If autovacuum warns about wraparound, stop everything and vacuum—this is not optional.
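
A quick check on how close each database sits to the wraparound ceiling (autovacuum_freeze_max_age, default 200 million):

SELECT datname,
       age(datfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_database
ORDER BY xid_age DESC;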


Best practices (short, sharp)

  • Lower scale factors on hot tables. Start 0.01–0.02; measure and iterate.
  • Pinpoint per-table tuning. Don’t over-tighten globally; cost lives on hot paths.
  • Protect the VM. Frequent vacuum keeps index-only scans fast.
  • Control deletes. Batch large deletes; consider partitioning + DROP PARTITION over row-by-row purges (see the sketch after this list).
  • Measure, don’t guess. Track n_dead_tup, last_autovacuum, sizes, and IO.
  • Reindex periodically. Especially for wide/updated indexes. Use REINDEX CONCURRENTLY.
  • Use partitions for TTL data. Vacuum won’t save you from abusive churn patterns.
  • After massive churn: VACUUM (ANALYZE) immediately to fix planner stats.
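
The partition-over-delete point, sketched with hypothetical monthly partitions (names are illustrative; assumes an events table range-partitioned by a timestamp):

-- Instead of: DELETE FROM events WHERE created_at < now() - interval '90 days';
ALTER TABLE events DETACH PARTITION events_2024_01;   -- near-instant, creates no dead tuples
DROP TABLE events_2024_01;                            -- disk space goes back to the OS immediately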

Common pitfalls (seen in the wild)

  • Default 0.2 scale factor on big tables. It’s a time bomb.
  • Assuming VACUUM shrinks files. It doesn’t. Plan occasional VACUUM FULL or partition maintenance.
  • Starving autovacuum. Too few workers or low cost limit = backlog.
  • Ignoring indexes. Table looks fine; indexes are 3× the heap. Reindex them.
  • Running VACUUM FULL at noon. It locks; do it off-hours and communicate.

Minimal scripts you’ll actually use

1) Compute table-specific thresholds and show “late” tables

WITH stats AS (
  SELECT s.relid, s.relname, c.reltuples::bigint AS reltuples, s.n_dead_tup
  FROM pg_stat_user_tables s
  JOIN pg_class c ON c.oid = s.relid  -- reltuples lives in pg_class, not pg_stat_user_tables
),
params AS (
  SELECT  -- global settings; any per-table reloptions override these in practice
    current_setting('autovacuum_vacuum_threshold')::int      AS base_vac,
    current_setting('autovacuum_vacuum_scale_factor')::float AS scale_vac
)
SELECT s.relname,
       s.reltuples,
       s.n_dead_tup,
       (p.base_vac + p.scale_vac * s.reltuples)::bigint AS vac_trigger,
       (s.n_dead_tup - (p.base_vac + p.scale_vac * s.reltuples))::bigint AS overdue_by
FROM stats s CROSS JOIN params p
WHERE s.n_dead_tup > (p.base_vac + p.scale_vac * s.reltuples)
ORDER BY overdue_by DESC
LIMIT 20;

2) One-off: aggressively clean a hot table (online)

VACUUM (VERBOSE, ANALYZE) public.orders;
-- If indexes look inflated:
REINDEX INDEX CONCURRENTLY public.orders_created_at_idx;

3) Post-delete compaction (use sparingly; locks)

VACUUM FULL public.event_log;
ANALYZE public.event_log;

Tuning checklist you can run this week

  • Identify top 10 churn tables (n_dead_tup & write TPS).
  • Apply per-table scale factors (0.5–2% typical) and thresholds (10–50k).
  • Raise autovacuum_max_workers and autovacuum_vacuum_cost_limit.
  • Add dashboards for n_dead_tup, last_autovacuum, table/index sizes.
  • Partition high-delete/TTL tables; replace deletes with dropping old partitions.
  • Schedule periodic REINDEX CONCURRENTLY for hot indexes.
  • Document the policy. Enforce in migrations (DDL scripts with reloptions).

Internal link ideas

  • “Postgres Partitioning for TTL and Cost Control”
  • “Designing Indexes for High-Churn OLTP Tables”
  • “When to Use REINDEX CONCURRENTLY vs VACUUM FULL”
  • “MVCC Deep Dive: Snapshots, Free Space Map, and Visibility Map”
  • “Autovacuum Dashboards: What to Alert On (and Why)”

Conclusion + Takeaways

MVCC is your friend—if you vacuum early and often. Don’t let default thresholds decide your latency. Put numbers on the wall: compute triggers, monitor dead tuples, tune per table, and keep index bloat in check. Your heap stays healthy, queries stay boringly fast, and growth stays predictable.

Call to action: pick your top 3 churn tables, apply the per-table policy above, and re-run the “late tables” query. If overdue, fix the knobs today—not after the next incident.


Image prompt (for DALL·E/Midjourney)

“A clean, modern Postgres maintenance diagram showing MVCC row versions, dead tuples, and an autovacuum worker cleaning pages. Include a table heap, visibility map, and indexes; depict thresholds as gauges. Minimalistic, high-contrast, 3D isometric style.”

Tags

#PostgreSQL #MVCC #Autovacuum #DatabasePerformance #Vacuum #IndexBloat #OLTP #DataEngineering #DBA #Scalability
