From Heap to Healthy: Vacuum, Bloat, and MVCC Explained for Humans
Practical autovacuum tuning with measurable thresholds
Introduction — why you should care
If your Postgres is “mysteriously” slowing down, growing without bound, or choking on simple queries, it’s probably not cursed. It’s bloat. Under MVCC, deleted/updated rows hang around as dead tuples until VACUUM reclaims them. Ignore this and your heap swells, indexes rot, and cache hit ratios tank. The fix isn’t magic—it’s disciplined autovacuum tuning with hard numbers.
This guide cuts the fluff. You’ll learn how MVCC creates bloat, why VACUUM isn’t optional, and exactly when to vacuum—using thresholds you can measure and justify.
MVCC in 90 seconds (for real)
- MVCC: Every update creates a new row version; old versions stick around to preserve readers’ snapshots.
- Dead tuples: Old row versions that no active transaction can see. They waste heap space and mislead indexes.
- Visibility map (VM): Tracks pages whose tuples are all visible to every transaction. Helps index-only scans and lets VACUUM skip already-clean pages.
- Free space map (FSM): Where Postgres remembers reusable space.
Key truth: VACUUM doesn’t compact files—it marks space reusable. File shrinkage requires VACUUM FULL (a rewrite) or CLUSTER.
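You can watch this happen. A minimal sketch with a throwaway table (mvcc_demo is an illustrative name, safe to drop afterwards):
-- Update one row and observe that the old version still occupies heap space.
CREATE TABLE mvcc_demo (id int PRIMARY KEY, val text);
INSERT INTO mvcc_demo VALUES (1, 'original');
SELECT ctid, xmin, xmax, val FROM mvcc_demo;           -- note the physical location (ctid)
UPDATE mvcc_demo SET val = 'updated' WHERE id = 1;
SELECT ctid, xmin, xmax, val FROM mvcc_demo;           -- new ctid: the old version is now a dead tuple
SELECT n_dead_tup FROM pg_stat_user_tables
WHERE relname = 'mvcc_demo';                           -- ticks up once the stats system catches up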
The three tools: which one, when?
| Action | What it does | Pros | Cons | Use when |
|---|---|---|---|---|
| VACUUM | Cleans dead tuples, updates VM/FSM | Online, fast | Doesn't shrink the file | Routine maintenance |
| VACUUM (VERBOSE, ANALYZE) | Clean + refresh stats | Keeps planner smart | Slightly more work | After large churn |
| VACUUM FULL | Rewrites table, compacts file | Frees disk | Exclusive lock; slow | One-off after massive deletes |
| REINDEX | Rebuilds bloated indexes | Restores index size | Locks each index (unless CONCURRENTLY) | High-churn, hot-update tables |
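To convince yourself that plain VACUUM leaves the file size alone, a rough sketch reusing the throwaway mvcc_demo table from above (any table with a large, scattered delete behaves the same way):
INSERT INTO mvcc_demo SELECT g, repeat('x', 100) FROM generate_series(2, 100000) g;
DELETE FROM mvcc_demo WHERE id % 10 <> 0;               -- drop ~90%, keeping live rows on every page
VACUUM mvcc_demo;
SELECT pg_size_pretty(pg_relation_size('mvcc_demo'));   -- space is reusable, but the file barely shrinks
VACUUM FULL mvcc_demo;                                   -- exclusive lock; rewrites the table
SELECT pg_size_pretty(pg_relation_size('mvcc_demo'));   -- now roughly a tenth of the size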
How autovacuum actually triggers (the math)
Autovacuum fires when dead tuples exceed:
vacuum threshold = autovacuum_vacuum_threshold
+ autovacuum_vacuum_scale_factor * reltuples
Analyze fires when row churn exceeds:
analyze threshold = autovacuum_analyze_threshold
+ autovacuum_analyze_scale_factor * reltuples
Defaults (often too lax for busy OLTP):
- autovacuum_vacuum_threshold = 50
- autovacuum_vacuum_scale_factor = 0.2 (20% of the table)
- autovacuum_analyze_threshold = 50
- autovacuum_analyze_scale_factor = 0.1 (10% of the table)
Blunt read: Waiting for 20% of a 50M-row table to die means 10M dead tuples. That’s negligence.
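Before assuming the defaults apply, check what your instance is actually running with:
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_threshold',
               'autovacuum_vacuum_scale_factor',
               'autovacuum_analyze_threshold',
               'autovacuum_analyze_scale_factor');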
Production-grade thresholds (copy/paste + explain)
Use smaller scale factors for hot tables and larger ones for cold, append-only sets.
Rule of thumb by table size (OLTP):
- Small (<5M rows): scale_factor = 0.05, threshold = 5,000
- Medium (5–50M rows): scale_factor = 0.02, threshold = 20,000
- Large (>50M rows): scale_factor = 0.005, threshold = 50,000
Per-table settings (storage parameters):
ALTER TABLE public.orders
SET (autovacuum_vacuum_scale_factor = 0.02, -- 2% of rows
autovacuum_vacuum_threshold = 20000, -- + 20k floor
autovacuum_analyze_scale_factor = 0.01, -- 1% churn
autovacuum_analyze_threshold = 10000, -- + 10k floor
autovacuum_vacuum_cost_limit = 2000, -- be less throttled
autovacuum_vacuum_cost_delay = 2); -- ms; tune to reduce IO spikes
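To confirm the storage parameters took effect (same example orders table):
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'orders';                -- per-table overrides live in pg_class.reloptions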
When to go even lower:
- Tables updated >10% daily
- Heavy index-only scans (protect the VM)
- High delete bursts (TTL/soft-deletes)
Quick health checks you can run today
1) Where autovacuum is falling behind
SELECT
  relname,
  now() - last_autovacuum AS since_vacuum,
  n_dead_tup,
  n_live_tup,  -- live-tuple estimate (pg_stat_user_tables has no reltuples column)
  pg_size_pretty(pg_table_size(relid)) AS tbl_size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
Interpretation: If n_dead_tup is already above your computed threshold (formula above), it’s late.
2) Estimated table bloat (no extensions)
SELECT
  relname,
  pg_size_pretty(pg_table_size(relid)) AS table_size,
  n_live_tup,
  n_dead_tup,
  round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 1) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
Not perfect (these are statistics-system estimates, not exact measurements), but directionally useful. For accuracy, use pgstattuple.
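If you can install contrib extensions, pgstattuple reports exact dead space; the exact variant scans the whole table, so prefer the sampled version on big relations or run it off-peak:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('public.orders');          -- exact dead_tuple_percent, free_percent, etc.
SELECT * FROM pgstattuple_approx('public.orders');   -- cheaper, sampled estimate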
3) Index bloat suspects
SELECT
schemaname, relname AS table_name, indexrelname AS index_name,
idx_scan, idx_tup_read, idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;
If an index is huge and idx_scan is near zero, rethink it; if it’s hot with many updates, schedule REINDEX CONCURRENTLY.
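The same pgstattuple extension includes pgstatindex for B-tree indexes; as a rough rule of thumb, a busy index whose avg_leaf_density sits well below ~70% is a reindex candidate (index name reused from the example later in this post):
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('public.orders_created_at_idx');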
Worked example (numbers you can sanity-check)
Say orders has 40M rows (reltuples = 40,000,000). With defaults, vacuum waits for 50 + 0.2 * 40M = 8,000,050 dead tuples. Bad.
Apply medium policy:
- scale = 0.02, threshold = 20,000
- Vacuum fires at 20,000 + 0.02 * 40M = 820,000 dead tuples
That’s a ~10× earlier cleanup, which:
- Shrinks heap churn, keeps VM accurate
- Preserves cache locality
- Prevents runaway index bloat
You’ll see steadier latency and fewer autovacuum panics.
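A one-liner to sanity-check the arithmetic in psql:
SELECT (50 + 0.2 * 40000000)::bigint     AS default_trigger,   -- 8,000,050
       (20000 + 0.02 * 40000000)::bigint AS tuned_trigger,     -- 820,000
       round((50 + 0.2 * 40000000) / (20000 + 0.02 * 40000000), 1) AS times_earlier;  -- ~9.8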
Architecture & scheduling that actually works
Global knobs (don’t starve the cleaner)
# postgresql.conf
autovacuum = on
autovacuum_vacuum_cost_limit = 2000 # raise if IO budget allows
autovacuum_vacuum_cost_delay = 2ms # reduce if falling behind
autovacuum_naptime = 10s # scan more often (default 60s)
maintenance_work_mem = 2GB # fast vacuum & reindex (per worker)
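If you'd rather not edit postgresql.conf by hand, the same knobs can be set with ALTER SYSTEM; the settings above take effect on a reload (autovacuum_max_workers, mentioned below, needs a restart instead):
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
ALTER SYSTEM SET autovacuum_naptime = '10s';
ALTER SYSTEM SET maintenance_work_mem = '2GB';
SELECT pg_reload_conf();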
Concurrency
- Increase autovacuum_max_workers to match CPU/IO headroom (e.g., 6–10 on beefy boxes). Remember the cost limit is a shared budget: more workers without a higher autovacuum_vacuum_cost_limit just split the same IO allowance.
- If workers fight with OLTP, schedule heavier work off-peak and keep routine vacuum cheap and frequent.
Avoiding wraparound disasters
Always leave vacuum_freeze_min_age and vacuum_freeze_table_age at sane defaults unless you understand the trade-offs. If autovacuum warns about wraparound, stop everything and vacuum—this is not optional.
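A quick check of how close each database is (autovacuum forces anti-wraparound work once the age passes autovacuum_freeze_max_age, 200 million by default):
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;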
Best practices (short, sharp)
- Lower scale factors on hot tables. Start 0.01–0.02; measure and iterate.
- Pinpoint per-table tuning. Don’t over-tighten globally; cost lives on hot paths.
- Protect the VM. Frequent vacuum keeps index-only scans fast.
- Control deletes. Batch large deletes; prefer partitioning and dropping old partitions over row-by-row purges (see the sketch after this list).
- Measure, don't guess. Track n_dead_tup, last_autovacuum, sizes, and IO.
- Reindex periodically, especially wide or heavily updated indexes. Use REINDEX CONCURRENTLY.
- Use partitions for TTL data. Vacuum won't save you from abusive churn patterns.
- After massive churn, run VACUUM (ANALYZE) immediately to fix planner stats.
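A minimal sketch of the partition-drop pattern from the "control deletes" bullet above; table and partition names are illustrative and assume declarative range partitioning by month:
ALTER TABLE events DETACH PARTITION events_2023_01;   -- add CONCURRENTLY on PG 14+ to avoid blocking
DROP TABLE events_2023_01;                             -- space released immediately, zero vacuum debt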
Common pitfalls (seen in the wild)
- Default 0.2 scale factor on big tables. It’s a time bomb.
- Assuming VACUUM shrinks files. It doesn't. Plan occasional VACUUM FULL or partition maintenance.
- Starving autovacuum. Too few workers or a low cost limit means a backlog.
- Ignoring indexes. Table looks fine; indexes are 3× the heap. Reindex them.
- Running VACUUM FULL at noon. It takes an exclusive lock; do it off-hours and communicate.
Minimal scripts you’ll actually use
1) Compute table-specific thresholds and show “late” tables
WITH stats AS (
  SELECT s.relid, s.relname,
         greatest(c.reltuples, 0)::bigint AS reltuples,   -- reltuples can be -1 before the first VACUUM/ANALYZE
         s.n_dead_tup
  FROM pg_stat_user_tables s
  JOIN pg_class c ON c.oid = s.relid                      -- reltuples lives in pg_class, not pg_stat_user_tables
),
params AS (
  SELECT -- global settings only; per-table reloptions, when set, override these for the real trigger
    current_setting('autovacuum_vacuum_threshold')::int      AS base_vac,
    current_setting('autovacuum_vacuum_scale_factor')::float AS scale_vac
)
SELECT s.relname,
s.reltuples,
s.n_dead_tup,
(p.base_vac + p.scale_vac * s.reltuples)::bigint AS vac_trigger,
(s.n_dead_tup - (p.base_vac + p.scale_vac * s.reltuples))::bigint AS overdue_by
FROM stats s CROSS JOIN params p
WHERE s.n_dead_tup > (p.base_vac + p.scale_vac * s.reltuples)
ORDER BY overdue_by DESC
LIMIT 20;
2) One-off: aggressively clean a hot table (online)
VACUUM (VERBOSE, ANALYZE) public.orders;
-- If indexes look inflated:
REINDEX INDEX CONCURRENTLY public.orders_created_at_idx;
3) Post-delete compaction (use sparingly; locks)
VACUUM FULL public.event_log;
ANALYZE public.event_log;
Tuning checklist you can run this week
- Identify the top 10 churn tables (n_dead_tup and write TPS).
- Apply per-table scale factors (0.5–2% typical) and thresholds (10–50k).
- Raise autovacuum_max_workers and autovacuum_vacuum_cost_limit.
- Add dashboards for n_dead_tup, last_autovacuum, and table/index sizes.
- Partition high-delete/TTL tables; replace deletes with dropping old partitions.
- Schedule periodic REINDEX CONCURRENTLY for hot indexes.
- Document the policy and enforce it in migrations (DDL scripts with reloptions).
Internal link ideas
- “Postgres Partitioning for TTL and Cost Control”
- “Designing Indexes for High-Churn OLTP Tables”
- “When to Use REINDEX CONCURRENTLY vs VACUUM FULL”
- “MVCC Deep Dive: Snapshots, Free Space Map, and Visibility Map”
- “Autovacuum Dashboards: What to Alert On (and Why)”
Conclusion + Takeaways
MVCC is your friend—if you vacuum early and often. Don’t let default thresholds decide your latency. Put numbers on the wall: compute triggers, monitor dead tuples, tune per table, and keep index bloat in check. Your heap stays healthy, queries stay boringly fast, and growth stays predictable.
Call to action: pick your top 3 churn tables, apply the per-table policy above, and re-run the “late tables” query. If overdue, fix the knobs today—not after the next incident.
Image prompt (for DALL·E/Midjourney)
“A clean, modern Postgres maintenance diagram showing MVCC row versions, dead tuples, and an autovacuum worker cleaning pages. Include a table heap, visibility map, and indexes; depict thresholds as gauges. Minimalistic, high-contrast, 3D isometric style.”
Tags
#PostgreSQL #MVCC #Autovacuum #DatabasePerformance #Vacuum #IndexBloat #OLTP #DataEngineering #DBA #Scalability







