Parallel Query and work_mem: Making Postgres Use All Those Cores

When your dashboard query takes 18 seconds on a 16-core box, you’re not CPU-bound—you’re under-tuned. Postgres can parallelize scans, joins, and aggregates, but it won’t do it aggressively unless you set the right knobs and give operators enough memory. This guide shows the exact GUCs to change, what they do, and before/after EXPLAIN (ANALYZE, BUFFERS) plans so you can copy/paste and see the difference today.


Why this matters

  • Modern VMs have 8–64 vCPUs; a single serial plan wastes most of them.
  • Under-sized work_mem forces disk sorts/hash spills and kills parallel speedups.
  • Parallelism only kicks in if the planner thinks it’s worth the setup cost. Your settings teach it to be bolder.

The core concepts (fast and clear)

1) Parallel query basics

Postgres uses a leader process plus parallel workers. Operators that can go parallel include:

  • Parallel Seq Scan
  • Parallel Bitmap Heap Scan
  • Parallel Hash Join / partial (per-worker) aggregation
  • Gather / Gather Merge to recombine results

Parallelism triggers when:

  • The relation is big enough (thresholds below).
  • The planner’s parallel costs justify worker startup.
  • The query uses only parallel-safe functions (see the quick check below).
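
A quick way to check the functions your query calls (date_trunc and now here stand in for whatever yours uses):

-- 's' = parallel safe, 'r' = restricted (leader only), 'u' = unsafe (forces a serial plan)
SELECT proname, proparallel
FROM pg_proc
WHERE proname IN ('date_trunc', 'now');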

2) work_mem is per operator, per worker

work_mem isn’t a global budget. Each sort or hash node in a plan gets its own work area, and each parallel worker gets its own copy, so a single query can allocate many multiples of work_mem. If you set work_mem = 256MB and the plan runs six hash joins across several workers, you can exhaust RAM fast. Balance is everything.
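
To make that concrete, here’s a back-of-the-envelope worst case you can compute in psql; six work areas and four workers are illustrative numbers, not something Postgres reports:

-- 6 sort/hash work areas × (1 leader + 4 workers) × 256MB
SELECT pg_size_pretty(6 * (1 + 4) * pg_size_bytes('256MB')) AS worst_case_per_query;
-- ≈ 7.5 GB for a single query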


The practical knobs (what to actually set)

| Knob | What it does | Typical starting point* | Notes |
|---|---|---|---|
| max_parallel_workers | Global pool of parallel workers | min(#cores, 2 × #cores_logical / 3) | Don’t exceed core count. |
| max_parallel_workers_per_gather | Per-operator worker cap | 4–8 | Bigger tables ⇒ more workers help, up to memory/I/O limits. |
| parallel_leader_participation | Let the leader do work too | on | Usually a free ~10–20% boost. |
| min_parallel_table_scan_size | Minimum table size to consider a parallel scan | 32MB–128MB | Lower it to encourage parallel scans on mid-sized tables. |
| min_parallel_index_scan_size | Same, for index scans | 16MB–64MB | Similar idea. |
| parallel_setup_cost | Cost penalty to start parallel workers | 100–500 | Reduce to be more eager. |
| parallel_tuple_cost | Per-tuple cost in parallel paths | 0.01–0.03 | Reduce slightly to tip plans toward parallel. |
| work_mem | Memory per sort/hash operator, per worker | 32–128MB | Enough to avoid spills; see the sizing math below. |

*Use these as starting points, then measure. Defaults vary by version and distro; always SHOW your current values.
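
One way to dump them all at once, including where each value came from (default, config file, session, etc.):

SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('max_parallel_workers', 'max_parallel_workers_per_gather',
               'parallel_leader_participation', 'min_parallel_table_scan_size',
               'min_parallel_index_scan_size', 'parallel_setup_cost',
               'parallel_tuple_cost', 'work_mem');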

Session-only testing (safe):

-- In psql or your session:
SET max_parallel_workers = 16;
SET max_parallel_workers_per_gather = 8;
SET parallel_leader_participation = on;

SET min_parallel_table_scan_size = '64MB';
SET min_parallel_index_scan_size = '32MB';

SET parallel_setup_cost = 100;
SET parallel_tuple_cost = 0.01;

SET work_mem = '64MB';
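
If the settings prove out for your reporting workload, you can scope them to a role instead of the whole cluster; reporting_user here is a placeholder name:

ALTER ROLE reporting_user SET work_mem = '128MB';
ALTER ROLE reporting_user SET max_parallel_workers_per_gather = 8;
-- takes effect for the role's new sessions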

Before/After: real plan patterns you should expect

We’ll use a realistic rollup over a big fact table:

-- 200M-row events, daily rollup last 30 days
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('day', created_at) AS day,
       count(*) AS n
FROM events
WHERE created_at >= now() - interval '30 days'
GROUP BY 1
ORDER BY 1;

Before (serial, spilling)

You’ll often see:

Sort  (Sort Method: external merge  Disk: ...kB)
  ->  HashAggregate  (Batches: 16  Memory Usage: ...kB  Disk Usage: ...kB)
        ->  Seq Scan on events  (rows=30000000)
Planning Time: ...
Execution Time: 18.2 s

Red flags:

  • Seq Scan, not a Parallel Seq Scan (see the size check below).
  • Sort and HashAggregate spilling to disk ⇒ work_mem too small.
  • No Gather anywhere.
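
If you see a serial Seq Scan on a table you expected to go parallel, first check that it actually clears the scan-size threshold:

SELECT pg_size_pretty(pg_relation_size('events')) AS heap_size,
       current_setting('min_parallel_table_scan_size') AS threshold;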

After (parallel hash aggregate, no spills)

With the session settings above:

Finalize GroupAggregate
  ->  Gather Merge  (Workers Planned: 8, Workers Launched: 8)
        ->  Sort
              ->  Partial HashAggregate
                    ->  Parallel Seq Scan on events
Planning Time: ...
Execution Time: 4.1 s
Buffers: shared hit=..., read=...

Green flags:

  • Parallel Seq Scan feeding 8 workers.
  • Partial HashAggregate per worker stays in memory (no disk).
  • Gather Merge combines the pre-sorted worker output efficiently.

Expect 2–6× speedups for scan/aggregate heavy queries on healthy storage. Diminishing returns beyond 8–12 workers are normal due to memory, I/O, and gather overhead.
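
To find where returns diminish on your own hardware, rerun the rollup at a couple of worker caps and compare the timings (2 here is an arbitrary pick):

SET max_parallel_workers_per_gather = 2;
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('day', created_at) AS day, count(*) AS n
FROM events
WHERE created_at >= now() - interval '30 days'
GROUP BY 1
ORDER BY 1;
-- repeat with 4, 8, 12 and compare Execution Time and Workers Launched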


Another common win: star-schema join

EXPLAIN (ANALYZE, BUFFERS)
SELECT u.country, count(*) AS purchases
FROM fact_purchases p
JOIN dim_users u ON p.user_id = u.id
WHERE p.created_at >= date_trunc('month', now())
GROUP BY u.country;

Bad pattern:

HashAggregate
  ->  Hash Join
        ->  Seq Scan on fact_purchases
        ->  Hash  (Batches: 8  Disk Usage: ...kB)
              ->  Seq Scan on dim_users
Execution Time: 12.7 s

Good pattern after tuning:

Finalize HashAggregate
  ->  Gather
        ->  Partial HashAggregate
              ->  Parallel Hash Join
                    ->  Parallel Seq Scan on fact_purchases
                    ->  Parallel Hash  (dim_users fits in memory)
                          ->  Parallel Seq Scan on dim_users
Execution Time: 3.9 s

What changed? A larger work_mem let the hash table on dim_users stay in memory, and the planner chose a Parallel Hash Join, where the workers build one shared hash table cooperatively. The Gather overhead is tiny compared to the savings.
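
A quick pre-check before tuning is to compare the dimension table’s size against work_mem; keep in mind a Parallel Hash can pool roughly the combined work_mem of its participants for the shared build:

SELECT pg_size_pretty(pg_relation_size('dim_users')) AS heap_size,
       current_setting('work_mem') AS work_mem;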


Sizing work_mem without blowing up RAM

Rule of thumb:

Peak RAM ≈ concurrent_active_queries
         × parallel_workers_per_query
         × parallel_work_areas_per_worker
         × work_mem
  • concurrent_active_queries: not connections; think queries doing heavy ops at once.
  • parallel_work_areas_per_worker: usually 1–3 (hash join + aggregate + sort).
  • Start with work_mem = 32–64MB for mixed workloads; go 128MB for batch/reporting windows.
  • Use pg_stat_statements + logs to confirm spills:
SET log_temp_files = 0;        -- log any temp file usage
-- run workload
-- check logs for "temporary file" lines (spills)

If you see frequent temp files, bump work_mem slowly (2× at a time) and re-measure.
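
Besides the logs, the cumulative counters in pg_stat_database give a quick trend check between tuning rounds (they reset only when database statistics are reset):

SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_bytes
FROM pg_stat_database
WHERE datname = current_database();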


Quick win checklist (copy/paste)

  1. Discover current reality:
     SHOW max_parallel_workers;
     SHOW max_parallel_workers_per_gather;
     SHOW parallel_leader_participation;
     SHOW min_parallel_table_scan_size;
     SHOW min_parallel_index_scan_size;
     SHOW parallel_setup_cost;
     SHOW parallel_tuple_cost;
     SHOW work_mem;
  2. Encourage parallelism:
     SET max_parallel_workers = 16;               -- match-ish to CPU
     SET max_parallel_workers_per_gather = 8;
     SET parallel_leader_participation = on;
     SET min_parallel_table_scan_size = '64MB';
     SET min_parallel_index_scan_size = '32MB';
     SET parallel_setup_cost = 100;
     SET parallel_tuple_cost = 0.01;
  3. Give operators room:
     SET work_mem = '64MB';
  4. Re-plan and measure
    • EXPLAIN (ANALYZE, BUFFERS) on your top N slow queries.
    • Look for Parallel operators and no temp disk notes.
    • Track Execution Time, Rows, and Buffers deltas.
  5. Promote to postgresql.conf once stable (or apply it with ALTER SYSTEM; sketch below).
    • Reconsider values for production concurrency.
    • Monitor with pg_stat_bgwriter, pg_stat_statements, and system RAM.
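
If you prefer SQL over hand-editing postgresql.conf, ALTER SYSTEM writes the same values to postgresql.auto.conf; all of the GUCs above reload without a restart:

ALTER SYSTEM SET max_parallel_workers_per_gather = 8;
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();   -- apply without restarting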

Best practices & common pitfalls

Best practices

  • Target the top queries. Don’t global-tune blindly. Fix the 80/20.
  • Partition large facts by time to keep working sets warm and parallelizable.
  • Indexes still matter. Parallel scans don’t replace proper index strategies.
  • Use Gather Merge wisely. Sorting per worker + merge can beat sorting after gather.

Pitfalls

  • RAM blowups from aggressive work_mem × many workers × many queries.
  • Tiny tables won’t go parallel unless you lower the min_parallel_* thresholds.
  • Parallel-unsafe functions force serial plans; user-defined functions (including PL/pgSQL) default to PARALLEL UNSAFE unless declared PARALLEL SAFE.
  • I/O bottlenecks. If storage is slow, more workers can dogpile and thrash.

Internal link ideas (for your site)

  • “Understanding Hash Joins and When They Spill”
  • “Effective autovacuum for Large Fact Tables”
  • “Partitioning Strategies for Time-Series in Postgres”
  • “From OLTP to Analytics: When to Add a Columnar Sidecar (ClickHouse/duckdb)”

Conclusion & takeaways

Parallel query in Postgres isn’t magic—it’s a negotiation between planner costs, object sizes, and memory. Nudge the planner with realistic parallel costs, cap worker counts sensibly, and size work_mem so hash/sort operators stay in RAM. Measure with EXPLAIN (ANALYZE, BUFFERS). Expect 2–6× gains on scan/aggregate heavy queries when you get it right.

Your immediate next step: test the session settings above on one heavy report query, verify Parallel operators appear, and confirm no temp files. Then tune upward, one notch at a time.


Image prompt

“A clean, modern Postgres architecture diagram showing a leader process coordinating 8 parallel workers scanning a large table, parallel hash aggregate, and a Gather Merge node—minimalistic, high-contrast, 3D isometric style with labeled operators.”

Tags

#Postgres #ParallelQuery #work_mem #DatabasePerformance #QueryTuning #EXPLAINANALYZE #DataEngineering #SQL #Scaling #PostgreSQL
