Parallel Execution Deep Dive: DOP math, skew handling, and monitoring PX servers like a pro

Hook: Your query “flies” in dev, crawls in prod, and half the PX slaves are idle while one is melting the CPU. That’s not “Oracle being Oracle.” It’s you missing three things: how DOP is actually calculated, how to kill data skew, and how to monitor PX like a surgeon. Let’s fix that.


Why this matters

  • Parallel Execution (PX) is the biggest performance lever you have for large scans/joins.
  • Get DOP wrong → you thrash CPUs, saturate interconnects, and trigger queuing.
  • Ignore skew → one or two slaves do all the work while the rest sip coffee.
  • Skip monitoring → you’ll tune blind and blame the wrong layer.

Concept & architecture (clear mental model)

The PX cast

  • QC (Query Coordinator): Serial session orchestrating the parallel plan.
  • PX Servers: Worker processes in sets (producer/consumer). A 2-set plan often means DOP workers produce rows → redistribute → DOP workers consume.
  • Granules: Units of work (block/partition granules). Good granularity = balanced work.
  • TQ (Table Queue): The “pipe” between producer and consumer sets. Distribution methods: HASH, RANGE, BROADCAST, RANDOM.

Where bottlenecks occur

  • CPU-bound: DOP > usable CPU → run-queue spikes, regressions.
  • Interconnect-bound (RAC): heavy HASH or BROADCAST across instances.
  • I/O-bound: Too few granules or slow storage → PX servers stall.
  • Skew-bound: One key or partition hoards rows → one server becomes the tail.

DOP math (the part people hand-wave and then suffer)

Think of Effective DOP as the minimum of several caps.

Key caps

  • CPU cap: cpu_count * parallel_threads_per_cpu * instances. Upper bound per cluster; practically lower because other workloads need CPU too.
  • Policy cap: parallel_degree_limit or the AUTO DOP algorithm. Hard or adaptive cap.
  • Server availability: parallel_max_servers and parallel_servers_target. The target enforces queuing once it is reached (AUTO policy).
  • Statement/segment DOP: object or statement DOP, hints, AUTO DOP. The DOP you asked for is only a proposal.
  • Resource Manager: consumer group directives. Can downscale or queue PX.

Effective DOP = min( requested_DOP, policy_cap, available_servers_cap, cpu_cap, resource_manager_cap )
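Before trusting the formula, pull the actual cap inputs from the instance. A minimal sketch, assuming you can read v$parameter:

-- Inputs for the DOP caps (values are per instance; multiply the CPU cap by instance count on RAC)
SELECT name, value
FROM   v$parameter
WHERE  name IN ('cpu_count', 'parallel_threads_per_cpu', 'parallel_degree_policy',
                'parallel_degree_limit', 'parallel_max_servers', 'parallel_servers_target');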

Practical examples

Example 1: single-instance

  • cpu_count=32, parallel_threads_per_cpu=2 → CPU cap ≈ 64
  • parallel_degree_limit=32 → Policy cap = 32
  • parallel_servers_target=80, currently 64 busy → Avail cap ≈ 16
  • Statement requests DOP 64 → Effective DOP = min(64, 32, 16, 64) = 16 (see the availability check below).
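The availability cap is the one nobody checks before running. A minimal sketch for estimating it from the live PX pool, assuming access to v$px_process and v$parameter:

-- Busy PX servers right now vs. the queuing target
SELECT (SELECT COUNT(*) FROM v$px_process WHERE status = 'IN USE')                       AS busy_px_servers,
       (SELECT TO_NUMBER(value) FROM v$parameter WHERE name = 'parallel_servers_target') AS servers_target
FROM   dual;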

Example 2: RAC (2 nodes)

  • Each node cpu_count=24, parallel_threads_per_cpu=2 → cluster cap ≈ 96
  • Policy cap AUTO estimates 48 for the SQL.
  • Target nearly reached; only 20 servers free → Effective DOP = 20

Hints and knobs that actually change results

  • Request DOP:
    /*+ parallel(t 16) parallel(j 16) */ or alter session force parallel query parallel 16;
  • Cap DOP:
    alter session set parallel_degree_limit=16;
  • Adaptive DOP & queuing:
    alter system set parallel_degree_policy=AUTO;
    alter system set parallel_min_time_threshold=10;  -- only parallelize statements estimated to run longer than this (seconds)
    alter system set parallel_servers_target=NNN;     -- start queuing at this level

Brutal truth: If you rely only on hints, Production will still cap/queue you. Know the system caps.


Skew handling like a pro

How to detect skew fast

  • Runtime TQ stats:
    SELECT dfo_number, tq_id, server_type,
           MIN(num_rows) min_rows, MAX(num_rows) max_rows,
           ROUND(MAX(num_rows)/NULLIF(MIN(num_rows),0),2) skew_ratio
    FROM   v$pq_tqstat
    GROUP  BY dfo_number, tq_id, server_type
    ORDER  BY 1,2,3;
    • skew_ratio > 2 is already suspicious; > 5 is pain.
  • SQL Monitor (per-slave rows/bytes):
    SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(sql_id => '&sql_id', type => 'ACTIVE') FROM dual;
  • ASH samples per PX server: if one server owns most of the samples, you’re skewed (see the sketch after this list).
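A minimal ASH sketch for that last check, assuming Diagnostics Pack licensing and that the statement is still in v$active_session_history:

-- Sample count per PX session for one SQL_ID; one lopsided session = skew
SELECT session_id, COUNT(*) AS ash_samples
FROM   v$active_session_history
WHERE  sql_id = '&sql_id'
AND    qc_session_id IS NOT NULL   -- keep only samples tied to a query coordinator (PX work)
GROUP  BY session_id
ORDER  BY ash_samples DESC;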

Typical skew causes → surgical fixes

Cause → symptom → fix:

  • Hot key (e.g., many rows with status='ACTIVE') → one PX server huge, the others tiny. Fix: salting; append a pseudo-random suffix to the key before hash redistribution (see the salted GROUP BY example below).
  • Uneven partitions → one partition >> the others. Fix: re-partition or subpartition the heavy partitions, or use range-hash composite partitioning.
  • Too few granules → slaves go idle early. Fix: increase granularity with block granules instead of partition granules, or split the heavy partitions.
  • Broadcast of a huge table → network or one server set overloaded. Fix: force HASH distribution with the PQ_DISTRIBUTE hint; broadcast only when the smaller input really is small.
  • Correlated predicates → uneven hash buckets. Fix: pre-aggregate by key first, or create histograms so the optimizer knows the real distribution.
  • Top-N GROUP BY where a few groups dominate → one server does most of the aggregation. Fix: two-phase aggregation: partial aggregate per slave → rehash on a salted key → final aggregate.

Hints you’ll actually use

  • Force distribution:
    /*+ pq_distribute(t hash hash) */ — both sides hash-redistribute
    /*+ pq_distribute(t broadcast none) */ — broadcast t to avoid hash
  • Granule control:
    /*+ qb_name(q) px_granule(block) */ or px_granule(partition)
  • Salt for GROUP BY (see the roll-up sketch after this list):
    SELECT salted_key, SUM(val)
    FROM  (SELECT (key || ':' || MOD(ORA_HASH(key), 16)) AS salted_key, val
           FROM   fact)
    GROUP BY salted_key;
    -- Then roll salted_key back up to key if needed
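The roll-up is the part people forget. A minimal sketch of the second phase, assuming the same fact table, the 16-way salt above, and key values that contain no ':':

-- Phase 1 (inner): salted partial aggregate, spreads the hot key across slaves
-- Phase 2 (outer): strip the salt and re-aggregate back to the real key
SELECT SUBSTR(salted_key, 1, INSTR(salted_key, ':') - 1) AS key,
       SUM(partial_sum)                                  AS total_val
FROM  (SELECT salted_key, SUM(val) AS partial_sum
       FROM  (SELECT (key || ':' || MOD(ORA_HASH(key), 16)) AS salted_key, val
              FROM   fact)
       GROUP  BY salted_key)
GROUP  BY SUBSTR(salted_key, 1, INSTR(salted_key, ':') - 1);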

Monitoring PX servers (no guesswork)

Quick health checklist

  1. Are we queued? If statements are stacking up in the parallel statement queue, lower the requested DOP or increase parallel_servers_target (if you own the box).
  2. Is DOP what we think? Compare the requested DOP with what was actually allocated in SQL Monitor (see the query right after this list).
  3. Is work balanced? Inspect v$pq_tqstat min/max rows and bytes.
  4. Is the interconnect hot? In RAC, check GC/Net waits per session and the cross-instance flow in gv$pq_tqstat.
  5. Are we CPU or I/O bound? Look at the DB time split (DB CPU vs cell single block read / direct path read).
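For check #2, a minimal sketch against v$sql_monitor (needs the SQL Monitoring / Tuning Pack license; the px_servers_* columns are the ones to compare):

-- Requested vs. actually allocated PX servers for one statement
SELECT sql_id, status, px_servers_requested, px_servers_allocated
FROM   v$sql_monitor
WHERE  sql_id = '&sql_id'
ORDER  BY sql_exec_start DESC;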

Views and scripts you should keep

Active PX processes

SELECT * FROM v$px_process ORDER BY server_name;
SELECT status, COUNT(*) FROM v$px_process GROUP BY status;

PX sessions (who’s working on what)

SELECT s.sid, s.serial#, q.qcinst_id, q.server_set, q.req_degree, q.degree, s.sql_id
FROM   v$px_session q JOIN v$session s ON q.sid = s.sid
ORDER  BY q.server_set, q.req_degree;

TQ stats (distribution balance)

SELECT dfo_number, tq_id, server_type, MIN(bytes) min_b, MAX(bytes) max_b,
       ROUND(MAX(bytes)/NULLIF(MIN(bytes),0),2) skew_ratio
FROM   v$pq_tqstat
GROUP  BY dfo_number, tq_id, server_type
ORDER  BY 1,2,3;

System saturation (queuing threshold)

SELECT statistic, value FROM v$px_process_sysstat
WHERE  TRIM(statistic) IN ('Servers In Use', 'Servers Highwater', 'Servers Exhausted');
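Queuing is only half of the saturation picture; downgrades are the other half. A quick hedged check using the standard "Parallel operations ..." statistics in v$sysstat:

-- Are statements being downgraded (to a lower DOP or to serial) instead of queued?
SELECT name, value
FROM   v$sysstat
WHERE  name LIKE 'Parallel operations%downgraded%'
ORDER  BY name;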

SQL Monitor link (HTML report)

SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(sql_id => '&sql_id', report_level => 'ALL')
FROM   dual;

Tip: Save “golden” reports for good runs. Diff them against regressions to see what changed (DOP, distribution, waits).


Real example: fixing a 30→4 minute hash join

Symptoms

  • Requested DOP 32, but SQL Monitor shows Effective DOP = 12 (target reached).
  • v$pq_tqstat shows max_bytes / min_bytes ≈ 7.9 on a hash TQ → skew.
  • A single status='ACTIVE' chunk dominates one server.

Fix

  1. Add salting on the join key for the large side (or apply the salt only in a redistribution subquery and roll it back up afterwards):
    SELECT /*+ parallel(32) pq_distribute(bf hash hash) */ ...
    FROM   big_fact bf
    JOIN   dim d
      ON   (bf.key || ':' || MOD(ORA_HASH(bf.key), 16)) = d.salted_key
  2. Increase granules by switching to block granules on big table:
    /*+ px_granule(block) */
  3. Lower requested DOP to 24 to avoid queuing thrash.

Result

  • Skew ratio drops to <1.5.
  • Interconnect bytes smooth.
  • Elapsed time 4 min, CPU stable, no queue.

Best practices (and honest pitfalls)

Best practices

  • Size DOP against usable CPU, not theoretical caps. Leave 20–30% headroom.
  • Prefer HASH redistribution for large joins, BROADCAST only if the smaller input is truly small.
  • Keep granules small enough to oversubscribe (more granules than slaves).
  • Use AUTO DOP with parallel_min_time_threshold so short queries stay serial instead of thrashing the PX pool.
  • Capture SQL Monitor for every major batch path. Build a simple diff workflow.

Pitfalls

  • Forcing DOP 64 everywhere: You’ll hit queuing and make latency worse.
  • Broadcasting a “small” table that isn’t small in prod: Explodes network + memory.
  • Ignoring histograms: Optimizer guesses uniformity, drives skew.
  • RAC unaware design: Heavy hash redistributions across nodes can be slower than single-instance.

Conclusion & takeaways

  • DOP math is deterministic. Compute your effective DOP from caps before you run.
  • Skew is the silent killer. Detect with v$pq_tqstat and fix with salting, better partitioning, and correct distribution.
  • Monitor like a pro. SQL Monitor + PX views will tell you exactly where the time and bytes go.

Call to action: Grab one painful SQL today, run the monitoring queries, compute its real DOP, and test a skew fix. Screenshot the before/after SQL Monitor—your future self will thank you.


Internal link ideas

  • “Designing Keys to Avoid Skew in Hash Joins”
  • “When to Broadcast vs Hash Redistribute in PX”
  • “Adaptive DOP: Tuning parallel_degree_policy the Right Way”
  • “RAC Interconnect Considerations for PX Queries”

Image prompt

“A clean, modern data architecture diagram showing an Oracle-style PX plan: a QC coordinating two PX server sets with a hash redistribution TQ between them, DOP annotated, and skew highlighted on one worker. Minimalist, high-contrast, isometric 3D, labeled arrows for producer→consumer, with a small inset showing DOP caps (CPU/policy/servers).”


Tags

#ParallelExecution #OracleDatabase #DOP #SQLTuning #DataEngineering #Performance #PX #Skew #RAC #QueryOptimization
