Tuning Doris for Spiky Traffic: Nereids knobs, pipeline parallelism, and MV selection heuristics
Meta description (158 chars):
Practical Apache Doris tuning for traffic spikes: Nereids settings, pipeline parallelism, async MV strategies, and workload isolation to keep p95 low and stable.
The spike problem (and why Doris can win)
Traffic spikes are messy: product launches, ad bursts, or “one dashboard to rule them all.” You need sub-second reads while writes keep streaming in. Apache Doris gives you a fighting chance with a cost-based optimizer (Nereids), a pipeline execution engine that saturates cores without thread explosions, and async materialized views (MVs) for transparent query rewrites. Your job: point these tools at the right bottlenecks. (Apache Doris)
What to tune first (order of operations)
- Verify good plans before cranking parallelism. Use
EXPLAIN/Profile; bad plans at 32× parallelism are still bad. Then collect stats.EXPLAIN VERBOSE /* your query */; ANALYZE TABLE fact_orders WITH SAMPLE PERCENT 10; SHOW COLUMN STATS fact_orders;Nereids uses stats;ANALYZEfills them, andEXPLAINshows the plan you’re actually running. (Apache Doris) - Let Doris’ defaults carry you. The pipeline engine is on; Nereids is on by default since 2.0. Don’t toggle them off unless you know why. (Apache Doris)
- Adjust parallelism where it pays. Use query/session hints first, not cluster-wide switches. Default
parallel_pipeline_task_num = 0lets Doris adapt (≈ half cores/BE). (Apache Doris)
Nereids: practical knobs that matter
- Enable the modern optimizer & DML path (on by default):
enable_nereids_planner=TRUE,enable_nereids_dml=TRUE. Use them with current releases. (Apache Doris) - Runtime filters: keep
runtime_filter_mode=GLOBALunless profiling shows RF wait time pressure; then tune RF wait behavior. (Apache Doris) - Stats, always: after heavy backfills or schema shifts, re-
ANALYZEor schedule auto-collection; Nereids’ cost model needs fresh row counts and NDVs to pick joins/MVs well. (Apache Doris)
Pipeline parallelism: fast, safe adjustments
Doris’ pipeline engine scales per BE with controlled threads. Start at defaults; pin overrides to SQL with SET_VAR hints so you don’t penalize other workloads. (Apache Doris)
How to set it (lowest blast radius first)
-- Per-query (recommended during spikes)
SELECT /*+ SET_VAR("parallel_pipeline_task_num=4") */
...
FROM ...
-- Session scope for a batch of reports
SET parallel_pipeline_task_num = 4;
-- As a last resort, global (consider a change window)
SET GLOBAL parallel_pipeline_task_num = 4;
The variable controls the number of worker tasks per fragment on each BE; default 0 means “adaptive” (≈ half the BE cores). In simple point lookups, set =1; in CPU-bound big joins/aggregations, raise carefully (e.g., 8–16) and watch Profile + CPU. Since 2.1, scan parallelism can exceed the number of tablets for Duplicate/Unique-MOW tables. (Apache Doris)
Async MVs: selection heuristics that actually work
Doris rewrites SPJG-shape queries to the best matching MV via cost-based selection. That means you should design a small set of purpose-built MVs that mirror your real query patterns (filters, joins, group-bys). (Apache Doris)
Heuristics for spiky traffic:
- Keep it few and focused. More MVs ≠ faster; each MV consumes refresh resources and increases rewrite search cost. Prune unused MVs regularly. (Apache Doris)
- Accept staleness intentionally. Async MVs are great for T+1 or minute-level freshness; use
grace_periodif you can tolerate brief lag to ensure transparent rewrites hold under churn. (Apache Doris) - Partitioned MVs for time filters. If most queries hit
WHERE event_date >= ..., make the MV partitioned so refresh targets only hot partitions. (Apache Doris) - Layering works. Build a first-layer MV (fact enrichment) and a second-layer MV (heavy group-bys). Use trigger/periodic refresh combos to smooth load. (Apache Doris)
- DML rewrite too. When strong consistency holds, Doris can rewrite
INSERT INTO ... SELECT ...to target MVs for faster ETL steps. (Apache Doris)
Example (periodic refresh to survive spikes):
CREATE MATERIALIZED VIEW mv_orders_hourly
BUILD IMMEDIATE
REFRESH AUTO ON SCHEDULE EVERY 15 MINUTE
DISTRIBUTED BY RANDOM BUCKETS 16
AS
SELECT date_trunc('hour', order_ts) AS hr,
region_id,
SUM(amount) AS revenue,
COUNT(DISTINCT user_id) AS buyers
FROM fact_orders
GROUP BY date_trunc('hour', order_ts), region_id;
This lets BI drains read from mv_orders_hourly during peaks while base tables absorb writes calmly. Doris chooses the MV transparently when it matches. (Apache Doris)
Workload isolation: stop noisy neighbors during bursts
When ad-driven dashboards collide with backfills, isolate them:
- Workload Groups (in-process CPU/memory limits per BE) — set CPU shares, hard/soft memory caps, and queue/concurrency policies.
CREATE WORKLOAD GROUP hot_dashboards PROPERTIES ( "cpu_share"="30", "memory_limit"="25%", "enable_memory_overcommit"="true" );Bind BI users/services here; shove ETL into a separate group with lower shares during peak hours. (Apache Doris) - Concurrency & queuing — cap in-flight queries and queue excess when spikes hit to avoid OOMs. (Apache Doris)
- Resource/Compute Groups — if you operate multi-tenant or have distinct SLOs, tag BEs into groups for stronger isolation. (Apache Doris)
- Circuit breakers / SQL blocking — cancel outliers (e.g., >1s) or block pathological patterns during an incident.
-- Example: cancel queries running > 1000ms (policy) CREATE WORKLOAD POLICY cancel_long PROPERTIES("type"="query_timeout","value"="1000","enable"="true");Use temporary policies to protect the cluster when someone drags an accidental cross join into prod. (Apache Doris)
Quick tactics by symptom
| Symptom in spike | Fastest lever |
|---|---|
| CPU pegged, p95 rising | Lower parallel_pipeline_task_num globally a notch; move BI to a Workload Group with CPU share caps. Roll back later. (Apache Doris) |
| Many repeated dashboard queries | Turn on/lean on SQL Cache for stable reports; validate data freshness expectations. (Apache Doris) |
| Join waits / filter stalls | Check RF wait times; reduce RF wait for hot paths if necessary. (Apache Doris) |
| Scan concurrency too low | Raise query-level parallelism (8–16), verify CPU utilization and no thread contention. (Apache Doris) |
| Writes fighting reads | Shift heavy group-bys to async MVs with scheduled refresh; throttle ETL in a separate Workload Group. (Apache Doris) |
Example spike-ready setup (end-to-end)
- Stats + plan sanity
ANALYZE TABLE fact_orders WITH SAMPLE PERCENT 5;
EXPLAIN /* high-traffic dashboard query */;
- Pin low parallelism to hot endpoints
SELECT /*+ SET_VAR("parallel_pipeline_task_num=2") */ ...;
- Front-load the dashboard with an MV
CREATE MATERIALIZED VIEW mv_dash_preagg
BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 10 MINUTE
AS SELECT ... GROUP BY ...;
- Protect the cluster with a Workload Group
CREATE WORKLOAD GROUP if not exists dashboards
PROPERTIES ("cpu_share"="40","memory_limit"="30%");
Gotchas to avoid
- Creating a zoo of MVs “just in case.” Keep them minimal; each adds refresh + rewrite overhead. (Apache Doris)
- Global parallelism changes without a rollback plan. Prefer per-query/session overrides; keep a runbook for restoring defaults. (Apache Doris)
- Ignoring tablet/parallelism semantics. Newer Doris decouples scan concurrency from tablet count for some models—know which ones you use. (Apache Doris)
- Skipping stats. Nereids without stats is like a GPS with no satellites.
ANALYZE, then re-run the query. (Apache Doris)
Summary & CTA
When spikes hit, stability beats heroics. Trust Doris’ defaults (Nereids + pipeline), shape the workload with per-query SET_VAR and Workload Groups, and flatten resource peaks with small, well-chosen async MVs. Start with one dashboard: add an MV, cap its parallelism to 2–4, and place it in a Workload Group. Measure p95 and CPU before/after. If it moves the needle, templatize it across your top 5 reports.
Internal link ideas (official docs)
- Common tuning parameters (Nereids, runtime filters, pipeline parallelism). (Apache Doris)
- Parallelism tuning (SQL/session/global; examples; decoupled scans). (Apache Doris)
- Pipeline execution engine (why thread counts stay sane). (Apache Doris)
- Async MV overview (SPJG rewriting, external tables). (Apache Doris)
- Async MV best practices (grace period, layering, triggers). (Apache Doris)
- SQL Cache (when to cache repeated queries). (Apache Doris)
- Workload Groups & concurrency control (CPU/memory shares, queuing). (Apache Doris)
- Collecting statistics (
ANALYZE, show stats). (Apache Doris)
Image prompt
“A clean, modern architecture diagram of an Apache Doris cluster during a traffic spike: FE planners with Nereids, BE pipeline executors with adjustable parallelism, two Workload Groups (dashboards vs ETL), and an asynchronous MV layer absorbing reads — minimalistic, high contrast, isometric 3D style.”







Leave a Reply