Tuning Doris for Spiky Traffic

Tuning Doris for Spiky Traffic: Nereids knobs, pipeline parallelism, and MV selection heuristics

Meta description (158 chars):
Practical Apache Doris tuning for traffic spikes: Nereids settings, pipeline parallelism, async MV strategies, and workload isolation to keep p95 low and stable.


The spike problem (and why Doris can win)

Traffic spikes are messy: product launches, ad bursts, or “one dashboard to rule them all.” You need sub-second reads while writes keep streaming in. Apache Doris gives you a fighting chance with a cost-based optimizer (Nereids), a pipeline execution engine that saturates cores without thread explosions, and async materialized views (MVs) for transparent query rewrites. Your job: point these tools at the right bottlenecks. (Apache Doris)


What to tune first (order of operations)

  1. Verify good plans before cranking parallelism. Use EXPLAIN/Profile; bad plans at 32× parallelism are still bad. Then collect stats. EXPLAIN VERBOSE /* your query */; ANALYZE TABLE fact_orders WITH SAMPLE PERCENT 10; SHOW COLUMN STATS fact_orders; Nereids uses stats; ANALYZE fills them, and EXPLAIN shows the plan you’re actually running. (Apache Doris)
  2. Let Doris’ defaults carry you. The pipeline engine is on; Nereids is on by default since 2.0. Don’t toggle them off unless you know why. (Apache Doris)
  3. Adjust parallelism where it pays. Use query/session hints first, not cluster-wide switches. Default parallel_pipeline_task_num = 0 lets Doris adapt (≈ half cores/BE). (Apache Doris)

Nereids: practical knobs that matter

  • Enable the modern optimizer & DML path (on by default):
    enable_nereids_planner=TRUE, enable_nereids_dml=TRUE. Use them with current releases. (Apache Doris)
  • Runtime filters: keep runtime_filter_mode=GLOBAL unless profiling shows RF wait time pressure; then tune RF wait behavior. (Apache Doris)
  • Stats, always: after heavy backfills or schema shifts, re-ANALYZE or schedule auto-collection; Nereids’ cost model needs fresh row counts and NDVs to pick joins/MVs well. (Apache Doris)

Pipeline parallelism: fast, safe adjustments

Doris’ pipeline engine scales per BE with controlled threads. Start at defaults; pin overrides to SQL with SET_VAR hints so you don’t penalize other workloads. (Apache Doris)

How to set it (lowest blast radius first)

-- Per-query (recommended during spikes)
SELECT /*+ SET_VAR("parallel_pipeline_task_num=4") */
  ...
FROM ...

-- Session scope for a batch of reports
SET parallel_pipeline_task_num = 4;

-- As a last resort, global (consider a change window)
SET GLOBAL parallel_pipeline_task_num = 4;

The variable controls the number of worker tasks per fragment on each BE; default 0 means “adaptive” (≈ half the BE cores). In simple point lookups, set =1; in CPU-bound big joins/aggregations, raise carefully (e.g., 8–16) and watch Profile + CPU. Since 2.1, scan parallelism can exceed the number of tablets for Duplicate/Unique-MOW tables. (Apache Doris)


Async MVs: selection heuristics that actually work

Doris rewrites SPJG-shape queries to the best matching MV via cost-based selection. That means you should design a small set of purpose-built MVs that mirror your real query patterns (filters, joins, group-bys). (Apache Doris)

Heuristics for spiky traffic:

  • Keep it few and focused. More MVs ≠ faster; each MV consumes refresh resources and increases rewrite search cost. Prune unused MVs regularly. (Apache Doris)
  • Accept staleness intentionally. Async MVs are great for T+1 or minute-level freshness; use grace_period if you can tolerate brief lag to ensure transparent rewrites hold under churn. (Apache Doris)
  • Partitioned MVs for time filters. If most queries hit WHERE event_date >= ..., make the MV partitioned so refresh targets only hot partitions. (Apache Doris)
  • Layering works. Build a first-layer MV (fact enrichment) and a second-layer MV (heavy group-bys). Use trigger/periodic refresh combos to smooth load. (Apache Doris)
  • DML rewrite too. When strong consistency holds, Doris can rewrite INSERT INTO ... SELECT ... to target MVs for faster ETL steps. (Apache Doris)

Example (periodic refresh to survive spikes):

CREATE MATERIALIZED VIEW mv_orders_hourly
BUILD IMMEDIATE
REFRESH AUTO ON SCHEDULE EVERY 15 MINUTE
DISTRIBUTED BY RANDOM BUCKETS 16
AS
SELECT date_trunc('hour', order_ts) AS hr,
       region_id,
       SUM(amount) AS revenue,
       COUNT(DISTINCT user_id) AS buyers
FROM fact_orders
GROUP BY date_trunc('hour', order_ts), region_id;

This lets BI drains read from mv_orders_hourly during peaks while base tables absorb writes calmly. Doris chooses the MV transparently when it matches. (Apache Doris)


Workload isolation: stop noisy neighbors during bursts

When ad-driven dashboards collide with backfills, isolate them:

  • Workload Groups (in-process CPU/memory limits per BE) — set CPU shares, hard/soft memory caps, and queue/concurrency policies. CREATE WORKLOAD GROUP hot_dashboards PROPERTIES ( "cpu_share"="30", "memory_limit"="25%", "enable_memory_overcommit"="true" ); Bind BI users/services here; shove ETL into a separate group with lower shares during peak hours. (Apache Doris)
  • Concurrency & queuing — cap in-flight queries and queue excess when spikes hit to avoid OOMs. (Apache Doris)
  • Resource/Compute Groups — if you operate multi-tenant or have distinct SLOs, tag BEs into groups for stronger isolation. (Apache Doris)
  • Circuit breakers / SQL blocking — cancel outliers (e.g., >1s) or block pathological patterns during an incident. -- Example: cancel queries running > 1000ms (policy) CREATE WORKLOAD POLICY cancel_long PROPERTIES("type"="query_timeout","value"="1000","enable"="true"); Use temporary policies to protect the cluster when someone drags an accidental cross join into prod. (Apache Doris)

Quick tactics by symptom

Symptom in spikeFastest lever
CPU pegged, p95 risingLower parallel_pipeline_task_num globally a notch; move BI to a Workload Group with CPU share caps. Roll back later. (Apache Doris)
Many repeated dashboard queriesTurn on/lean on SQL Cache for stable reports; validate data freshness expectations. (Apache Doris)
Join waits / filter stallsCheck RF wait times; reduce RF wait for hot paths if necessary. (Apache Doris)
Scan concurrency too lowRaise query-level parallelism (8–16), verify CPU utilization and no thread contention. (Apache Doris)
Writes fighting readsShift heavy group-bys to async MVs with scheduled refresh; throttle ETL in a separate Workload Group. (Apache Doris)

Example spike-ready setup (end-to-end)

  1. Stats + plan sanity
ANALYZE TABLE fact_orders WITH SAMPLE PERCENT 5;
EXPLAIN /* high-traffic dashboard query */;

(Apache Doris)

  1. Pin low parallelism to hot endpoints
SELECT /*+ SET_VAR("parallel_pipeline_task_num=2") */ ...;

(Apache Doris)

  1. Front-load the dashboard with an MV
CREATE MATERIALIZED VIEW mv_dash_preagg
BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 10 MINUTE
AS SELECT ... GROUP BY ...;

(Apache Doris)

  1. Protect the cluster with a Workload Group
CREATE WORKLOAD GROUP if not exists dashboards
PROPERTIES ("cpu_share"="40","memory_limit"="30%");

(Apache Doris)


Gotchas to avoid

  • Creating a zoo of MVs “just in case.” Keep them minimal; each adds refresh + rewrite overhead. (Apache Doris)
  • Global parallelism changes without a rollback plan. Prefer per-query/session overrides; keep a runbook for restoring defaults. (Apache Doris)
  • Ignoring tablet/parallelism semantics. Newer Doris decouples scan concurrency from tablet count for some models—know which ones you use. (Apache Doris)
  • Skipping stats. Nereids without stats is like a GPS with no satellites. ANALYZE, then re-run the query. (Apache Doris)

Summary & CTA

When spikes hit, stability beats heroics. Trust Doris’ defaults (Nereids + pipeline), shape the workload with per-query SET_VAR and Workload Groups, and flatten resource peaks with small, well-chosen async MVs. Start with one dashboard: add an MV, cap its parallelism to 2–4, and place it in a Workload Group. Measure p95 and CPU before/after. If it moves the needle, templatize it across your top 5 reports.


Internal link ideas (official docs)

  • Common tuning parameters (Nereids, runtime filters, pipeline parallelism). (Apache Doris)
  • Parallelism tuning (SQL/session/global; examples; decoupled scans). (Apache Doris)
  • Pipeline execution engine (why thread counts stay sane). (Apache Doris)
  • Async MV overview (SPJG rewriting, external tables). (Apache Doris)
  • Async MV best practices (grace period, layering, triggers). (Apache Doris)
  • SQL Cache (when to cache repeated queries). (Apache Doris)
  • Workload Groups & concurrency control (CPU/memory shares, queuing). (Apache Doris)
  • Collecting statistics (ANALYZE, show stats). (Apache Doris)

Image prompt

“A clean, modern architecture diagram of an Apache Doris cluster during a traffic spike: FE planners with Nereids, BE pipeline executors with adjustable parallelism, two Workload Groups (dashboards vs ETL), and an asynchronous MV layer absorbing reads — minimalistic, high contrast, isometric 3D style.”


#ApacheDoris #Nereids #MaterializedViews #PipelineParallelism #WorkloadIsolation #DataEngineering #RealTimeAnalytics #QueryOptimization #SQL #OLAP

Leave a Reply

Your email address will not be published. Required fields are marked *