Asynchronous Materialized Views in Apache Doris at Scale

Asynchronous Materialized Views in Apache Doris at Scale: Refresh Policies, Partitioned MVs, and Transparent Rewrite

Meta description (156 chars):
Learn how to design and operate Apache Doris asynchronous materialized views at scale—refresh modes, partitioned MVs, and transparent rewrite that speeds SQL.


The real problem (and why async MVs exist)

You have hot dashboards and complex joins over billions of rows. Recomputing aggregates on every query is wasteful; hand-rolled ETL tables drift or lag. Asynchronous materialized views (MVs) let Doris precompute results, then transparently rewrite incoming queries to hit those precomputed tables—so you keep SQL simplicity while getting warehouse-grade speed. Doris supports full and partition-incremental refresh, plus a cost-based rewriting algorithm for SPJG-style queries. (Apache Doris)


Concepts in 5 minutes

1) What an “async MV” is

An async MV is an internal table maintained by a background refresh job. You define it with a SELECT; Doris fills and refreshes it later via COMPLETE (full) or AUTO (attempt incremental) refresh. Triggers include MANUAL, ON SCHEDULE, or ON COMMIT. If you choose COMPLETE on a partitioned MV, it behaves like a non-partitioned MV (full rebuild). (Apache Doris)

2) Transparent rewrite (why queries magically speed up)

Doris examines incoming SQL and—when the shapes match—rewrites to the MV behind the scenes. The algorithm targets SPJG (SELECT-PROJECT-JOIN-GROUP BY) patterns and chooses the best MV by cost, so you don’t sprinkle hints across dashboards. Limitations apply (e.g., queries with window functions or UNION ALL aren’t rewrite candidates). (Apache Doris)

3) External tables & lakehouse nuance

MVs built over external tables (Hive/Iceberg/Hudi, etc.) can accelerate queries but don’t participate in transparent rewrite by default; enable with SET materialized_view_rewrite_enable_contain_external_table = true. Support varies by catalog and version. (Apache Doris)


Refresh policies that won’t bite you later

DecisionWhat it doesWhen to use
REFRESH … COMPLETERecomputes all MV data.Small MVs; schemas in flux; one-off fixes. (Apache Doris)
REFRESH … AUTOCalculates only partitions that changed since last refresh; falls back to full if needed.Large/partitioned fact tables; rolling windows. (Apache Doris)
Triggers: MANUAL / ON SCHEDULE / ON COMMITWhen refresh happens.ON SCHEDULE for periodic SLAs; ON COMMIT for tight freshness on smaller sources; MANUAL for ad hoc rebuilds. (Apache Doris)

Edge case with external tables: AUTO assumes them synchronized; use COMPLETE or explicit partitions when refreshing MV data built on external sources. (Apache Doris)

Staleness control: the MV property grace_period (seconds) lets rewrite use slightly stale partitions—handy when a dashboard tolerates small lag. (Apache Doris)


Partitioned MVs: the workhorse pattern

Partitioned MVs mirror the base table’s partitioning (often time). They’re your go-to for big facts because they refresh only what changed and still allow rewrite even when some partitions are invalid—Doris will UNION ALL valid MV partitions with base table data to serve the query. That gives graceful degradation instead of “no hit, full scan.” (Apache Doris)

When to choose partitioned MVs

  • Base table is large and partitioned.
  • Non-partitioned dimensions change infrequently (or you mark them as excluded).
  • The MV definition satisfies partition-derivation rules (see CREATE syntax requirements). (Apache Doris)

Keep MVs lean over time: limit MV data to the last N partitions using properties like partition_sync_limit and partition_sync_time_unit (e.g., “keep only last 90 days”). (Apache Doris)


A practical recipe (DDL you can paste)

1) Define a partitioned MV with scheduled, incremental refresh

-- Aggregate orders by day and join a small dimension
CREATE MATERIALIZED VIEW mv_orders_daily
BUILD IMMEDIATE
REFRESH AUTO
ON SCHEDULE EVERY 5 MINUTE
PARTITION BY (DATE_TRUNC(o_orderdate, 'DAY'))
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES (
  "grace_period" = "30",                     -- allow 30s staleness for rewrite
  "use_for_rewrite" = "true",                -- participate in transparent rewrite
  "workload_group" = "mv_refresh_light",     -- isolate refresh resource usage
  "partition_sync_limit" = "90",             -- keep only last 90 partitions
  "partition_sync_time_unit" = "DAY",
  "excluded_trigger_tables" = "dim_region"   -- changes here won't invalidate all
)
AS
SELECT
  DATE_TRUNC(o.orderdate, 'DAY') AS order_day,
  o.region_id,
  SUM(o.amount) AS revenue
FROM orders o
LEFT JOIN dim_region r ON o.region_id = r.id
GROUP BY DATE_TRUNC(o.orderdate, 'DAY'), o.region_id;

Why this works:

  • AUTO + PARTITION BY means incremental refresh only for changed days.
  • grace_period gives BI queries headroom to rewrite during near-real-time ingest.
  • excluded_trigger_tables keeps slow-changing dims from invalidating every partition.
    All features are straight from the official CREATE/Best-Practices docs. (Apache Doris)

2) Operate it safely

-- Let Doris figure out what changed
REFRESH MATERIALIZED VIEW mv_orders_daily AUTO;

-- Force a full rebuild (e.g., schema drift fix)
REFRESH MATERIALIZED VIEW mv_orders_daily COMPLETE;

-- Refresh only specific partitions
REFRESH MATERIALIZED VIEW mv_orders_daily
  PARTITIONS(p_2025_11_15, p_2025_11_16);

These commands match the official REFRESH MATERIALIZED VIEW statement and behaviors. (Apache Doris)

3) Make external tables eligible for rewrite (if you use them)

SET materialized_view_rewrite_enable_contain_external_table = true;

This opt-in is required when your MV involves external tables and you still want transparent rewrite. Check the per-catalog support notes. (Apache Doris)


Design checklist (battle-tested)

Shape your MV for rewrite

  • Target SPJG-style queries; avoid UNION ALL, ORDER BY, LIMIT in the MV if you want rewrite hits. Window functions won’t rewrite. (Apache Doris)

Choose refresh policy by SLA

  • If freshness ≤ minutes and data is partitioned, prefer AUTO + ON SCHEDULE (e.g., every 5–10 min).
  • For trickle-writes on small tables, ON COMMIT is viable. For big rebuilds, MANUAL + ops playbook. (Apache Doris)

Control blast radius

  • Use workload_group to throttle refresh resource consumption.
  • Keep the number of MVs reasonable; too many MVs increase rewrite planning cost and refresh overhead. Periodically retire unused MVs. (Apache Doris)

Model for partition-incremental

  • Ensure base facts are partitioned (RANGE/LIST), pick one partition key in the MV, and satisfy the partition-derivation rules in the CREATE docs. (Apache Doris)

Operate with guardrails

  • When a referenced non-partitioned table changes often, either move it to a partitioned fact or use excluded_trigger_tables.
  • For external-table MVs, remember AUTO’s caveat and prefer COMPLETE or explicit partitions when refreshing. (Apache Doris)

Common pitfalls (and quick escapes)

  • No rewrite hit when one partition is stale. That’s expected for that slice; use partitioned MVs so Doris can still rewrite for the other partitions (MV UNION ALL base). (Apache Doris)
  • MV keeps growing forever. Set partition_sync_limit to retain only recent partitions. (Apache Doris)
  • External table MV “doesn’t accelerate.” Enable rewrite for external tables via the session variable; verify catalog support. (Apache Doris)
  • Over-eager full refreshes. Prefer AUTO and specify partitions for hotfixes; reserve COMPLETE for known rebuilds. (Apache Doris)

Summary & call-to-action

Async MVs give Doris the speed of precomputation and the ease of SQL, with scalable refresh and smart rewrite. Start with one high-value query pattern, build a partitioned MV with AUTO refresh and a short schedule, set a small grace_period, and measure query latency and CPU. If it pays off, standardize the pattern across your busiest datasets. (Apache Doris)


Internal link ideas (official docs)

  • Create Async MV — syntax, triggers, properties (grace_period, use_for_rewrite, partition_sync_limit). (Apache Doris)
  • Refresh MVAUTO, COMPLETE, partition refresh, external-table caveat. (Apache Doris)
  • Best Practices — partitioned MV strategy, invalid partitions, excluded triggers. (Apache Doris)
  • Transparent rewrite — SPJG model and limits; external-table toggle and catalog support. (Apache Doris)

Image prompt

“A clean, modern diagram of Apache Doris async materialized views: base partitioned fact table flows into a partitioned MV with AUTO refresh on a 5-minute schedule; FE plans queries that transparently rewrite to MV (with UNION ALL to base for stale partitions); include knobs like grace_period and workload_group — minimalistic, high contrast, 3D isometric style.”


Tags

#ApacheDoris #MaterializedViews #QueryAcceleration #RealTimeAnalytics #DataEngineering #Lakehouse #SQL #BigData #Performance #WarehouseOps

Leave a Reply

Your email address will not be published. Required fields are marked *