Designing Async Materialized Views in VeloDB

Designing Async Materialized Views in VeloDB (Apache Doris) for Sub-Second Dashboards

Meta description (159 chars):
Learn how to design, schedule, and tune asynchronous materialized views in VeloDB (Apache Doris) for sub-second BI dashboards with safe rewrites and low costs.


Why this matters

Your stakeholders want “live” dashboards, not nightly reports. Async materialized views (MVs) let you precompute joins/aggregations, then route queries to those results automatically—so dashboards feel instant without burning compute on every request. In VeloDB (built on Apache Doris), async MVs support multi-table definitions, incremental refresh, and transparent query rewrite, which is exactly what user-facing analytics need. (VeloDB Docs)


Concept in plain terms

Two flavors of MVs exist:

  • Synchronous MV (sync MV): maintained in real-time, single-table only; great when you need strong freshness. (VeloDB Docs)
  • Asynchronous MV (async MV): single- or multi-table, refreshed on a schedule or on demand, with options for partition-incremental refresh. That makes it ideal for sub-second reads with minute-level freshness. (VeloDB Docs)

Transparent rewrite: Doris/VeloDB automatically matches eligible queries to your MV (SPJG algorithm), so your BI tools keep the original SQL while the engine serves results from the MV. Use EXPLAIN to confirm rewrites. (Apache Doris)


Quick mental model (and constraints)

  • Build & refresh: BUILD IMMEDIATE | DEFERRED, then REFRESH COMPLETE | AUTO with ON MANUAL or ON SCHEDULE EVERY <n> <unit>. AUTO attempts partition-incremental refresh. (Apache Doris)
  • Partitioning: Choose MV partitions that align with the base table’s time partition (e.g., daily), otherwise you’ll forfeit incremental refresh and pruning. (Apache Doris)
  • Grace period: For transparent rewrite during ingest churn, you can allow a bounded staleness window via grace_period. (Apache Doris)
  • External sources: Incremental change detection currently covers internal tables and Hive; other sources may require REFRESH COMPLETE. (Apache Doris)

Sync vs. Async MVs (choose deliberately)

FeatureSync MVAsync MV
Query rewriteAutomaticAutomatic
Direct queryNot supportedSupported
Tables allowedSingle-table onlySingle- or multi-table
FreshnessReal-timeScheduled or manual (eventual)
Incremental refreshN/APartition-incremental via AUTO

Source: VeloDB/Doris MV overviews and async MV docs. (VeloDB Docs)


A practical blueprint (orders → 1-minute rollup)

Goal: live minute-bucket KPIs (orders, revenue) refreshed every 5 minutes with partition-incremental updates.

1) Base events (already time-partitioned)

CREATE TABLE orders_events (
  event_time    DATETIME NOT NULL,
  order_id      BIGINT   NOT NULL,
  customer_id   BIGINT   NOT NULL,
  status        VARCHAR(16),
  amount_usd    DECIMAL(12,2)
)
DUPLICATE KEY(event_time, order_id)
PARTITION BY RANGE(event_time) (
  FROM ("2025-01-01") TO ("2026-01-01") INTERVAL 1 DAY
)
DISTRIBUTED BY HASH(order_id) BUCKETS 16;

Ingest continuously (Kafka → Routine Load or Flink connector). (Apache Doris)

2) Async MV (minute rollup, scheduled refresh)

CREATE MATERIALIZED VIEW mv_orders_1min
BUILD IMMEDIATE
REFRESH AUTO ON SCHEDULE EVERY 5 MINUTE
PARTITION BY date_trunc('day', event_time)
DISTRIBUTED BY RANDOM BUCKETS 16
AS
SELECT
  date_trunc('minute', event_time) AS ts_minute,
  COUNT(*)                         AS orders,
  SUM(amount_usd)                  AS revenue,
  SUM(CASE WHEN status='FAILED' THEN 1 ELSE 0 END) AS failed
FROM orders_events
GROUP BY date_trunc('minute', event_time);
  • REFRESH AUTO enables partition-incremental refresh (changed base partitions only).
  • PARTITION BY date_trunc('day', …) aligns MV partitions with the table’s time window, preserving pruning and incremental behavior. (Apache Doris)

3) Verify rewrites (and that your MV is “hit”)

EXPLAIN MEMO PLAN
SELECT date_trunc('minute', event_time) AS ts_minute,
       COUNT(*), SUM(amount_usd)
FROM orders_events
WHERE event_time >= now() - INTERVAL 1 HOUR
GROUP BY date_trunc('minute', event_time);

Look for the MATERIALIZATIONS section showing the chosen MV. (Apache Doris)

4) Operate and observe

  • Check MV status:
SELECT *
FROM mv_infos('database'='analytics')
WHERE Name = 'mv_orders_1min';
  • Force refresh: REFRESH MATERIALIZED VIEW mv_orders_1min AUTO; or ... COMPLETE;
  • Pause/Resume: PAUSE MATERIALIZED VIEW ...; / RESUME MATERIALIZED VIEW ...;
  • Zero-downtime change: ALTER MATERIALIZED VIEW mv_old REPLACE WITH MATERIALIZED VIEW mv_new; (atomic swap). (Apache Doris)

Best practices (what actually works)

  1. Design to the query, not the schema. Start from your hot BI queries and mirror their shape (grouping keys, filters, joins). The closer the MV to the query, the bigger the speedup—at the cost of generality. Balance both. (Apache Doris)
  2. Partition for incremental refresh. Pick daily (or hourly) partitions that match your SLA; misalignment kills pruning and forces full refresh. (Apache Doris)
  3. Use grace_period to tolerate ingestion lag. It lets rewrites proceed within a small staleness window, maintaining snappy dashboards during heavy writes. (Apache Doris)
  4. Limit MV sprawl. Fewer, well-used MVs beat dozens of near-duplicates. Validate with EXPLAIN and usage metrics before creating another. (Apache Doris)
  5. Mind external sources. If your MV reads JDBC/external tables, Doris can’t always detect changes—prefer COMPLETE or land the data internally/Hive for incremental detection. (Apache Doris)
  6. Schedule by business need, not habit. Start at 5–15 minutes for product dashboards; tighten only if users truly need it. (Compute isn’t free.)

Common pitfalls (and how to avoid them)

  • Wrong MV type: Trying to sync multi-table logic with sync MVs won’t fly—use async for joins/rollups. (VeloDB Docs)
  • Over-wide GROUP BY: MV cardinality explodes; aggregate at the grain users actually slice by.
  • Unaligned time buckets: If users query minute buckets but your MV is hourly, rewrites often miss. Align date_trunc() with query buckets. (Apache Doris)
  • One giant nightly refresh: Prefer incremental (AUTO) plus a periodic full (e.g., daily) to heal drift. (Apache Doris)

Optional: speed up data retrieval from Python

If you extract MV results to services or notebooks, use Arrow Flight SQL with the Python ADBC driver for high-throughput reads:

import adbc_driver_flightsql.dbapi as flightsql
conn = flightsql.connect(uri="grpc://<fe-host>:<port>", db="analytics")
cur  = conn.cursor()
cur.execute("""
  SELECT ts_minute, orders, revenue
  FROM mv_orders_1min
  WHERE ts_minute >= now() - interval 1 hour
""")
rows = cur.fetchall()

Docs: Arrow Flight SQL connection guides for Doris/VeloDB. (Apache Doris)


Conclusion & takeaways

Async MVs are the right tool when you need fast, predictable dashboards with minute-level freshness. Design your MV from the top query, align partitions and time buckets, enable incremental refresh, and set a grace period to ride out ingest spikes. Then use EXPLAIN and mv_infos to verify results, and evolve definitions with atomic replacement—no downtime. (Apache Doris)

Call to action: Start with one high-value dashboard, ship a single async MV, confirm rewrite, then iterate on schedule and grain based on real usage.


Internal link ideas (official-docs friendly)

  • “Async MV overview and creation syntax” (VeloDB/Doris) (VeloDB Docs)
  • “Transparent query rewriting details” (Doris) (Apache Doris)
  • “Routine Load from Kafka” (Doris) (Apache Doris)
  • “Arrow Flight SQL connection guide” (Doris/VeloDB) (Apache Doris)

Image prompt (for DALL·E / Midjourney)

“A clean, modern diagram of VeloDB (Apache Doris) async materialized views: base event table → scheduled incremental refresh → MV partitions → transparent rewrite to BI queries. Minimalist, high-contrast, isometric style.”


Tags

#VeloDB #ApacheDoris #MaterializedViews #RealTimeAnalytics #DataEngineering #BI #QueryOptimization #Streaming #Kafka #ArrowFlightSQL

Leave a Reply

Your email address will not be published. Required fields are marked *