VeloDB for Real-Time Analytics: A Practical Guide for Mid-Level Data Engineers
VeloDB, built on Apache Doris, is a modern data warehouse designed for fast analytics on large-scale real-time data.
It offers both push-based micro-batch and pull-based streaming ingestion with second-level latency, alongside a storage engine that supports real-time upserts, appends, and pre-aggregations. The platform delivers strong performance for real-time data serving and interactive ad-hoc queries.
VeloDB handles semi-structured as well as structured data, and supports both real-time analytics and batch processing. It also acts as a federated query engine, providing seamless access to external data lakes and databases alongside internal tables.
The system is distributed by design and scales linearly. It can be deployed on-premises or as a cloud service, with storage and compute either separated or coupled to match workload demands.
Because it builds on open-source Apache Doris, VeloDB speaks the MySQL protocol and its function library, making integration with a wide range of data tools straightforward across different environments.
Why this matters
You’re asked to power customer-facing dashboards that feel instant, join fresh events with historical context, and keep costs sane. Traditional warehouses struggle to hold sub-second latency under high concurrency. VeloDB, a commercial distribution of Apache Doris, targets exactly this zone: real-time OLAP, simple ops, and SQL you already know. (VeloDB)
What VeloDB is (in one paragraph)
VeloDB is a real-time, cloud-native data warehouse powered by Apache Doris. It’s offered both as fully managed SaaS and BYOC (Bring Your Own Cloud) so you can run compute inside your VPC while using VeloDB’s control plane. The focus is sub-second queries and high concurrency for analytics and AI/agent workloads. (VeloDB Docs)
Core architecture (short, practical)
Under the hood you’re standing on Doris’s FE/BE design:
- FE (Frontends): SQL parsing, planning, metadata, cluster management.
- BE (Backends): Columnar storage and vectorized execution.
This two-tier MPP setup keeps ops simple while scaling horizontally. Doris speaks MySQL protocol, so basic tooling “just works.” (Apache Doris)
Mental model: FE = brains, BE = brawn. You scale reads/writes by adding BEs; you harden availability with multiple FEs.
Deployment options you’ll actually use
- SaaS: Elastic compute, storage, and cache with pay-as-you-go billing. Great for fast pilots. Pricing surfaces compute (vCPU/h), storage (GB/h), and cache (GB/h) so you can reason about spend drivers. (VeloDB)
- BYOC: VeloDB deploys an agent into your VPC (via CloudFormation/ARM). Control traffic flows via PrivateLink/private endpoints; data never leaves your cloud account. Azure/AWS templates and steps are documented end-to-end. Typical warehouse init completes in ~5–10 minutes. (VeloDB Docs)
Lakehouse, semi-structured, and AI hooks
- Lakehouse compute: Query Iceberg/Hudi on object storage; plug into catalogs (Unity/Glue/HMS). VeloDB positions itself as the real-time analytics engine sitting alongside batch engines like Spark. (VeloDB)
- Open connectors: Doris natively connects to Hive/Iceberg/Hudi and more, planning distributed reads via the MPP engine. (Apache Doris)
- Semi-structured: Official guidance and patterns cover JSON-heavy analytics. (Useful when your events/logs aren’t fully normalized.) (VeloDB)
- AI/agents: VeloDB markets “agent-facing analytics” and AI observability use cases; think fast aggregates for prompts, RAG metrics, and feature monitoring. (VeloDB)
Data modeling in VeloDB (Doris models)
Pick the table type by update semantics and access pattern:
| Table type | Best for | Notes |
|---|---|---|
| DUPLICATE KEY | Raw event streams / ad-hoc exploration | Default type; choose a short sort key (≤3 cols). (Apache Doris) |
| UNIQUE KEY | Upserts on entity state (idempotent writes) | Maintains latest row by key. (Apache Doris) |
| AGGREGATE KEY | Pre-aggregated fact tables | Applies SUM/MIN/MAX/REPLACE at write time. (Apache Doris) |
Partitioning & sort keys
Partition by a time column that matches your retention/SLA windows (day/hour). Keep sort keys minimal to maximize scan efficiency. (Apache Doris)
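To make the three table models concrete, here is a toy Python simulation of how each one treats the same incoming rows. This is illustrative only; the real merging happens inside the BE storage engine, not in client code.

```python
# Three rows arriving for two orders; order 1 is updated once.
rows = [
    {"order_id": 1, "status": "NEW", "amount": 10.0},
    {"order_id": 1, "status": "PAID", "amount": 10.0},
    {"order_id": 2, "status": "NEW", "amount": 25.0},
]

# DUPLICATE KEY: every row is kept; the key only defines sort order.
duplicate = list(rows)

# UNIQUE KEY: the latest row per key replaces earlier ones (upsert).
unique = {}
for r in rows:
    unique[r["order_id"]] = r

# AGGREGATE KEY: value columns are folded at write time (here: SUM on amount).
aggregate = {}
for r in rows:
    aggregate[r["order_id"]] = aggregate.get(r["order_id"], 0.0) + r["amount"]

print(len(duplicate))       # 3 rows kept
print(unique[1]["status"])  # PAID -- latest state wins
print(aggregate[1])         # 20.0 -- amounts summed per key
```

If you catch yourself writing the UNIQUE-style dictionary in application code on top of a DUPLICATE table, that is the signal to switch table models instead.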
Query acceleration you should plan for
- Materialized Views (MVs): Doris supports single-table synchronous MVs and multi-table asynchronous MVs with transparent query rewrite. Use them for rollups, sessionization, or dimensional pre-joins. (Apache Doris)
- Arrow Flight SQL: For fast data access from Python/Java via ADBC, enable Arrow Flight and move data off the cluster far faster than traditional drivers. (Apache Doris)
Ingestion patterns (battle-tested)
- Streaming: Feed Kafka → VeloDB via Routine Load or the Flink Doris Connector (also supports Flink CDC for MySQL, etc.). Tune batch size/intervals for throughput vs. freshness. (Apache Doris)
- Batch: Stage files (Parquet/CSV) into object storage and load. (Good for historic backfills; combine with MVs for query speed.)
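For batch loads, Doris's standard Stream Load API (an HTTP PUT to `/api/{db}/{table}/_stream_load` on the FE HTTP port) is the usual entry point. The sketch below only assembles the request; host, database, and the default port 8030 are placeholder assumptions, and sending is left to your HTTP client.

```python
import uuid

def build_stream_load(host: str, db: str, table: str, fmt: str = "csv") -> dict:
    """Assemble URL and headers for a Doris Stream Load request.
    Sending is left to your HTTP client (e.g. requests.put with
    basic auth and the file contents as the body)."""
    return {
        "url": f"http://{host}:8030/api/{db}/{table}/_stream_load",
        "headers": {
            "label": f"load-{uuid.uuid4()}",  # unique label makes retries idempotent
            "format": fmt,                    # csv, json, parquet, ...
            "Expect": "100-continue",         # lets the FE redirect to a BE
        },
    }

req = build_stream_load("fe.example.internal", "analytics", "orders_events")
print(req["url"])
```

The label header matters in practice: re-submitting a failed batch under the same label is rejected, which is how you avoid double-loading during backfills.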
End-to-end example: streaming orders with a rollup MV
Scenario: ingest orders_stream events and keep a live, sub-second aggregate for dashboards.
1) Create a raw events table (Duplicate Key)
CREATE TABLE orders_events (
event_time DATETIME NOT NULL,
order_id BIGINT NOT NULL,
customer_id BIGINT NOT NULL,
status VARCHAR(16),
amount_usd DECIMAL(12,2)
)
DUPLICATE KEY(event_time, order_id)
PARTITION BY RANGE(event_time) (
FROM ("2025-01-01") TO ("2026-01-01") INTERVAL 1 DAY
)
DISTRIBUTED BY HASH(order_id) BUCKETS 16;
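The DISTRIBUTED BY HASH(order_id) BUCKETS 16 clause shards each partition into 16 tablets by hashing the key. A quick simulation (Python's built-in hash stands in for Doris's internal hash function, which differs) shows why a high-cardinality key gives the even spread that keeps writes and scans parallel:

```python
from collections import Counter

BUCKETS = 16

def tablet_of(order_id: int) -> int:
    # Stand-in for Doris's internal hash; the point is the modulo spread.
    return hash(order_id) % BUCKETS

# 10,000 distinct order IDs land evenly across the 16 tablets.
counts = Counter(tablet_of(oid) for oid in range(10_000))
print(len(counts))                                   # all 16 buckets used
print(max(counts.values()), min(counts.values()))    # negligible skew
```

Hash a low-cardinality column (like status) instead and most tablets sit idle while a few do all the work, which is the classic cause of hot-spot tablets.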
2) Stream from Kafka (Flink Connector sketch)
-- Flink SQL (conceptual)
CREATE TABLE kafka_orders (...) WITH (...); -- your Kafka source
CREATE TABLE doris_orders_events (...) WITH (...); -- Doris/VeloDB sink
INSERT INTO doris_orders_events
SELECT event_time, order_id, customer_id, status, amount_usd
FROM kafka_orders;
Use Routine Load or Flink Doris Connector depending on how much transform you need in-stream. (Apache Doris)
3) Accelerate with an async MV
CREATE MATERIALIZED VIEW mv_orders_1min
BUILD IMMEDIATE
REFRESH AUTO ON SCHEDULE EVERY 1 MINUTE
PARTITION BY (date_trunc(ts_minute, 'day'))
DISTRIBUTED BY HASH(ts_minute) BUCKETS 16
AS
SELECT
date_trunc(event_time, 'minute') AS ts_minute,
count(*) AS orders,
sum(amount_usd) AS revenue,
sum(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) AS failed
FROM orders_events
GROUP BY date_trunc(event_time, 'minute');
Doris rewrites matching queries to hit the MV transparently; the scheduled async refresh keeps the rollup current without slowing ingest. Note the day-level MV partitions align with the base table's daily partitions, which keeps refresh incremental. (Apache Doris)
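To sanity-check what mv_orders_1min maintains, the same rollup can be mirrored in plain Python: truncate each event to the minute, then count orders, sum revenue, and count failures. Toy data only; in VeloDB the MV does this continuously server-side.

```python
from datetime import datetime
from collections import defaultdict

# (event_time, status, amount_usd) -- two events in minute 12:00, one in 12:01.
events = [
    (datetime(2025, 6, 1, 12, 0, 15), "PAID",   30.0),
    (datetime(2025, 6, 1, 12, 0, 45), "FAILED",  0.0),
    (datetime(2025, 6, 1, 12, 1, 5),  "PAID",   12.5),
]

rollup = defaultdict(lambda: {"orders": 0, "revenue": 0.0, "failed": 0})
for ts, status, amount in events:
    minute = ts.replace(second=0, microsecond=0)  # date_trunc to the minute
    row = rollup[minute]
    row["orders"] += 1
    row["revenue"] += amount
    row["failed"] += int(status == "FAILED")

first = rollup[datetime(2025, 6, 1, 12, 0)]
print(first)  # {'orders': 2, 'revenue': 30.0, 'failed': 1}
```

A dashboard query over the MV scans one pre-computed row per minute instead of every raw event, which is where the sub-second latency comes from.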
4) Read results at speed with Arrow Flight SQL (Python sketch)
import adbc_driver_manager
import adbc_driver_flightsql.dbapi as flightsql

# Connect to the FE's Arrow Flight SQL port (enable it in fe.conf/be.conf first)
conn = flightsql.connect(
    uri="grpc://your-fe:{{port}}",
    db_kwargs={
        adbc_driver_manager.DatabaseOptions.USERNAME.value: "user",
        adbc_driver_manager.DatabaseOptions.PASSWORD.value: "password",
    },
)
cur = conn.cursor()
cur.execute(
    "SELECT ts_minute, orders, revenue FROM analytics.mv_orders_1min "
    "WHERE ts_minute >= now() - INTERVAL 1 HOUR"
)
rows = cur.fetchall()  # results arrive as Arrow batches, not row-by-row
cur.close()
conn.close()
Enable Arrow Flight SQL on FE/BE and use the ADBC driver for maximum throughput. (Apache Doris)
Cost & performance tuning checklist
- Right table type first: Don’t shove events into UNIQUE if you don’t need upserts; DUPLICATE often wins for raw streams. (Apache Doris)
- Partition to your SLA: Daily partitions are a sane default; go hourly only if needed. Keep partitions aligned with MV PARTITION BY. (Apache Doris)
- Materialize what users actually query: Mirror the real query shapes. High-hit aggregates (by time bucket, by top N dimensions) deserve their own MV. (Apache Doris)
- Isolate workloads: In SaaS, run multiple compute clusters (ingest vs. BI) so heavy ETL doesn’t starve dashboards. Auto-scale clusters to user traffic. (VeloDB)
- Watch the three meters: In Cloud, you pay for compute, storage, and cache. Streaming + wide tables + over-materialization will inflate all three. Start narrow, widen with evidence. (VeloDB)
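A back-of-envelope model of the three meters makes the trade-off tangible. The unit rates below are made-up placeholders (substitute your actual contract prices); only the structure of the bill mirrors VeloDB Cloud's compute/storage/cache meters.

```python
def monthly_cost(vcpu_hours: float, storage_gb_hours: float, cache_gb_hours: float,
                 vcpu_rate: float = 0.10,      # hypothetical $/vCPU-hour
                 storage_rate: float = 0.0001,  # hypothetical $/GB-hour
                 cache_rate: float = 0.0005) -> float:
    """Sum the three meters; rates are illustrative, not VeloDB pricing."""
    return (vcpu_hours * vcpu_rate
            + storage_gb_hours * storage_rate
            + cache_gb_hours * cache_rate)

HOURS = 730  # ~one month
lean = monthly_cost(8 * HOURS, 500 * HOURS, 100 * HOURS)
wide = monthly_cost(8 * HOURS, 2000 * HOURS, 400 * HOURS)  # wider tables + extra MVs
print(round(wide - lean, 2))  # storage and cache inflate even at fixed compute
```

The point of the exercise: over-materialization shows up on the storage and cache meters before it shows up on compute, so "start narrow, widen with evidence" is also a billing strategy.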
Security & compliance notes (BYOC)
With BYOC, VeloDB provisions an agent and networking inside your VPC and drives lifecycle via a private control channel (e.g., PrivateLink). This pattern keeps data plane traffic inside your account, which is often critical for regulated environments. Cloud templates and minimum permissions are documented for AWS/Azure. (VeloDB Docs)
When VeloDB is a good fit
Choose it when you need:
- Sub-second user-facing analytics on fresh data.
- High concurrency without Kafka-to-OLTP gymnastics.
- Lakehouse adjacency with familiar SQL + MySQL protocol. (VeloDB)
If your workloads are mostly long, batchy transformations, any modern warehouse will do. VeloDB shines when latency and concurrency are the real constraints.
Common pitfalls
- Wrong table model → upserts into DUPLICATE or aggregates into UNIQUE. Revisit model before scaling hardware. (Apache Doris)
- MV sprawl → dozens of overlapping async MVs can balloon compute and cache; consolidate by query popularity. (Apache Doris)
- Mis-aligned partitions → MV refresh and partition pruning suffer if base/MV don’t align. (Apache Doris)
- One big cluster for everything → isolate ingest, BI, and data science. (VeloDB)
Summary
VeloDB gives you Doris’s fast, simple MPP core with managed cloud ergonomics and BYOC control. If your product or ops teams demand near-real-time insights with predictable costs, it’s a credible default for event-heavy analytics stacks—especially when paired with MVs, Arrow Flight SQL, and lakehouse tables. (VeloDB Docs)
Call to action: Start with a single streaming use case, one MV, and two clusters (ingest + BI). Measure latency and concurrency before you add more models.
Internal link ideas (for your site)
- “Materialized Views in Apache Doris: Patterns for Real-Time Rollups”
- “Designing Sort Keys and Partitions for Event Streams”
- “Arrow Flight SQL 101: Fast Data Access from Python”
- “BYOC vs. SaaS: Choosing the Right Deployment for Real-Time Analytics”
- “From Kafka to Dashboard: An End-to-End VeloDB Pipeline”
Image prompt (for DALL·E / Midjourney)
“A clean, modern data architecture diagram showing VeloDB on Apache Doris in a multi-cloud setup: FE/BE nodes, Kafka ingestion, materialized views, and BI users. Minimalist, high-contrast, isometric, vector style.”
Tags
#VeloDB #ApacheDoris #RealTimeAnalytics #DataWarehouse #MaterializedViews #ArrowFlightSQL #Flink #Lakehouse #BYOC #DataEngineering
More articles about VeloDB:
- “VeloDB vs. Building Your Own Real-Time Stack: When a Doris-Based Warehouse Wins” — intent: comparison, decision-stage keywords.
- “Designing Async Materialized Views in Doris/VeloDB for Sub-Second Dashboards” — intent: how-to + patterns.
- “BYOC with VeloDB: Network, Security, and Cost Guardrails” — intent: platform/ops evaluators.
- “Kafka → VeloDB in 60 Minutes: A Practical Streaming Blueprint” — intent: hands-on tutorial.
- “Arrow Flight SQL from Python: 10× Faster Reads from VeloDB” — intent: performance seekers.
- “Lakehouse + VeloDB: Querying Iceberg/Hudi at Interactive Speeds” — intent: integration architects.
- “Choosing Between DUPLICATE, UNIQUE, and AGGREGATE in Doris/VeloDB” — intent: modeling best practices.