Doris – Data/ML Engineer Blog

Apache Doris for Real-Time Analytics: A Practical Guide for Mid-Level Data Engineers

Apache Doris serves as a cutting-edge data warehouse tailored for real-time analytics, enabling exceptionally rapid analysis of data at scale.

It features both push-based micro-batch and pull-based streaming data ingestion that occurs within a second, alongside a storage engine capable of real-time upserts, appends, and pre-aggregation.

With its columnar storage architecture, MPP design, cost-based query optimization, and vectorized execution engine, it is optimized for handling high-concurrency and high-throughput queries efficiently.

Moreover, it allows for federated querying across various data lakes, including Hive, Iceberg, and Hudi, as well as relational databases such as MySQL and PostgreSQL.

Doris supports complex data types like Array, Map, and JSON, and includes a Variant data type that facilitates automatic inference for JSON structures, along with advanced text search capabilities through NGram bloomfilters and inverted indexes.

Its distributed architecture ensures linear scalability and incorporates workload isolation and tiered storage to enhance resource management.

Additionally, it accommodates both shared-nothing clusters and the separation of storage from compute resources, providing flexibility in deployment and management.

Why Doris matters (and when it saves your bacon)

You’ve got product managers asking for “sub-second” dashboards, logs streaming nonstop, and a BI team that won’t give up SQL. Row stores choke. Spark jobs are too slow. You need an OLAP engine that ingests quickly, speaks MySQL-flavored SQL, and answers fast. That’s where Apache Doris shines: a distributed, columnar MPP database built for low-latency analytics with simple ops. (Apache Doris)

What Apache Doris is (in one mental model)

Doris is a distributed, columnar, MPP system with two core node types:

Frontend (FE): metadata, query parsing/optimization, cluster management.
Backend (BE): storage and query execution (vectorized + pipeline engines). (Apache Doris)

It speaks the MySQL protocol, so your existing MySQL clients/JDBC/ODBC and BI tools can connect with minimal friction. (Apache Doris)

Doris runs in compute–storage coupled or decoupled modes (object storage lakehouse style), so you can choose ops simplicity or cloud elasticity. (Apache Doris)

Core architecture, mapped to the real problems you have

Concept	What it is	Why it matters
Tablet	The smallest physical shard of data	Drives parallelism, replication, balancing, repair. (Apache Doris)
Partition → Bucket	Two-level distribution: range/list partitions; each partition split into buckets → tablets via HASH or RANDOM	Controls scan volume & concurrency; avoids skew. (Apache Doris)
Replication	Default replication_num = 3 per tablet	Availability & repair; can tune per table/partition. (Apache Doris)
Colocation	Aligns shard layouts across tables	Enables local (no-shuffle) joins when schemas/buckets match. (Apache Doris)
Execution	Vectorized operators + Pipeline engine	Keeps CPUs hot, limits thread explosion; big win on wide-table aggregations. (Apache Doris)
Optimizer	Nereids (CBO/RBO)	Smarter plans by default; enabled in 2.x+. (Apache Doris)

Data modeling in Doris (the 80/20 you need)

1) Pick the right table model

DUPLICATE KEY – raw event/detail (no pre-agg); best for flexible ad-hoc.
UNIQUE KEY – upserts on primary key.
AGGREGATE KEY – pre-aggregated rollups (SUM/MIN/MAX/HLL/Bitmap). (Apache Doris)

Tip: Models are fixed at create-time—decide up front. (Apache Doris)

2) Partitioning & bucketing that actually works

Partition by a natural prune key (date/tenant).
Bucket by a high-cardinality join/filter key (e.g., user_id) via DISTRIBUTED BY HASH(col) BUCKETS N. Doris also supports RANDOM bucketing and Auto Partition when ranges are unpredictable. (Apache Doris)

Rule of thumb: Tablet count = partitions × buckets × replication. Ensure enough tablets for concurrency, but don’t create “microscopic” tablets. (Apache Doris)

3) Index & summary structures

Built-ins: prefix & ZoneMap (automatic).
Secondary: BloomFilter, Bitmap, Inverted/N-Gram Bloom for text/searchy workloads. (Apache Doris)
HLL / Bitmap data types for DISTINCT counts (approx/precise). (Apache Doris)

Note: Bloom filters are set via table property bloom_filter_columns. (Apache Doris)

Getting data in (and keeping it fresh)

Common paths to production:

Routine Load (Kafka) – continuous ingest; exactly-once semantics for CSV/JSON. Great for CDC/logs. (Apache Doris)
Stream Load – push via HTTP for micro-batches. (Apache Doris)
Broker/Bulk Load – pull from object storage/HDFS (async). (Apache Doris)
Lakehouse catalogs – query/write external Iceberg/Hive/Hudi; then INSERT INTO … SELECT … to land data internally when needed. (Apache Doris)

Query acceleration that feels “unfair”

Nereids planner on by default—cost-based decisions without hand-tuning. (Apache Doris)
Pipeline execution maximizes multicore throughput. (Apache Doris)
Asynchronous Materialized Views (MVs) with AUTO/COMPLETE/PARTITION refresh & transparent rewrite; ideal for T+1 rollups. (Apache Doris)

Hands-on example: events to daily revenue in minutes

Create a detail table (DUPLICATE KEY), partitioned and bucketed

CREATE TABLE fact_orders (
  order_dt       DATE,
  order_ts       DATETIME,
  user_id        BIGINT,
  sku            BIGINT,
  qty            INT,
  price_cents    BIGINT,
  city           VARCHAR(64)
)
DUPLICATE KEY(order_dt, user_id)
PARTITION BY RANGE(order_dt) (
  PARTITION p2025_11 VALUES LESS THAN ("2025-12-01")
)
DISTRIBUTED BY HASH(user_id) BUCKETS 16
PROPERTIES (
  "replication_num" = "3",
  "bloom_filter_columns" = "user_id, sku"
);

Why this works: DUPLICATE stores raw facts; partition prune on date; hash bucket on user_id for joins; BloomFilter on high-selectivity columns. (Apache Doris)

Real-time ingest from Kafka (exactly-once)

CREATE ROUTINE LOAD ecommerce.load_orders ON fact_orders
COLUMNS TERMINATED BY ",",
COLUMNS(order_dt, order_ts, user_id, sku, qty, price_cents, city)
FROM KAFKA (
  "kafka_broker_list" = "broker1:9092,broker2:9092",
  "kafka_topic" = "orders",
  "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);

Doris continuously consumes the topic and loads rows—no cron needed. (Apache Doris)

Roll up with an async MV (transparent rewrite)

CREATE ASYNC MATERIALIZED VIEW mv_daily_revenue AS
SELECT
  order_dt,
  SUM(qty * price_cents) AS revenue_cents,
  COUNT(DISTINCT user_id) AS buyers
FROM fact_orders
GROUP BY order_dt;

-- Later (optional) refresh strategy:
REFRESH MATERIALIZED VIEW mv_daily_revenue AUTO;

Queries like SELECT * FROM mv_daily_revenue WHERE order_dt >= '2025-11-01'; are transparently rewritten when possible. Use AUTO/COMPLETE/PARTITION per freshness needs. (Apache Doris)

Tuning playbook (keep it simple first)

Let Nereids + Pipeline do their job. Start with defaults; only tweak parallel_pipeline_task_num for special cases. (Apache Doris)
Design for pruning. Partition by your most common time filter; bucket by join/filter keys. (Apache Doris)
Co-locate heavy joins. Use colocation groups to avoid network shuffles. (Apache Doris)
Choose the right model. DUPLICATE for events, UNIQUE for upserts, AGGREGATE for pre-agg. (Apache Doris)
Index surgically. Bloom for point/range membership; Bitmap for precise dedup & segment ops; rely on built-ins first. (Apache Doris)
Replicate sanely. Default 3; align with fault domains and storage budget. (Apache Doris)

Common pitfalls (and how to dodge them)

Too many tiny tablets. Don’t explode partitions/buckets; target enough for concurrency, not mayhem. (Apache Doris)
Forgetting refresh semantics. Async MVs are not for hard real-time; pick AUTO/COMPLETE/PARTITION aligned to SLAs. (Apache Doris)
Skewed buckets. Avoid low-cardinality bucket keys; use RANDOM when distribution is unknown. (Apache Doris)
Blindly “indexing everything.” Start with model + distribution; add secondary indexes only where they pay. (Apache Doris)

Conclusion & takeaways

Doris = speed + simplicity for SQL analytics at scale.
Model with DUP/UNIQ/AGG keys, prune with partitions, spread with buckets.
Ingest via Routine Load for streams; accelerate with MVs; let Nereids + Pipeline optimize first.
Keep replication at sane defaults and co-locate heavy joins.

Call to action: Pick one workload (e.g., your clickstream or CDC feed), land it in a single Doris table using the DDL above, add a targeted MV, and measure end-to-end latency. You’ll know in a day if Doris fits.

Internal link ideas (official docs)

What is Apache Doris? (overview & architectures). (Apache Doris)
Create Table (syntax & properties incl. replication_num). (Apache Doris)
Partitioning & Bucketing (hash vs random; auto partition). (Apache Doris)
Table Models (Duplicate/Unique/Aggregate). (Apache Doris)
Index Overview (prefix/ZoneMap/secondary). (Apache Doris)
BloomFilter / Bitmap / HLL (when & how). (Apache Doris)
Routine Load (Kafka) (exactly-once stream ingest). (Apache Doris)
Materialized Views (async overview + refresh). (Apache Doris)
Pipeline Execution and Tuning Parameters (Nereids, parallelism). (Apache Doris)
Compute–Storage Decoupled (cloud objects). (Apache Doris)

Image prompt (for your designer/AI tool)

“A clean, modern data architecture diagram showing Apache Doris with FE nodes (planning/metadata) and BE nodes (storage/compute) over partition→bucket→tablet shards, streaming ingestion from Kafka (Routine Load), and an async materialized view accelerating queries — minimalistic, high contrast, 3D isometric style.”

Bonus: To understand better:

Designing Partitions & Buckets in Doris: from rules of thumb to Auto Partition for time-series. (Apache Doris)
Exactly-Once Kafka Ingestion with Routine Load: offsets, schema drift, backpressure, and SLAs. (Apache Doris)
Asynchronous Materialized Views at Scale: refresh policies, partitioned MVs, and transparent rewrite. (Apache Doris)
Colocation vs Bucket Shuffle: when colocated joins pay off and how to enforce them. (Apache Doris)
From Lake to Warehouse: querying and writing Iceberg/Hudi/Hive via Multi-Catalog, with patterns for tiered storage. (Apache Doris)
Tuning Doris for Spiky Traffic: Nereids knobs, pipeline parallelism, and MV selection heuristics. (Apache Doris)

Data/ML Engineer Blog