Choosing Between DUPLICATE, UNIQUE, and AGGREGATE in VeloDB (Apache Doris): A Field Guide for Data Engineers
Meta description (158 chars):
Learn when to use DUPLICATE, UNIQUE, or AGGREGATE tables in VeloDB (Apache Doris). See patterns, SQL examples, trade-offs, and pitfalls for real-time analytics.
Why this matters
Picking the wrong table model locks you into painful rewrites later. In VeloDB (powered by Apache Doris), the table model defines how data is stored, updated, and aggregated—which drives latency, cost, and correctness. Choose wisely up front, then scale cleanly. (Apache Doris)
The three models at a glance
| Model | What it does | Use when | You get | You give up |
|---|---|---|---|---|
| DUPLICATE | Stores raw rows; sort by key only | Append-only events, exploratory analytics | Simplicity, flexible queries | No automatic dedup/updates |
| UNIQUE | UPSERT by key (last write wins) | Slowly changing entities, idempotent ingestion | Fresh “latest state” rows | Order of replacements isn’t guaranteed across batches |
| AGGREGATE | Pre-aggregates by key using functions (SUM/MIN/MAX/REPLACE) | Rollups, metrics tables | Smaller storage + faster scans | Raw detail is discarded |
Sources: official Doris/VeloDB docs. (Apache Doris)
Concept & architecture (clear mental model)
Think in Key columns and Value columns. Key columns (the ones after DUPLICATE KEY, UNIQUE KEY, or AGGREGATE KEY) drive sort/group semantics; value columns are aggregated or replaced depending on model. (Apache Doris)
- DUPLICATE: keys define sort order only. Multiple rows with same key can coexist. Keep sort keys short (≤3 cols) for scan efficiency. (Apache Doris)
- UNIQUE: keys are unique; new loads upsert—overwrite existing rows with same key. Great for “latest state” tables. (Apache Doris)
- AGGREGATE: value columns have aggregate functions; storage keeps only aggregated results, not raw rows. (Apache Doris)
Quick decision rubric (use this in design reviews)
- Will you ever update past records?
- Yes → start with UNIQUE.
- No, append only → DUPLICATE or AGGREGATE.
- Do queries mostly need rollups (time buckets, counts, sums)?
- Yes → AGGREGATE (pre-aggregate to cut cost/latency).
- No, detail exploration → DUPLICATE.
- Do you need raw history available?
- Yes → DUPLICATE (or keep a raw DUPLICATE alongside an AGGREGATE MV).
- No → AGGREGATE.
- Are you merging multiple sources with late or conflicting updates?
- Yes → UNIQUE, but be aware: replacement order isn’t guaranteed across batches—design merges to be idempotent. (Apache Doris)
SQL you can copy (minimal, production-leaning)
1) DUPLICATE: raw events (append-only)
CREATE TABLE events_clicks (
event_time DATETIME NOT NULL,
user_id BIGINT NOT NULL,
session_id BIGINT NOT NULL,
page VARCHAR(200),
revenue_usd DECIMAL(12,2)
)
DUPLICATE KEY(event_time, user_id)
PARTITION BY RANGE(event_time) (
FROM ("2025-11-01") TO ("2025-12-01") INTERVAL 1 DAY
)
DISTRIBUTED BY HASH(user_id) BUCKETS 16;
Notes: DUPLICATE is the default model; choose ≤3 sort key columns to keep scans fast. (Apache Doris)
2) UNIQUE: latest customer state (UPSERT)
CREATE TABLE dim_customer_latest (
customer_id BIGINT NOT NULL,
name STRING,
tier STRING,
updated_at DATETIME
)
UNIQUE KEY(customer_id)
PARTITION BY RANGE(updated_at) (
FROM ("2025-11-01") TO ("2025-12-01") INTERVAL 1 DAY
)
DISTRIBUTED BY HASH(customer_id) BUCKETS 16;
All standard load methods support UPSERT for UNIQUE: if the key exists, the row is replaced; otherwise it’s inserted. Design your pipelines to be idempotent. (Apache Doris)
3) AGGREGATE: pre-aggregated metrics
CREATE TABLE metrics_orders_daily (
order_date DATE NOT NULL,
merchant_id BIGINT NOT NULL,
orders BIGINT SUM,
revenue_usd DECIMAL(18,2) SUM,
max_order DECIMAL(18,2) MAX
)
AGGREGATE KEY(order_date, merchant_id)
PARTITION BY RANGE(order_date) (
FROM ("2025-11-01") TO ("2025-12-01") INTERVAL 1 DAY
)
DISTRIBUTED BY HASH(merchant_id) BUCKETS 16;
Here, storage keeps aggregated values only, cutting scans and cost for dashboard queries. (Apache Doris)
Real example: choose the right model for a streaming dashboard
Scenario: You ingest Kafka orders and power a real-time revenue dashboard plus drill-downs.
- Raw firehose (for data science & audits) →
orders_eventsas DUPLICATE. - Operational views (customer’s latest tier, KYC flags) →
dim_customer_latestas UNIQUE (UPSERT). - Top-N and time-bucket charts →
metrics_orders_1minas AGGREGATE (or create an async MV over the raw table).
This hybrid keeps your detail and your speed without turning every query into an expensive scan. (Materialization guidance is in Doris docs.) (Apache Doris)
Best practices (do these first)
- Model to the write pattern. If you don’t truly need updates, do not pick UNIQUE; DUPLICATE is cheaper and simpler. (Apache Doris)
- Align partitions to SLA. Daily partitions are a sane default; only go hourly when query SLAs demand it.
- Keep sort keys tight. Start with time + one high-cardinality dimension; avoid long composite keys. (Apache Doris)
- Pre-aggregate intentionally. If 80% of queries hit the same rollup, make it AGGREGATE (or an MV) and let Doris rewrite queries. (Apache Doris)
- Design for idempotency in UNIQUE. Since replacement order across batches isn’t guaranteed, avoid logic that depends on arrival order. (Apache Doris)
Common pitfalls (don’t learn these the hard way)
- Using UNIQUE for event streams just to “dedup later.” You’ll pay more and still fight late-arriving events. Prefer DUPLICATE + downstream dedup or AGGREGATE if you only need rollups. (Apache Doris)
- Over-aggregating early. If analysts need detail, AGGREGATE will frustrate them. Keep a raw DUPLICATE table (or lake table) alongside. (Apache Doris)
- Huge sort keys. They slow writes and scans. Trim to what helps pruning most. (Apache Doris)
- Assuming UPSERT preserves order. It doesn’t; build deterministic merges. (Apache Doris)
FAQ (quick hits)
- Can I switch models later? No—model is fixed at create time. Plan ahead and keep raw detail to enable rebuilds. (Apache Doris)
- Where does VeloDB differ from “vanilla” Doris? VeloDB is a managed distribution with SaaS/BYOC options; modeling semantics follow Doris. (VeloDB Docs)
Summary & call-to-action
- DUPLICATE → raw, flexible exploration.
- UNIQUE → latest state via UPSERT.
- AGGREGATE → pre-computed rollups for speed.
Start with a single use case: model it using this rubric, load a day of data, and benchmark end-to-end (ingest → query). If latency or cost miss the mark, adjust partitioning, sort keys, or model—not just horsepower.
Internal link ideas (for your site)
- “Designing Sort Keys & Partitions in Doris/VeloDB”
- “Async Materialized Views: Patterns for Real-Time Rollups”
- “Kafka → VeloDB: Streaming Ingestion Options”
- “BYOC vs. SaaS Deployments for Real-Time Analytics”
Image prompt (for DALL·E / Midjourney)
“A clean, modern diagram comparing three table models in VeloDB/Doris—DUPLICATE, UNIQUE, AGGREGATE—with arrows showing write paths and query flows; minimalistic, high-contrast, isometric style.”
Tags
#VeloDB #ApacheDoris #DataModeling #RealTimeAnalytics #UPSERT #PreAggregation #DataEngineering #OLAP #Kafka #SQL
All references are to official VeloDB/Apache Doris docs and site. (Apache Doris)










Leave a Reply