Choosing Between DUPLICATE, UNIQUE, and AGGREGATE in VeloDB (Apache Doris): A Field Guide for Data Engineers

Meta description (158 chars):
Learn when to use DUPLICATE, UNIQUE, or AGGREGATE tables in VeloDB (Apache Doris). See patterns, SQL examples, trade-offs, and pitfalls for real-time analytics.

Why this matters

Picking the wrong table model locks you into painful rewrites later. In VeloDB (powered by Apache Doris), the table model defines how data is stored, updated, and aggregated—which drives latency, cost, and correctness. Choose wisely up front, then scale cleanly. (Apache Doris)

The three models at a glance

Model	What it does	Use when	You get	You give up
DUPLICATE	Stores raw rows; sort by key only	Append-only events, exploratory analytics	Simplicity, flexible queries	No automatic dedup/updates
UNIQUE	UPSERT by key (last write wins)	Slowly changing entities, idempotent ingestion	Fresh “latest state” rows	Order of replacements isn’t guaranteed across batches
AGGREGATE	Pre-aggregates by key using functions (SUM/MIN/MAX/REPLACE)	Rollups, metrics tables	Smaller storage + faster scans	Raw detail is discarded

Sources: official Doris/VeloDB docs. (Apache Doris)

Concept & architecture (clear mental model)

Think in Key columns and Value columns. Key columns (the ones after DUPLICATE KEY, UNIQUE KEY, or AGGREGATE KEY) drive sort/group semantics; value columns are aggregated or replaced depending on model. (Apache Doris)

DUPLICATE: keys define sort order only. Multiple rows with same key can coexist. Keep sort keys short (≤3 cols) for scan efficiency. (Apache Doris)
UNIQUE: keys are unique; new loads upsert—overwrite existing rows with same key. Great for “latest state” tables. (Apache Doris)
AGGREGATE: value columns have aggregate functions; storage keeps only aggregated results, not raw rows. (Apache Doris)

Quick decision rubric (use this in design reviews)

Will you ever update past records?

Yes → start with UNIQUE.
No, append only → DUPLICATE or AGGREGATE.

Do queries mostly need rollups (time buckets, counts, sums)?

Yes → AGGREGATE (pre-aggregate to cut cost/latency).
No, detail exploration → DUPLICATE.

Do you need raw history available?

Yes → DUPLICATE (or keep a raw DUPLICATE alongside an AGGREGATE MV).
No → AGGREGATE.

Are you merging multiple sources with late or conflicting updates?

Yes → UNIQUE, but be aware: replacement order isn’t guaranteed across batches—design merges to be idempotent. (Apache Doris)

SQL you can copy (minimal, production-leaning)

1) DUPLICATE: raw events (append-only)

CREATE TABLE events_clicks (
  event_time   DATETIME NOT NULL,
  user_id      BIGINT   NOT NULL,
  session_id   BIGINT   NOT NULL,
  page         VARCHAR(200),
  revenue_usd  DECIMAL(12,2)
)
DUPLICATE KEY(event_time, user_id)
PARTITION BY RANGE(event_time) (
  FROM ("2025-11-01") TO ("2025-12-01") INTERVAL 1 DAY
)
DISTRIBUTED BY HASH(user_id) BUCKETS 16;

Notes: DUPLICATE is the default model; choose ≤3 sort key columns to keep scans fast. (Apache Doris)

2) UNIQUE: latest customer state (UPSERT)

CREATE TABLE dim_customer_latest (
  customer_id  BIGINT      NOT NULL,
  name         STRING,
  tier         STRING,
  updated_at   DATETIME
)
UNIQUE KEY(customer_id)
PARTITION BY RANGE(updated_at) (
  FROM ("2025-11-01") TO ("2025-12-01") INTERVAL 1 DAY
)
DISTRIBUTED BY HASH(customer_id) BUCKETS 16;

All standard load methods support UPSERT for UNIQUE: if the key exists, the row is replaced; otherwise it’s inserted. Design your pipelines to be idempotent. (Apache Doris)

3) AGGREGATE: pre-aggregated metrics

CREATE TABLE metrics_orders_daily (
  order_date   DATE    NOT NULL,
  merchant_id  BIGINT  NOT NULL,
  orders       BIGINT  SUM,
  revenue_usd  DECIMAL(18,2) SUM,
  max_order    DECIMAL(18,2) MAX
)
AGGREGATE KEY(order_date, merchant_id)
PARTITION BY RANGE(order_date) (
  FROM ("2025-11-01") TO ("2025-12-01") INTERVAL 1 DAY
)
DISTRIBUTED BY HASH(merchant_id) BUCKETS 16;

Here, storage keeps aggregated values only, cutting scans and cost for dashboard queries. (Apache Doris)

Real example: choose the right model for a streaming dashboard

Scenario: You ingest Kafka orders and power a real-time revenue dashboard plus drill-downs.

Raw firehose (for data science & audits) → orders_events as DUPLICATE.
Operational views (customer’s latest tier, KYC flags) → dim_customer_latest as UNIQUE (UPSERT).
Top-N and time-bucket charts → metrics_orders_1min as AGGREGATE (or create an async MV over the raw table).
This hybrid keeps your detail and your speed without turning every query into an expensive scan. (Materialization guidance is in Doris docs.) (Apache Doris)

Best practices (do these first)

Model to the write pattern. If you don’t truly need updates, do not pick UNIQUE; DUPLICATE is cheaper and simpler. (Apache Doris)
Align partitions to SLA. Daily partitions are a sane default; only go hourly when query SLAs demand it.
Keep sort keys tight. Start with time + one high-cardinality dimension; avoid long composite keys. (Apache Doris)
Pre-aggregate intentionally. If 80% of queries hit the same rollup, make it AGGREGATE (or an MV) and let Doris rewrite queries. (Apache Doris)
Design for idempotency in UNIQUE. Since replacement order across batches isn’t guaranteed, avoid logic that depends on arrival order. (Apache Doris)

Common pitfalls (don’t learn these the hard way)

Using UNIQUE for event streams just to “dedup later.” You’ll pay more and still fight late-arriving events. Prefer DUPLICATE + downstream dedup or AGGREGATE if you only need rollups. (Apache Doris)
Over-aggregating early. If analysts need detail, AGGREGATE will frustrate them. Keep a raw DUPLICATE table (or lake table) alongside. (Apache Doris)
Huge sort keys. They slow writes and scans. Trim to what helps pruning most. (Apache Doris)
Assuming UPSERT preserves order. It doesn’t; build deterministic merges. (Apache Doris)

FAQ (quick hits)

Can I switch models later? No—model is fixed at create time. Plan ahead and keep raw detail to enable rebuilds. (Apache Doris)
Where does VeloDB differ from “vanilla” Doris? VeloDB is a managed distribution with SaaS/BYOC options; modeling semantics follow Doris. (VeloDB Docs)

Summary & call-to-action

DUPLICATE → raw, flexible exploration.
UNIQUE → latest state via UPSERT.
AGGREGATE → pre-computed rollups for speed.

Start with a single use case: model it using this rubric, load a day of data, and benchmark end-to-end (ingest → query). If latency or cost miss the mark, adjust partitioning, sort keys, or model—not just horsepower.

Internal link ideas (for your site)

“Designing Sort Keys & Partitions in Doris/VeloDB”
“Async Materialized Views: Patterns for Real-Time Rollups”
“Kafka → VeloDB: Streaming Ingestion Options”
“BYOC vs. SaaS Deployments for Real-Time Analytics”

Image prompt (for DALL·E / Midjourney)

“A clean, modern diagram comparing three table models in VeloDB/Doris—DUPLICATE, UNIQUE, AGGREGATE—with arrows showing write paths and query flows; minimalistic, high-contrast, isometric style.”

Data/ML Engineer Blog