Designing Partitions & Buckets in Apache Doris

Designing Partitions & Buckets in Apache Doris: Rules of Thumb to Auto Partition for Time-Series

Meta description (157 chars):
A practical guide to partitioning and bucketing in Apache Doris. Learn manual vs dynamic vs auto partitioning, hash vs random buckets, and time-series patterns.


Why this matters

You want sub-second analytics on growing time-series data without burning the cluster. In Doris, the fastest path is simple: prune aggressively with partitions, spread work with buckets, keep tablets healthy. Get those three right and your queries fly; get them wrong and you’ll create hot shards, slow scans, and compaction storms.


The mental model: Partition → Bucket → Tablet

Think of a table like shelves in a warehouse:

  • Partitions = aisles (usually by time or tenant)
  • Buckets = bins in each aisle (hash or random)
  • Tablets = the physical files that Doris reads/writes

Each bucket materializes as a tablet, so the total tablets ≈ partitions × buckets—your main lever for parallelism and concurrency. Keep tablet size sane to avoid metadata bloat and painful moves. (Apache Doris)


Partitioning strategies

1) Manual partitions (predictable ranges or lists)

Use when you know your ranges (e.g., monthly partitions) or need LIST partitions (like country IN (...)).

  • Partition columns must be KEY columns; RANGE supports date/time and integers.
  • Default soft limit: 4096 partitions per table (adjustable via FE config).
  • If you don’t specify partitions, Doris creates a hidden full-range partition. (Apache Doris)

Example (monthly RANGE):

CREATE TABLE events (
  event_time DATETIME,
  user_id BIGINT,
  action VARCHAR(32)
)
DUPLICATE KEY(event_time, user_id)
PARTITION BY RANGE(event_time) (
  PARTITION p2025_11 VALUES LESS THAN ("2025-12-01"),
  PARTITION p2025_12 VALUES LESS THAN ("2026-01-01")
);

Manual partitions are perfect for stable calendars and when governance wants explicit ranges. (Apache Doris)


2) Dynamic partitioning (rolling TTL for time-series)

Use Dynamic Partitioning to auto-create upcoming partitions and drop expired ones based on a time unit (HOUR/DAY/WEEK/MONTH/YEAR). It’s ideal for hot-window analytics with lifecycle management. Configure via dynamic_partition.* properties on RANGE partitions. (Apache Doris)

Example (keep yesterday → +2 days, daily):

CREATE TABLE logs (
  event_date DATE,
  user_id BIGINT,
  payload JSON
)
DUPLICATE KEY(event_date, user_id)
PARTITION BY RANGE(event_date) ()
DISTRIBUTED BY HASH(user_id) BUCKETS 16
PROPERTIES (
  "dynamic_partition.enable" = "true",
  "dynamic_partition.time_unit" = "DAY",
  "dynamic_partition.start" = "-1",
  "dynamic_partition.end"   = "2",
  "dynamic_partition.prefix" = "p",
  "dynamic_partition.create_history_partition" = "true"
);

This keeps a sliding set of partitions fresh while deleting old data automatically. (Apache Doris)


3) Auto Partition (when values are unpredictable)

When you cannot enumerate partition values (late-arriving historical dates, irregular keys, multi-tenant IDs), enable Auto Partition. Doris will create missing partitions on ingest. You can choose AUTO RANGE (time-aware with date_trunc) or AUTO LIST. (Apache Doris)

AUTO RANGE (by month):

CREATE TABLE fact_orders (
  order_ts DATETIMEV2, order_id BIGINT, user_id BIGINT
)
DUPLICATE KEY(order_ts, order_id)
AUTO PARTITION BY RANGE (date_trunc(order_ts, 'month')) ()
DISTRIBUTED BY HASH(user_id) BUCKETS 16;

AUTO LIST (categorical keys):

CREATE TABLE clicks_by_tenant (
  tenant STRING, event_date DATE, uid BIGINT
)
DUPLICATE KEY(tenant, event_date, uid)
AUTO PARTITION BY LIST(tenant) ()
DISTRIBUTED BY HASH(uid) BUCKETS 16;

Use Auto Partition when data decides the partitions, not you. Constraints: AUTO RANGE currently supports date_trunc on DATE/DATETIME; AUTO LIST builds a partition per new value. (Apache Doris)


Bucketing strategies

Hash vs Random bucketing

  • Hash Bucketing: choose one or more bucket keys; equality filters/pruning get faster; great for joins on the bucket key.
  • Random Bucketing: Doris scatters batches randomly to avoid skew when no good key exists; DUPLICATE tables only; no bucket-key pruning. (Apache Doris)

Hash example:

... DISTRIBUTED BY HASH(user_id) BUCKETS 32;

Random example (DUPLICATE tables only):

... DISTRIBUTED BY RANDOM BUCKETS 32;

Prefer Hash when you filter/join by a high-cardinality column; prefer Random for ad-hoc workloads with skewed keys. (Apache Doris)


How many buckets?

Two practical levers:

  1. Tablet size: target ~1–10 GB per tablet (too small → overhead; too large → slow moves/retries).
  2. Count: having slightly more tablets than total disks helps spread IO. (Apache Doris)

In DDL, you set bucket count explicitly, or let Doris estimate with BUCKETS AUTO and (optionally) estimate_partition_size (defaults to 10 GB if not set). (Apache Doris)

Examples:

-- Fixed
DISTRIBUTED BY HASH(user_id) BUCKETS 24;

-- Automatic with estimate
DISTRIBUTED BY HASH(user_id) BUCKETS AUTO
PROPERTIES("estimate_partition_size"="20G");

Limitations: You can only change bucket numbers for newly added partitions (not existing ones), and you cannot change the bucketing type or bucket key after creation. Plan ahead. (Apache Doris)


Time-series recipe: from raw events to fast queries

  1. Choose the partitioning mode
    • Predictable calendar & hard SLAs? Manual RANGE (months/days).
    • Sliding window with TTL? Dynamic Partition.
    • Irregular historical loads or unknown tenants? Auto Partition. (Apache Doris)
  2. Pick buckets
    • Start with HASH(user_id) for user-centric joins/filters.
    • If keys are skewed or unknown, use RANDOM on DUPLICATE tables.
    • Begin with BUCKETS AUTO if you don’t know sizes; revisit once you see tablet sizes. (Apache Doris)
  3. DDL template (daily, dynamic partition + auto buckets)
CREATE TABLE ts_events (
  event_dt DATE,
  event_ts DATETIMEV2,
  user_id BIGINT,
  metric DOUBLE
)
DUPLICATE KEY(event_dt, user_id)
PARTITION BY RANGE(event_dt) ()
DISTRIBUTED BY HASH(user_id) BUCKETS AUTO
PROPERTIES (
  "estimate_partition_size" = "20G",
  "dynamic_partition.enable" = "true",
  "dynamic_partition.time_unit" = "DAY",
  "dynamic_partition.start" = "-7",
  "dynamic_partition.end"   = "3",
  "dynamic_partition.prefix" = "p",
  "dynamic_partition.create_history_partition" = "true"
);

Operations you’ll actually use

  • Inspect partitions: SHOW PARTITIONS FROM db.table; or the PARTITIONS() table-valued function for quick lists. (Apache Doris)
  • Scale new partitions: override bucket count on added partitions: ALTER TABLE ts_events ADD PARTITION p2025_12 VALUES LESS THAN ("2026-01-01") DISTRIBUTED BY HASH(user_id) BUCKETS 32; ``` :contentReference[oaicite:15]{index=15}
  • Speed up joins: align bucket keys and counts, and set PROPERTIES("colocate_with"="orders_group") to co-locate heavy join tables (local joins, no shuffle). (Apache Doris)

Rules of thumb (and pitfalls to avoid)

  • Partition by what you filter most (usually time). Avoid per-row or per-user partitions. Respect the default 4096-partition cap unless you’ve planned metadata/ops. (Apache Doris)
  • Bucket by high-cardinality join/filter keys; if skewed/unpredictable, RANDOM on DUPLICATE tables is safer. (Apache Doris)
  • Start with BUCKETS AUTO; confirm tablets land in the 1–10 GB window; adjust later for new partitions. (Apache Doris)
  • You can’t retrofit bucketing on existing partitions (type/key/count). Decide before you bulk-load. (Apache Doris)
  • Co-locate star/galaxy schemas to avoid distributed joins when possible. (Apache Doris)

Conclusion & takeaways

  • Partition to prune, bucket to parallelize, size tablets to stabilize.
  • Pick Manual for predictable calendars, Dynamic for rolling TTL windows, Auto Partition when data is messy and late.
  • Use HASH when you have a solid high-cardinality key; use RANDOM to defuse skew.
  • Start simple, measure, then refine new partitions.

Call to action: Pick one hot time-series table, enable Dynamic or Auto Partition, switch to BUCKETS AUTO, and watch scan time and compaction stabilize over a week.


Internal link ideas (official docs)

  • Manual partitioning (RANGE/LIST, limits, examples). (Apache Doris)
  • Dynamic Partitioning (sliding windows, TTL, properties). (Apache Doris)
  • Auto Partition (AUTO RANGE/LIST, ingest-time creation). (Apache Doris)
  • Data Bucketing (HASH vs RANDOM, BUCKETS AUTO, tablet sizing). (Apache Doris)
  • Colocation Join (properties and tuning). (Apache Doris)
  • PARTITIONS() table function / SHOW PARTITIONS. (Apache Doris)

Image prompt (for your designer/AI tool)

“A clean, modern architecture diagram of an Apache Doris table showing partitions by month, each split into hash or random buckets, expanding to tablets (1–10 GB), with insets for Dynamic Partition (rolling TTL) and Auto Partition (ingest-time creation). Minimalistic, high contrast, 3D isometric style.”


Tags

#ApacheDoris #DataPartitioning #Bucketing #TimeSeries #OLAP #DataEngineering #Scalability #QueryPerformance #BigData #Architecture

Leave a Reply

Your email address will not be published. Required fields are marked *