Data-Skipping Indexes That Actually Help: Bloom, Token, and N-Gram Explained (with Benchmarks)


Why you should care (a quick story)

You’ve got a log table with billions of rows. Dashboards filter on status, users search error messages with LIKE, and your queries are guzzling terabytes. ClickHouse can be lightning-fast, but only if it can skip work. That’s what data-skipping indexes do: they let ClickHouse avoid reading blocks of data that cannot match your predicate. Used well, they cut scanned bytes by orders of magnitude; used blindly, they add overhead. Let’s make them work for you. (ClickHouse)


What “data-skipping index” means in ClickHouse

  • Works on MergeTree tables and stores lightweight metadata per granule (default ~8192 rows) so the engine can skip blocks during reads.
  • You define: name, expression, TYPE, GRANULARITY.
  • They don’t return row locations (not B-trees). They just answer: “Is it even worth reading this block?” (ClickHouse)

The three Bloom-based options (and when to use each)

| Index | Best for | Typical operators / functions | Notes |
| --- | --- | --- | --- |
| bloom_filter([p_false]) | Set membership on non-key columns (=, IN, rare values) | =, IN (on the indexed expression) | Simple to add; optional false-positive rate (default 0.025). (ClickHouse) |
| tokenbf_v1(bytes, k, seed) | Word-level search in strings/logs | LIKE (word-ish), =, IN, hasToken* | Splits on non-alphanumeric tokens; String/FixedString/Map. Size & hash count matter. (ClickHouse) |
| ngrambf_v1(n, bytes, k, seed) | Substring search (%foo%, languages without word breaks) | LIKE '%…%', hasSubstr* | Indexes n-grams (character windows). Tune n; supports String/FixedString/Map. (ClickHouse) |

Rule of thumb: tokenbf_v1 for word tokens (logs/tags), ngrambf_v1 for free-form substrings, bloom_filter for non-text membership checks (IDs, codes). Always test on your data shape. (ClickHouse)


Minimal, copy-paste-ready examples

1) bloom_filter on a status code

CREATE TABLE logs_raw
(
  ts       DateTime,
  status   UInt16,
  user_id  UInt64,
  msg      String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (ts, user_id);

ALTER TABLE logs_raw
  ADD INDEX ix_status status TYPE bloom_filter(0.01) GRANULARITY 2;

-- If the table already has data:
ALTER TABLE logs_raw MATERIALIZE INDEX ix_status;

The index helps when you filter on rare statuses: WHERE status IN (500, 503). (ClickHouse)

2) tokenbf_v1 for word search (case-folded)

ALTER TABLE logs_raw ADD INDEX ix_token
  lowerUTF8(msg) TYPE tokenbf_v1(65536, 3, 42) GRANULARITY 2;

-- Query examples that can use it:
SELECT * FROM logs_raw WHERE hasToken(lowerUTF8(msg), 'timeout');
SELECT * FROM logs_raw WHERE lowerUTF8(msg) LIKE '% connection %';

Token index splits by non-alphanumeric boundaries; works well for word searches in log text. Tune bytes/hash functions empirically. (ClickHouse)

3) ngrambf_v1 for substring (%…%)

ALTER TABLE logs_raw ADD INDEX ix_ngram
  msg TYPE ngrambf_v1(3, 1048576, 3, 123) GRANULARITY 2;

-- Good for substring lookups:
SELECT count() FROM logs_raw WHERE msg LIKE '%conn err%';

Pick n based on your queries: with n=3, the shortest matchable substring is 3 characters; smaller n produces more common n-grams and therefore more false positives. (ClickHouse)


“Explained with Benchmarks”: a 10-minute, reproducible micro-benchmark

Goal: show how indexes reduce read_rows/read_bytes and how many granules they drop.

  1. Create sample data (1–10M rows is fine).
  2. Run the query without the index; capture metrics.
  3. Add the index + MATERIALIZE INDEX (for existing parts).
  4. Rerun; compare metrics.
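For step 1, one quick way to generate synthetic log rows is the numbers() table function. A minimal sketch, assuming a throwaway table named bench_logs (the table name, row count, and message shape are all placeholders; mirror your real schema instead):

```sql
-- Hypothetical benchmark table; schema mirrors logs_raw above.
CREATE TABLE bench_logs
(
  ts       DateTime,
  status   UInt16,
  user_id  UInt64,
  msg      String
)
ENGINE = MergeTree
ORDER BY (ts, user_id);

-- 5M synthetic rows: ~1% rare statuses, ~0.1% messages containing a searchable token.
INSERT INTO bench_logs
SELECT
  now() - intDiv(number, 1000)                         AS ts,
  if(rand() % 100 = 0, 503, 200)                       AS status,
  rand64() % 100000                                    AS user_id,
  concat('request handled in ', toString(rand() % 500), ' ms',
         if(rand() % 1000 = 0, ' connection timeout', '')) AS msg
FROM numbers(5000000);
```

Run your target query against this table, capture read_rows/read_bytes, then add the index, MATERIALIZE INDEX, rerun, and diff the metrics.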

Measure with EXPLAIN & query log

-- See which indexes would be used:
EXPLAIN indexes=1
SELECT * FROM logs_raw WHERE hasToken(lowerUTF8(msg), 'timeout');

-- Then run and inspect the log:
SYSTEM FLUSH LOGS;
SELECT
  query_duration_ms, read_rows, read_bytes
FROM system.query_log
WHERE query ILIKE '%hasToken%' AND type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 1;

EXPLAIN indexes=1 shows parts/granules before/after and which skip index fired; system.query_log has canonical read_rows/read_bytes to verify impact. (ClickHouse)

What the official docs show: a simple skip-index example reduces work from 100M rows / 800 MB scanned to just 32,768 rows / 360 KB—because most granules were skipped. Expect different numbers on your data, but the pattern is what matters. (ClickHouse)


How to choose between bloom, token, and n-gram (with visuals in mind)

  • Think in blocks, not rows. If a value occurs at least once in a block, ClickHouse must still read that block—even if only one row matches. Skipping helps when your predicate is sparse within blocks. (ClickHouse)
  • Correlate with ORDER BY. Indexes work best when your non-key column loosely correlates with the primary key (e.g., certain error codes cluster in certain time ranges). Otherwise, almost every block contains matches and nothing is skipped. (ClickHouse)
  • Pick the right flavor:
    • Rare IDs/codes ⇒ bloom_filter.
    • Word search (logs/tags) ⇒ tokenbf_v1 on lowerUTF8(msg).
    • Substring / languages without spaces ⇒ ngrambf_v1 with a well-chosen n. (ClickHouse)

Tuning checklist (fast to apply, easy to undo)

  • Start simple: bloom_filter(0.01) on the column you filter by most, then validate with EXPLAIN indexes=1. (ClickHouse)
  • Materialize after adding to existing tables: ALTER TABLE … MATERIALIZE INDEX …. (ClickHouse)
  • Granularity: begin at 2 (i.e., index block ≈ 2 * 8192 rows); raise if lookup overhead dominates, lower if skipping is too coarse. (ClickHouse)
  • Token/N-gram sizing:
    • tokenbf_v1(bytes, k, seed) and ngrambf_v1(n, bytes, k, seed) require sizing; bytes ↑ → fewer false positives, more disk. Use the docs’ guidance/calculator to estimate. (ClickHouse)
  • Verify impact: track read_rows, read_bytes, and granules dropped in EXPLAIN. If numbers don’t move, the index isn’t helping. (ClickHouse)
  • Know when to bail: if almost every block matches, the index adds overhead—fix data layout (ORDER BY) or use materialized views/projections instead. (ClickHouse)
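To put numbers on the sizing bullet above, the standard Bloom-filter formulas (bits m = -n·ln p / (ln 2)², hash count k = (m/n)·ln 2) can be evaluated directly in ClickHouse. A sketch, assuming roughly 1M distinct tokens per index block and a 1% target false-positive rate; both inputs are made up and should be replaced with measurements from your data:

```sql
WITH
    1000000 AS n,  -- assumed distinct tokens per index block
    0.01    AS p   -- assumed target false-positive probability
SELECT
    ceil(-n * log(p) / pow(log(2), 2)) AS filter_bits,
    ceil(filter_bits / 8)              AS filter_bytes,  -- the `bytes` argument
    round(filter_bits / n * log(2))    AS hash_functions -- the `k` argument
```

With these inputs the formulas work out to roughly 1.2 MB per index block and k ≈ 7; tighten p or shrink n and the size drops accordingly.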

Common pitfalls (and fixes)

| Pitfall | Why it hurts | Fix |
| --- | --- | --- |
| Adding skip indexes before fixing ORDER BY | Poor clustering ⇒ few blocks skipped | Align ORDER BY with common predicates; then index what's left |
| Expecting LIKE '%x%' to be fast with tokenbf_v1 | Token index is word-based | Use ngrambf_v1 (substring) or rewrite the query to whole tokens |
| Granularity too high | Cheap lookup, but too coarse to skip | Lower GRANULARITY (1–4) and re-test |
| Wrong casing | Case mismatch ⇒ index not used | Create the index on lowerUTF8(col) and query with lowerUTF8() |
| Forgetting to MATERIALIZE | Old parts lack the index | ALTER TABLE … MATERIALIZE INDEX … |
| Measuring by latency only | OS cache and concurrency hide effects | Compare read_rows/read_bytes and granules dropped consistently (ClickHouse) |

Quick reference: when each wins

  • bloom_filter: WHERE error_code IN (502, 503), WHERE country = 'IS' ⇒ rare values on non-key columns.
  • tokenbf_v1: logs/tags: hasToken(lowerUTF8(msg), 'timeout'), word-boundary patterns like '% timeout %'.
  • ngrambf_v1: free text / partial matches: LIKE '%conn err%', languages without spaces. (ClickHouse)

Internal link ideas (official docs)

  • Understanding data-skipping indexes (concepts, types, best practices). (ClickHouse)
  • Skip index examples (syntax for bloom/set/minmax/token/n-gram + verification). (ClickHouse)
  • EXPLAIN … indexes=1 reference (see parts/granules before/after). (ClickHouse)
  • system.query_log fields (read_rows, read_bytes, duration). (ClickHouse)
  • “Use data-skipping indices where appropriate” (pragmatic guidance). (ClickHouse)

Summary

Skip indexes are not magic; they’re pruning hints. Use them after you’ve picked a good ORDER BY. Start with a simple bloom_filter for rare values, then add tokenbf_v1 (tokens) or ngrambf_v1 (substrings) where text search is hot. Prove value with EXPLAIN and system.query_log. If scanned bytes don’t drop, rethink the approach. (ClickHouse)

Call to action: Pick one painful text filter today. Add a single bloom_filter or ngrambf_v1, materialize, and measure read_rows before/after. Keep what pays rent—drop what doesn’t.


Image prompt

“A clean, modern diagram showing ClickHouse MergeTree blocks and skip indexes dropping granules: three panels for bloom_filter, tokenbf_v1, and ngrambf_v1; highlight pruned granules vs scanned; minimalistic, high contrast, isometric 3D.”

Tags

#ClickHouse #SkippingIndex #BloomFilter #tokenbf #ngrambf #Performance #DataEngineering #OLAP #Logs #Benchmarks
