Vector Search in ClickHouse: HNSW + Metadata Filtering with Pure SQL


Why this matters

Your team wants RAG-style search that respects user filters (region, product, role) and returns results in milliseconds. You could add a separate vector DB—or you can keep embeddings, analytics, and filtering in one place: ClickHouse. With its HNSW vector index and SQL, you get semantic search + classic filters without a new service to run.


Concept & architecture (mental model first)

Exact vs approximate search

  • Exact (brute force): ORDER BY distance(embedding, query_vec) scans all rows; recall is 1.0, but latency grows linearly with table size.
  • Approximate (HNSW index): a graph-based structure narrows the search to likely neighbors for 10–100× speedups with tunable recall. In ClickHouse this is a vector similarity index of type HNSW built over Array(Float*) columns.

How ClickHouse wires it

  • Store embeddings in Array(Float32 | Float64 | BFloat16) with a fixed dimension. Vector indexes are defined per column and support HNSW with optional quantization (e.g., bf16, f16, i8, b1). Available in v25.8+.
  • Use distance functions (cosineDistance, L2Distance, etc.) in ORDER BY with LIMIT k. The distance function in your query must match the one in the index definition, or the index is skipped. Verify with EXPLAIN indexes = 1.

Filtering (pure SQL)

  • ClickHouse evaluates vector search together with your WHERE filters using pre-/post-filtering, and can prune partitions and primary-key ranges first. That is how ANN and metadata filters combine cleanly.

Schema & index: a production-ready starter

-- 1) Documents with metadata + embedding
CREATE TABLE app.docs
(
  id           UInt64,
  customer_id  UInt64,
  lang         LowCardinality(String),
  created_at   DateTime('UTC'),
  title        String,
  body         String,
  -- Fixed-dimension embedding (e.g., 1536)
  embedding    Array(Float32),

  -- Enforce fixed length at write time (recommended)
  CONSTRAINT dim_ok CHECK length(embedding) = 1536,

  INDEX hnsw_vec embedding
    TYPE vector_similarity('hnsw', 'cosineDistance', 1536, 'bf16', 32, 512)
    --          ^type           ^distance            ^dims  ^quant ^M   ^ef_construction
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (customer_id, created_at);

Notes that save you time:

  • Arrays must have exactly the dimension you declare; the index supports a single column of non-null, non-empty arrays.
  • The HNSW index is built on inserts and merges; you can defer the build and materialize the index later, when the cluster is quiet (see the sketch under Operational safety rails below).

Query patterns: ANN + metadata filters

Top-k semantic search (post-filter)

WITH {q:Array(Float32)} AS query_vec  -- bind the query embedding as a query parameter
SELECT id, title,
       cosineDistance(embedding, query_vec) AS dist
FROM app.docs
WHERE lang = 'en'          -- metadata filter
  AND created_at >= now() - INTERVAL 30 DAY
ORDER BY dist
LIMIT 10
SETTINGS hnsw_candidate_list_size_for_search = 512;  -- ef_search
  • The query shape that triggers the index is ORDER BY <Distance>(column, query_vector) LIMIT k; the distance function must match the index definition.
  • Tune the recall/latency trade-off via hnsw_candidate_list_size_for_search (a.k.a. ef_search).

Guarantee index usage (debug)

EXPLAIN indexes = 1
SELECT id, title
FROM app.docs
WHERE lang = 'en'
ORDER BY cosineDistance(embedding, {q:Array(Float32)})
LIMIT 10;

In the output, look for a skip-index entry of type vector_similarity on your index.

Exact fallback (no index)

WITH {q:Array(Float32)} AS query_vec
SELECT id, title
FROM app.docs
ORDER BY L2Distance(embedding, query_vec)
LIMIT 10;

Distance functions are built-in; use cosineDistance when your embedding model targets cosine similarity and L2Distance for Euclidean space. For L2-normalized vectors the two produce the same ranking.
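A quick worked check of that equivalence (pure arithmetic: for unit vectors, L2Distance² = 2 × cosineDistance, so the two orderings agree):

SELECT cosineDistance([0.6, 0.8], [1.0, 0.0]) AS cos_d,  -- 1 - 0.6 = 0.4
       L2Distance([0.6, 0.8], [1.0, 0.0])     AS l2_d;   -- sqrt(0.8) ≈ 0.894 = sqrt(2 × cos_d)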


Tuning & capacity planning (the 80/20)

Index build & memory

  • The index must be fully in memory for searches. The docs provide formulas to estimate memory for the vectors and the HNSW graph; run them before enabling large-scale ANN (see the back-of-envelope sketch after this list).
  • The default quantization is bf16; consider i8/b1 only when RAM is tight, and enable rescoring to recover accuracy.
  • Cache index granules with vector_similarity_index_cache_size (default 5 GB). Size it to your index granule footprint.
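As a rough back-of-envelope before applying the docs' formulas (the figures assume a hypothetical 5M rows of 1536-dim embeddings and ignore the HNSW graph itself, so treat them as a floor):

-- Raw Float32 vectors: rows × dims × 4 bytes; a bf16-quantized index stores 2 bytes/value
SELECT formatReadableSize(5000000 * 1536 * 4) AS raw_float32_vectors,  -- ≈ 28.6 GiB
       formatReadableSize(5000000 * 1536 * 2) AS bf16_index_values;    -- ≈ 14.3 GiB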

Operational safety rails

  • Ensure the server's compatibility setting is at the default or ≥ '25.1'; otherwise the index is not used.
  • ClickHouse guards against oversized fan-outs with max_limit_for_vector_search_queries (default 100). Keep LIMIT modest.
  • HNSW builds can slow merges. Consider deferring them with materialize_skip_indexes_on_merge = 0 plus ALTER ... MATERIALIZE INDEX during off-hours, as sketched below.
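A minimal sketch of that deferred-build pattern, reusing the app.docs table and hnsw_vec index from the DDL above (this assumes materialize_skip_indexes_on_merge is applied as a table-level MergeTree setting; check your version's docs):

-- Stop merges from rebuilding skip indexes so bulk loads stay cheap
ALTER TABLE app.docs MODIFY SETTING materialize_skip_indexes_on_merge = 0;

-- ... bulk-insert embeddings here ...

-- Build the HNSW index explicitly during off-hours
ALTER TABLE app.docs MATERIALIZE INDEX hnsw_vec;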

Filtering strategy

  • Partition-key filters (e.g., on month) prune early, so less data is considered.
  • Primary-key prefix filters can be evaluated before vector ranking; otherwise ClickHouse applies post-filtering, which may return fewer than k rows if many candidates fail the filters (see the example below).
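For example, with ORDER BY (customer_id, created_at) from the DDL above, a tenant-scoped query (customer_id = 42 is illustrative) lets ClickHouse prune before ranking:

SELECT id, title,
       cosineDistance(embedding, {q:Array(Float32)}) AS dist
FROM app.docs
WHERE customer_id = 42                      -- primary-key prefix → range pruning
  AND created_at >= toStartOfMonth(now())   -- partition key → partition pruning
ORDER BY dist
LIMIT 10;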

End-to-end mini workflow (repro in minutes)

  1. Create table and HNSW index (see DDL above).
  2. Load sample embeddings (e.g., OpenAI 1536-dim) as Array(Float32); the CHECK constraint catches malformed rows (see the insert sketch after this list).
  3. Query with filters: WHERE customer_id = :cid AND lang = 'en' then ORDER BY cosineDistance(...) LIMIT 10.
  4. Verify usage with EXPLAIN indexes = 1.
  5. Tune ef_search for the latency/recall you need.
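A minimal sketch of step 2 with one synthetic row (arrayWithConstant stands in for a real model embedding; an all-zero vector is only useful for validating the schema and CHECK constraint, not for cosine ranking):

INSERT INTO app.docs (id, customer_id, lang, created_at, title, body, embedding)
SELECT 1, 42, 'en', now(), 'Hello vectors', 'Demo body',
       arrayWithConstant(1536, 0.0);  -- exactly 1536 elements, so dim_ok passes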

Best practices (and the traps to avoid)

Do this

  • Normalize vectors for cosine distance (see the sketch after this list); keep the distance in the index identical to the distance in queries.
  • Start with the defaults (M=32, ef_construction=128, ef_search=256) and measure before tuning.
  • Keep Array(Float32) unless accuracy demands Float64; BFloat16 can cut RAM while preserving quality for many models.
  • Use EXPLAIN and the index cache metrics to confirm behavior.
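If your embedding model does not emit unit vectors, one option is to normalize at write time with the built-in L2Normalize (the staging table in the comment is hypothetical):

-- L2Normalize scales a vector to unit length, e.g. [3, 4] → [0.6, 0.8]
SELECT L2Normalize([3.0, 4.0]) AS unit_vec;

-- Sketch: store normalized embeddings on ingest
-- INSERT INTO app.docs (..., embedding) SELECT ..., L2Normalize(raw_embedding) FROM staging;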

Avoid this

  • Over-filtering first: if filters don't hit the partition or primary key, ClickHouse may post-filter after the ANN step, shrinking results below k. Plan your keys accordingly.
  • Mismatched distances (index built on L2 but query ordered by cosineDistance): the index won't be used (see the anti-pattern sketch below).
  • Building HNSW on fast-changing data; prefer immutable or append-heavy sets, or materialize during low load.
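To see the distance-mismatch trap concretely, here is the anti-pattern run against the cosine index defined above; the EXPLAIN output will show no vector_similarity skip-index entry for it:

-- Anti-pattern: index declared with cosineDistance, query ordered by L2Distance
EXPLAIN indexes = 1
SELECT id
FROM app.docs
ORDER BY L2Distance(embedding, {q:Array(Float32)})  -- mismatch → brute-force scan
LIMIT 10;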

Comparison cheat-sheet

Aspect    | Exact scan (no index)                     | HNSW vector index
Recall    | 100%                                      | Tunable (via ef_search)
Latency   | Linear in rows                            | Sub-linear, typically 10–100× faster
RAM       | Minimal                                   | Index must fit in memory (after quantization)
Writes    | Fast                                      | Slower; build cost on inserts/merges
Best for  | Small sets, cold layers, offline reranks  | Hot search, interactive UIs

(Index behaviors, tuning knobs, and caveats per the official docs.)


Internal link ideas (official docs only)

  • Exact & Approximate Vector Search (HNSW, syntax, filters, tuning) — ClickHouse Docs.
  • Distance functions (cosineDistance, L2Distance, …) — SQL reference.
  • Example dataset: Hacker News vector search — quick start SQL.

Summary & next step

ClickHouse lets you embed vector search where your data already lives: define an HNSW vector similarity index on Array(Float32), keep your metadata in ordinary columns, and query with pure SQL, using WHERE for filters and ORDER BY distance(...) LIMIT k for ranking. Start with the defaults, verify with EXPLAIN, and only tune when measurements demand it.

Call to action: Take one production search (FAQ, product catalog, support docs), load 1–5M embeddings, and A/B exact vs HNSW with your real filters. Lock in the config that meets your latency and recall goals.


Image prompt (for DALL·E/Midjourney)

“A clean, modern diagram of a ClickHouse cluster performing vector search: MergeTree table with embedding:Array(Float32), HNSW vector index, SQL query with WHERE metadata filters, ORDER BY cosineDistance(...) LIMIT k, partition pruning, and an index cache. Minimalistic, high contrast, isometric 3D.”

Tags

#ClickHouse #VectorSearch #HNSW #ANN #cosineDistance #DataEngineering #RAG #SQL #PerformanceTuning #OLAP
