Vector Search in ClickHouse: HNSW + Metadata Filtering with Pure SQL
Why this matters
Your team wants RAG-style search that respects user filters (region, product, role) and returns results in milliseconds. You could add a separate vector DB—or you can keep embeddings, analytics, and filtering in one place: ClickHouse. With its HNSW vector index and SQL, you get semantic search + classic filters without a new service to run.
Concept & architecture (mental model first)
Exact vs approximate search
- Exact (brute force): `ORDER BY distance(embedding, :query_vec)` scans all rows; recall is 1.0, but latency grows linearly with table size.
- Approximate (HNSW index): a graph-based structure narrows the search to likely neighbors for roughly 10–100× speedups with tunable recall. In ClickHouse this is a vector similarity index of type HNSW built over `Array(Float*)` columns.
How ClickHouse wires it
- Store embeddings in `Array(Float32)`, `Array(Float64)`, or `Array(BFloat16)` columns with a fixed dimension. Vector indexes are defined per column and support HNSW with optional quantization (e.g., bf16, f16, i8, b1). Available in v25.8+.
- Use distance functions (`cosineDistance`, `L2Distance`, etc.) in `ORDER BY ... LIMIT k`. The distance in your query must match the index's distance for the index to be used; check with `EXPLAIN indexes = 1`.
Filtering (pure SQL)
- ClickHouse evaluates vector search together with your `WHERE` filters using pre-/post-filtering, and can prune partitions and primary-key ranges first. That is how you combine ANN and metadata filters cleanly.
Schema & index: a production-ready starter
```sql
-- 1) Documents with metadata + embedding
CREATE TABLE app.docs
(
    id          UInt64,
    customer_id UInt64,
    lang        LowCardinality(String),
    created_at  DateTime('UTC'),
    title       String,
    body        String,
    -- Fixed-dimension embedding (e.g., 1536)
    embedding   Array(Float32),
    -- Enforce fixed length at write time (recommended)
    CONSTRAINT dim_ok CHECK length(embedding) = 1536,
    INDEX hnsw_vec embedding
        TYPE vector_similarity(
            'hnsw',            -- index type
            'cosineDistance',  -- distance (must match your queries)
            1536,              -- dimensions
            'bf16',            -- quantization
            32,                -- M (max connections per layer)
            512                -- ef_construction
        )
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (customer_id, created_at);
```
Notes that save you time:
- Arrays must have the exact dimension you declare; the index supports a single column of non-null, non-empty arrays (a quick smoke test follows these notes).
- The HNSW index is built on inserts and merges; you can defer index builds and materialize the index later, when the cluster is quiet.
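A minimal smoke test of the dimension check against the `app.docs` table above (the generated zero vector is only for exercising the constraint, not for real similarity scoring):

```sql
-- Passes dim_ok: arrayWithConstant builds a 1536-element Float32 array of zeros.
INSERT INTO app.docs (id, customer_id, lang, created_at, title, body, embedding)
VALUES (1, 42, 'en', now(), 'smoke test', 'dummy row', arrayWithConstant(1536, toFloat32(0)));

-- Rejected at insert time by dim_ok (length 3 <> 1536):
-- INSERT INTO app.docs (id, customer_id, lang, created_at, title, body, embedding)
-- VALUES (2, 42, 'en', now(), 'bad row', 'dummy row', [0.1, 0.2, 0.3]);
```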
Query patterns: ANN + metadata filters
Top-k semantic search (post-filter)
```sql
-- Pass the embedding as a typed query parameter (e.g., clickhouse-client --param_q="[...]")
WITH {q:Array(Float32)} AS query_vec
SELECT
    id,
    title,
    cosineDistance(embedding, query_vec) AS dist
FROM app.docs
WHERE lang = 'en'                               -- metadata filter
  AND created_at >= now() - INTERVAL 30 DAY
ORDER BY cosineDistance(embedding, query_vec)   -- matches the index's distance
LIMIT 10
SETTINGS hnsw_candidate_list_size_for_search = 512;  -- ef_search
```
- The query template that triggers the index is `ORDER BY <Distance>(column, :query) LIMIT k`; the distance function must match the index definition.
- Tune the recall/latency trade-off via `hnsw_candidate_list_size_for_search` (a.k.a. ef_search).
Verify index usage (debug)
```sql
EXPLAIN indexes = 1
SELECT ...
```
Look for a `Skip` index entry of type `vector_similarity` on your index in the output.
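For instance, applied to the filtered top-k query above (you still pass the `q` parameter when running the EXPLAIN):

```sql
EXPLAIN indexes = 1
WITH {q:Array(Float32)} AS query_vec
SELECT id, title
FROM app.docs
WHERE lang = 'en'
ORDER BY cosineDistance(embedding, query_vec)
LIMIT 10;
```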
Exact fallback (no index)
```sql
WITH {q:Array(Float32)} AS query_vec
SELECT id, title
FROM app.docs
ORDER BY L2Distance(embedding, query_vec)
LIMIT 10;
```
Distance functions are built in; choose `cosineDistance` for normalized embeddings and `L2Distance` for Euclidean geometry.
Tuning & capacity planning (the 80/20)
Index build & memory
- The index must be fully in memory for searches. The docs provide formulas to estimate memory for the vectors and the HNSW graph; run them before turning on large-scale ANN (a rough back-of-envelope follows this list).
- The default quantization is bf16; consider i8/b1 only when RAM is tight, and enable rescoring to recover accuracy.
- Cache index granules with `vector_similarity_index_cache_size` (default up to 5 GB). Size it to your index granule footprint.
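As an illustration only, here is one way to sketch that estimate in SQL. The formula below is an assumption for back-of-envelope sizing, not the docs' exact formula, so check the official guidance for your version:

```sql
-- Rough RAM estimate (assumption): vectors ≈ rows * dims * bytes_per_dim,
-- HNSW graph ≈ rows * M * 2 links * 4 bytes per link.
SELECT
    100000000                                          AS n_rows,          -- 100M rows
    1536                                               AS dims,
    2                                                  AS bytes_per_dim,   -- bf16
    32                                                 AS m,
    formatReadableSize(n_rows * dims * bytes_per_dim)  AS vectors_ram,
    formatReadableSize(n_rows * m * 2 * 4)             AS graph_ram_rough;
```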
Operational safety rails
- Ensure the server `compatibility` setting is at its default or ≥ '25.1' so the index can be used.
- ClickHouse guards against oversized fan-outs with `max_limit_for_vector_search_queries` (default 100). Keep `LIMIT` sane.
- HNSW builds can slow merges. Consider deferring with `materialize_skip_indexes_on_merge = 0` and running `ALTER TABLE ... MATERIALIZE INDEX` during off-hours (see the sketch after this list).
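A minimal sketch of that defer-then-materialize pattern on the `app.docs` table above (scheduling the quiet-hours run is left to your orchestration):

```sql
-- Stop building skip indexes (including the HNSW index) during background merges.
ALTER TABLE app.docs MODIFY SETTING materialize_skip_indexes_on_merge = 0;

-- Later, during a quiet window, build the index for already-written parts.
ALTER TABLE app.docs MATERIALIZE INDEX hnsw_vec;
```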
Filtering strategy
- Partition key filters (e.g., month) prune early → less data to consider.
- Primary key prefix filters can be evaluated before vector ranking; otherwise ClickHouse applies post-filtering, which may return fewer than k rows if many candidates fail the filters (see the sketch below).
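As a sketch against the schema above: `created_at` hits the partition key and `customer_id` is the primary-key prefix, so both can be pruned early, while `lang` is an ordinary column handled by pre-/post-filtering:

```sql
WITH {q:Array(Float32)} AS query_vec
SELECT id, title, cosineDistance(embedding, query_vec) AS dist
FROM app.docs
WHERE customer_id = {cid:UInt64}              -- primary-key prefix
  AND created_at >= toStartOfMonth(now())     -- partition pruning (current month)
  AND lang = 'en'                             -- plain metadata filter
ORDER BY cosineDistance(embedding, query_vec)
LIMIT 10;
```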
End-to-end mini workflow (repro in minutes)
- Create the table and HNSW index (see the DDL above).
- Load sample embeddings (e.g., OpenAI 1536-dim) as `Array(Float32)`; the `CHECK` constraint catches bad rows at write time.
- Query with filters: `WHERE customer_id = {cid:UInt64} AND lang = 'en'`, then `ORDER BY cosineDistance(...) LIMIT 10`.
- Verify index usage with `EXPLAIN indexes = 1`.
- Tune ef_search until you hit the latency/recall you need (a tuning sketch follows this list).
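One way to do the last two steps, as a sketch: run the same filtered query at a few ef_search values and compare its IDs against a brute-force baseline. Since the vector index is a skipping index, `use_skip_indexes = 0` forces the exact path:

```sql
-- Approximate run: vary ef_search (e.g., 64 / 256 / 512) and record latency + result IDs.
WITH {q:Array(Float32)} AS query_vec
SELECT id, cosineDistance(embedding, query_vec) AS dist
FROM app.docs
WHERE customer_id = {cid:UInt64} AND lang = 'en'
ORDER BY cosineDistance(embedding, query_vec)
LIMIT 10
SETTINGS hnsw_candidate_list_size_for_search = 64;

-- Exact baseline for the same filters: the overlap of the two ID sets approximates recall@10.
WITH {q:Array(Float32)} AS query_vec
SELECT id, cosineDistance(embedding, query_vec) AS dist
FROM app.docs
WHERE customer_id = {cid:UInt64} AND lang = 'en'
ORDER BY cosineDistance(embedding, query_vec)
LIMIT 10
SETTINGS use_skip_indexes = 0;
```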
Best practices (and the traps to avoid)
Do this
- Normalize vectors for cosine distance, and keep the distance in the index equal to the distance in your queries (a normalization sketch follows this list).
- Start with the defaults, `M=32`, `ef_construction=128`, `ef_search=256`, and measure.
- Keep `Array(Float32)` unless accuracy demands `Float64`; `BFloat16` can cut RAM while preserving quality for many models.
- Use `EXPLAIN` and the index cache metrics to confirm behavior.
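A minimal normalization sketch, assuming a hypothetical staging table `staging.docs_raw` with a `raw_embedding Array(Float32)` column and non-zero vectors:

```sql
-- Scale each embedding to unit length on its way into app.docs.
INSERT INTO app.docs (id, customer_id, lang, created_at, title, body, embedding)
SELECT
    id, customer_id, lang, created_at, title, body,
    arrayMap(x -> x / L2Norm(raw_embedding), raw_embedding) AS embedding
FROM staging.docs_raw;
```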
Avoid this
- Over-filtering: if filters don't hit the partition or primary key, ClickHouse may post-filter after ANN and return fewer than k results. Plan your keys accordingly.
- Mismatched distance (index built with L2 but query ordered by `cosineDistance`): the index won't be used.
- Building HNSW on fast-changing data; prefer immutable or append-heavy sets, or materialize the index during low-load windows.
Comparison cheat-sheet
| Aspect | Exact scan (no index) | HNSW vector index |
|---|---|---|
| Recall | 100% | Tunable (via ef_search) |
| Latency | Linear in rows | Sub-linear, typically 10–100× faster |
| RAM | Minimal | Index must fit (after quantization) |
| Writes | Fast | Slower; build cost on inserts/merges |
| Best for | Small sets, cold layers, offline reranks | Hot search, interactive UIs |
(Index behaviors, tuning knobs, and caveats per the official ClickHouse docs.)
Internal link ideas (official docs only)
- Exact & Approximate Vector Search (HNSW, syntax, filters, tuning) - ClickHouse Docs
- Distance functions (`cosineDistance`, `L2Distance`, ...) - SQL reference
- Example dataset: Hacker News vector search - quick-start SQL
Summary & next step
ClickHouse lets you embed vector search where your data already lives: define an HNSW vector similarity index on an `Array(Float32)` column, keep your metadata in ordinary columns, and query with pure SQL: `WHERE` for filters, `ORDER BY distance(...) LIMIT k` for ranking. Start with the defaults, verify with `EXPLAIN`, and only tune when measurements demand it.
Call to action: Take one production search (FAQ, product catalog, support docs), load 1–5M embeddings, and A/B exact vs HNSW with your real filters. Lock in the config that meets your latency and recall goals.
Image prompt (for DALL·E/Midjourney)
“A clean, modern diagram of a ClickHouse cluster performing vector search: MergeTree table with embedding:Array(Float32), HNSW vector index, SQL query with WHERE metadata filters, ORDER BY cosineDistance(...) LIMIT k, partition pruning, and an index cache. Minimalistic, high contrast, isometric 3D.”
Tags
#ClickHouse #VectorSearch #HNSW #ANN #cosineDistance #DataEngineering #RAG #SQL #PerformanceTuning #OLAP