ClickHouse Keeper in Production: Topologies, Failover, and Disaster Drills
Why this matters (quick story)
Your ClickHouse cluster hums—until a routine deploy spikes load, a node blips, and schema changes stall. Dashboards still read, but writes and ON CLUSTER DDL hang. The culprit isn’t storage or CPU. It’s coordination. ClickHouse Keeper is the heartbeat that makes replication and distributed DDL reliable—if you set it up like an SRE, not a sidecar. This piece shows exactly how.
Keeper in one minute
- What it is: ClickHouse’s built-in coordination service; a ZooKeeper-compatible replacement that runs RAFT for consensus. It coordinates replicated tables and distributed DDL.
- Guarantees: Linearizable writes; reads are ZooKeeper-like by default (non-linearizable).
- How it runs: stand-alone service or embedded inside clickhouse-server; both use the same <keeper_server> config block.
Production topologies that won’t page you at 2am
Baseline pattern (recommended)
- Dedicated Keeper hosts (not co-located with ClickHouse servers) to isolate I/O and latency. The docs explicitly recommend this for production.
- Odd-node RAFT quorum (commonly 3 to start). Keeper uses RAFT majority, and ClickHouse’s reference “Replication + Scaling” example shows two shards × two replicas coordinated by a 3-node Keeper.
- Modest sizing: Keeper nodes are light; ~4 GB RAM is generally enough until clusters grow large.
- Separate disks (or at least separate volumes) from ClickHouse data; coordination is sensitive to disk latency.
3-node vs 5-node (quick compare)
| Cluster | Tolerates | When to choose |
|---|---|---|
| 3-node | 1 node down | Most prod clusters; simplest ops; lowest overhead (RAFT majority). |
| 5-node | 2 nodes down | Stricter SLAs, multi-AZ, and maintenance headroom; more ops cost. |
Mental model: Keeper is your control plane. Treat it like etcd in Kubernetes—small, redundant, boring, and protected.
A clean, minimal Keeper config
Keeper cluster (stand-alone), 3 nodes:
<!-- /etc/clickhouse-keeper/keeper_config.xml (example) -->
<clickhouse>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id> <!-- unique per node -->
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
        <raft_configuration>
            <server><id>1</id><hostname>keeper-1</hostname><port>9444</port></server>
            <server><id>2</id><hostname>keeper-2</hostname><port>9444</port></server>
            <server><id>3</id><hostname>keeper-3</hostname><port>9444</port></server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
Settings and the <keeper_server> fundamentals are covered in the ClickHouse Keeper configuration reference.
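Timeouts and RAFT logging live under a <coordination_settings> block inside <keeper_server>; a minimal tuning sketch with illustrative values (check the settings reference for the exact names and defaults in your version):
<!-- inside <keeper_server>; values are illustrative, not recommendations -->
<coordination_settings>
    <operation_timeout_ms>10000</operation_timeout_ms>
    <session_timeout_ms>30000</session_timeout_ms>
    <raft_logs_level>information</raft_logs_level>
</coordination_settings>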
Point ClickHouse servers at Keeper:
<!-- /etc/clickhouse-server/config.d/keeper.xml -->
<clickhouse>
    <zookeeper> <!-- Keeper is ZooKeeper-compatible -->
        <node><host>keeper-1</host><port>9181</port></node>
        <node><host>keeper-2</host><port>9181</port></node>
        <node><host>keeper-3</host><port>9181</port></node>
    </zookeeper>
</clickhouse>
That’s the pattern shown in ClickHouse’s replicated deployment examples.
What Keeper actually coordinates
- Replicated*MergeTree metadata and queues (/clickhouse/tables/...): Keeper stores replica state and replication tasks.
- Distributed DDL (ON CLUSTER): tasks are queued via Keeper so every node applies the change.
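You can browse that metadata directly from SQL; a quick sketch (the path assumes the common /clickhouse/tables layout, so adjust it to your own naming):
-- List the znodes Keeper holds for replicated tables (path is an assumption)
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables'
LIMIT 20;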
Observability: fast checks you’ll use daily
Run these from clickhouse-client:
-- Keeper connectivity / browse
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 5; -- exists only if ZK/Keeper configured; a path filter is required
-- Replication health per table on the local node
SELECT database, table, is_leader, is_session_expired, absolute_delay
FROM system.replicas;
-- What’s stuck in replication?
SELECT database, table, replica_name, position, source_replica, required_quorum
FROM system.replication_queue
ORDER BY position
LIMIT 20;
-- Keeper calls log (latency/errors)
SELECT * FROM system.zookeeper_log ORDER BY event_time DESC LIMIT 20;
Docs for these system tables: system.zookeeper, system.replicas, system.replication_queue, system.zookeeper_log.
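For a daily glance at Keeper health, a rough aggregation over system.zookeeper_log is often enough; a sketch (verify column names against your server version):
-- Count Keeper requests/responses per operation and error over the last hour
SELECT type, op_num, error, count() AS cnt
FROM system.zookeeper_log
WHERE event_time > now() - INTERVAL 1 HOUR
GROUP BY type, op_num, error
ORDER BY cnt DESC
LIMIT 20;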
CLI for Keeper:
You can also poke Keeper directly:
# List znodes at root
clickhouse-keeper-client -h keeper-1 -p 9181 --query "ls /"
# Create a test znode
clickhouse-keeper-client -h keeper-1 -p 9181 --query "create /ops_sanity 'ok'"
See clickhouse-keeper-client utility docs.
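Keeper also answers ZooKeeper-style four-letter-word commands on the same client port, which helps when clickhouse-keeper-client isn’t installed on a box; a sketch using nc (hostname and port are placeholders):
# Liveness: should print "imok"
echo ruok | nc keeper-1 9181
# Brief status, including whether this node is the leader or a follower
echo stat | nc keeper-1 9181
# Monitoring metrics, one per line (feed these to your metrics pipeline)
echo mntr | nc keeper-1 9181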
Failover behavior (what actually happens)
- Leader loss: RAFT elects a new leader from the remaining nodes (majority required). Writes that go through Keeper pause during the election; ClickHouse reads continue to be served from local parts.
- Keeper unavailable: Metadata operations for replicated tables and ON CLUSTER DDL fail or queue; servers keep retrying the connection. The documentation notes that replicated tables attempt to reconnect and go read-only while Keeper is unreachable.
Translation: Losing quorum means no changes to replicated metadata. Your queries over existing local parts still run; synchronization resumes when Keeper recovers.
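A quick way to see this state from the ClickHouse side is system.replicas, which flags tables whose Keeper session has been lost; a small sketch:
-- Replicated tables that are currently read-only or lost their Keeper session
SELECT database, table, is_readonly, is_session_expired
FROM system.replicas
WHERE is_readonly OR is_session_expired;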
Disaster drills (copy/paste runbooks)
Pro tip: rehearse these in a staging cluster that mirrors prod topology.
1) Leader loss (single node)
Goal: Verify election and that replication/DDL recover.
- Stop the current leader Keeper (e.g., systemctl stop clickhouse-keeper on keeper-1).
- Watch the election stabilize: echo stat | nc keeper-2 9181 should soon report a leader/follower role on the survivors (or keep listing znodes with clickhouse-keeper-client).
- On a ClickHouse server, run:
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1; -- should succeed
SELECT database, table, is_leader, absolute_delay FROM system.replicas;
- Verify replication queues advance again.
2) Follower loss (no quorum loss)
Goal: Confirm the cluster tolerates one node down (3-node cluster).
- Stop any non-leader Keeper.
- Insert data into a replicated table; confirm new parts appear on other replicas and system.replication_queue stays small.
3) Quorum loss (majority down) — the scary one
Goal: Validate “safe stop” behavior.
- Stop 2 of 3 Keeper nodes.
- Attempt CREATE TABLE ... ON CLUSTER or insert into a ReplicatedMergeTree table. Expect errors or queuing; observe retries.
- Restore quorum; verify:
SYSTEM SYNC REPLICA db.tbl; -- wait until caught up
SELECT * FROM system.replication_queue WHERE database = 'db' AND table = 'tbl';
Docs: SYSTEM statements and the replication queue.
4) Replica rebuild (node replacement)
Goal: Practice a clean re-join.
- From a surviving replica, drop the dead replica’s metadata (the query cannot be run on the replica being removed):
SYSTEM DROP REPLICA 'replica_name' FROM TABLE db.tbl; -- removes the replica’s metadata path in Keeper
- Recreate the table on the new node with Replicated*MergeTree and the same Keeper path/macros; it will fetch parts from the surviving replicas.
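Once the replacement replica is recreated, a quick sanity check before declaring victory; a sketch (db.tbl is a placeholder):
-- Confirm the rebuilt replica is active and its queue is draining
SELECT replica_name, total_replicas, active_replicas, queue_size, absolute_delay
FROM system.replicas
WHERE database = 'db' AND table = 'tbl';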
Backup & recovery notes (Keeper itself)
- Keeper maintains snapshots and logs under the configured paths; these are not format-compatible with ZooKeeper (the clickhouse-keeper-converter utility exists for one-time ZK→Keeper migration). Ensure these directories are on reliable storage and included in backups.
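If you are migrating from ZooKeeper rather than restoring Keeper’s own files, the bundled clickhouse-keeper-converter does the one-time conversion; a sketch with placeholder paths (verify the flags against the migration docs for your version):
# Run with ZooKeeper stopped so its logs and snapshots are consistent
clickhouse-keeper-converter \
  --zookeeper-logs-dir /var/lib/zookeeper/version-2 \
  --zookeeper-snapshots-dir /var/lib/zookeeper/version-2 \
  --output-dir /var/lib/clickhouse/coordination/snapshots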
Hardening & ops checklist
- Topology: odd-node quorum; start with 3 dedicated Keeper nodes.
- Isolation: keep Keeper separate from ClickHouse servers, or at least on separate disks.
- Sizing: begin around 4 GB RAM per Keeper; monitor and scale as your cluster grows.
- Monitoring: alert on system.zookeeper_log errors, rising system.replication_queue length, or is_session_expired in system.replicas.
- DDL discipline: use ON CLUSTER and track task status; don’t hand-edit schema on a single node.
Real-world snippets you’ll actually use
Create a replicated table (macros used for unique Keeper paths):
CREATE TABLE sales.events ON CLUSTER prod_cluster
(
ts DateTime,
user_id UInt64,
amount Float32
)
ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/sales.events', -- keeper path
'{replica}' -- replica name
)
ORDER BY (ts, user_id);
Replication parameters and Keeper paths are documented; prefer the /clickhouse/tables/{shard}/... pattern.
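The {shard} and {replica} placeholders resolve from per-host macros in the server config; a minimal sketch (values are placeholders and must differ per host):
<!-- /etc/clickhouse-server/config.d/macros.xml (example values) -->
<clickhouse>
    <macros>
        <shard>01</shard>
        <replica>replica-1</replica>
    </macros>
</clickhouse>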
Distributed DDL example:
ALTER TABLE sales.events ON CLUSTER prod_cluster
ADD COLUMN payment_method LowCardinality(String);
ON CLUSTER fans the statement out to all nodes via Keeper’s DDL queue.
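To watch an ON CLUSTER statement land on each host, query system.distributed_ddl_queue; a sketch (column names are worth confirming on your version):
-- Recent distributed DDL tasks and their per-host status
SELECT entry, host, status, exception_code
FROM system.distributed_ddl_queue
ORDER BY entry DESC
LIMIT 20;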
Common pitfalls (and how to avoid them)
| Pitfall | Symptom | Fix |
|---|---|---|
| Co-locating Keeper with hot ClickHouse storage | Keeper timeouts, flapping sessions | Move Keeper to dedicated hosts or separate disks; it’s latency-sensitive. |
| Even number of Keeper nodes | No majority with half down | Use odd-node RAFT clusters (3/5). |
| Ignoring DDL orchestration | Drifted schemas, stuck queries | Always use ON CLUSTER; monitor DDL queue state. |
| No visibility into replication | Surprise lag | Watch system.replicas and system.replication_queue. |
| Not testing quorum loss | Outage during AZ failure | Run the drills above; practice SYSTEM SYNC REPLICA recovery. |
Internal link ideas (official docs only)
- ClickHouse Keeper overview & guarantees — why Keeper vs ZooKeeper.
- Replicated table engines — how Keeper stores replication metadata & paths.
- Distributed DDL — ON CLUSTER semantics and queues.
- Cluster deployment with dedicated Keeper — sizing and host separation.
- System tables — system.zookeeper, system.replicas, system.replication_queue, system.zookeeper_log.
- Keeper client utility — clickhouse-keeper-client usage.
Summary & call to action
ClickHouse performance lives or dies by coordination health. Put Keeper on dedicated nodes, use odd-node RAFT quorums, watch replication/DDL queues, and rehearse failure. Do that, and topology hiccups turn into non-events.
Try this today:
- Add dashboards for system.replicas and system.replication_queue.
- Run the leader-loss drill in staging; record recovery time.
- Audit configs to ensure Keeper is not sharing disks with ClickHouse data.
Image prompt (for DALL·E/Midjourney)
“A clean, modern architecture diagram of a ClickHouse deployment: two shards × two replicas, a 3-node ClickHouse Keeper quorum on separate hosts, distributed queries via ON CLUSTER, and arrows showing failover/election. Minimalistic, high contrast, isometric 3D.”
Tags
#ClickHouse #ClickHouseKeeper #HighAvailability #Replication #Raft #DistributedSystems #SRE #DataEngineering #OLAP #ProductionOps