ClickHouse Keeper in Production: Topologies, Failover, and Disaster Drills
Why this matters (quick story)
Your ClickHouse cluster hums—until a routine deploy spikes load, a node blips, and schema changes stall. Dashboards still read, but writes and ON CLUSTER DDL hang. The culprit isn’t storage or CPU. It’s coordination. ClickHouse Keeper is the heartbeat that makes replication and distributed DDL reliable—if you set it up like an SRE, not a sidecar. This piece shows exactly how.
Keeper in one minute
- What it is: ClickHouse’s built-in coordination service; a ZooKeeper-compatible replacement that runs RAFT for consensus. It coordinates replicated tables and distributed DDL.
- Guarantees: Linearizable writes; reads are ZooKeeper-like by default (non-linearizable).
- How it runs: stand-alone service or embedded inside clickhouse-server; both use the same <keeper_server> config block.
Production topologies that won’t page you at 2am
Baseline pattern (recommended)
- Dedicated Keeper hosts (not co-located with ClickHouse servers) to isolate I/O and latency. The docs explicitly recommend this for production.
- Odd-node RAFT quorum (commonly 3 to start). Keeper uses RAFT majority, and ClickHouse’s reference “Replication + Scaling” example shows two shards × two replicas coordinated by a 3-node Keeper.
- Modest sizing: Keeper nodes are light; ~4 GB RAM is generally enough until clusters grow large.
- Separate disks (or at least separate volumes) from ClickHouse data; coordination is sensitive to disk latency.
3-node vs 5-node (quick compare)
| Cluster | Tolerates | When to choose |
|---|---|---|
| 3-node | 1 node down | Most prod clusters; simplest ops; lowest overhead (RAFT majority). |
| 5-node | 2 nodes down | Stricter SLAs, multi-AZ, and maintenance headroom; more ops cost. |
Mental model: Keeper is your control plane. Treat it like etcd in Kubernetes—small, redundant, boring, and protected.
A clean, minimal Keeper config
Keeper cluster (stand-alone), 3 nodes:
<!-- /etc/clickhouse-keeper/keeper_config.xml (example) -->
<clickhouse>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id> <!-- unique per node -->
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
        <raft_configuration>
            <server><id>1</id><hostname>keeper-1</hostname><port>9444</port></server>
            <server><id>2</id><hostname>keeper-2</hostname><port>9444</port></server>
            <server><id>3</id><hostname>keeper-3</hostname><port>9444</port></server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
Settings and the <keeper_server> fundamentals are covered in the ClickHouse Keeper configuration reference.
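Timeouts and RAFT logging live under a <coordination_settings> block inside <keeper_server>; a minimal tuning sketch with illustrative values (check the settings reference for the exact names and defaults in your version):
<!-- inside <keeper_server>; values are illustrative, not recommendations -->
<coordination_settings>
    <operation_timeout_ms>10000</operation_timeout_ms>
    <session_timeout_ms>30000</session_timeout_ms>
    <raft_logs_level>information</raft_logs_level>
</coordination_settings>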
Point ClickHouse servers at Keeper:
<!-- /etc/clickhouse-server/config.d/keeper.xml -->
<clickhouse>
    <zookeeper> <!-- Keeper is ZooKeeper-compatible -->
        <node><host>keeper-1</host><port>9181</port></node>
        <node><host>keeper-2</host><port>9181</port></node>
        <node><host>keeper-3</host><port>9181</port></node>
    </zookeeper>
</clickhouse>
That’s the pattern shown in ClickHouse’s replicated deployment examples.
What Keeper actually coordinates
- Replicated*MergeTree metadata and queues (/clickhouse/tables/...): Keeper stores replica state and replication tasks.
- Distributed DDL (ON CLUSTER): tasks are queued via Keeper so every node applies the change.
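You can browse that metadata directly from SQL; a quick sketch (the path assumes the common /clickhouse/tables layout, so adjust it to your own naming):
-- List the znodes Keeper holds for replicated tables (path is an assumption)
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables'
LIMIT 20;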
Observability: fast checks you’ll use daily
Run these from clickhouse-client:
-- Keeper connectivity / browse
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 5; -- exists only if ZK/Keeper configured; a path filter is required
-- Replication health per table on the local node
SELECT database, table, is_leader, is_session_expired, absolute_delay
FROM system.replicas;
-- What’s stuck in replication?
SELECT database, table, replica_name, position, source_replica, required_quorum
FROM system.replication_queue
ORDER BY position
LIMIT 20;
-- Keeper calls log (latency/errors)
SELECT * FROM system.zookeeper_log ORDER BY event_time DESC LIMIT 20;
Docs for these system tables: system.zookeeper, system.replicas, system.replication_queue, system.zookeeper_log.
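For a daily glance at Keeper health, a rough aggregation over system.zookeeper_log is often enough; a sketch (verify column names against your server version):
-- Count Keeper requests/responses per operation and error over the last hour
SELECT type, op_num, error, count() AS cnt
FROM system.zookeeper_log
WHERE event_time > now() - INTERVAL 1 HOUR
GROUP BY type, op_num, error
ORDER BY cnt DESC
LIMIT 20;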
CLI for Keeper:
You can also poke Keeper directly:
# List znodes at root
clickhouse-keeper-client -h keeper-1 -p 9181 --query "ls /"
# Create a test znode
clickhouse-keeper-client -h keeper-1 -p 9181 --query "create /ops_sanity 'ok'"
See clickhouse-keeper-client utility docs.
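Keeper also answers ZooKeeper-style four-letter-word commands on the same client port, which helps when clickhouse-keeper-client isn’t installed on a box; a sketch using nc (hostname and port are placeholders):
# Liveness: should print "imok"
echo ruok | nc keeper-1 9181
# Brief status, including whether this node is the leader or a follower
echo stat | nc keeper-1 9181
# Monitoring metrics, one per line (feed these to your metrics pipeline)
echo mntr | nc keeper-1 9181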
Failover behavior (what actually happens)
- Leader loss: RAFT elects a new leader from the remaining nodes (majority required). Writes that go through Keeper pause during the election; ClickHouse reads continue to be served from local parts.
- Keeper unavailable: Metadata operations for replicated tables and ON CLUSTER DDL fail or queue; servers keep retrying the connection. The documentation notes that replicated tables attempt to reconnect and go read-only while Keeper is unreachable.
Translation: Losing quorum means no changes to replicated metadata. Your queries over existing local parts still run; synchronization resumes when Keeper recovers.
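A quick way to see this state from the ClickHouse side is system.replicas, which flags tables whose Keeper session has been lost; a small sketch:
-- Replicated tables that are currently read-only or lost their Keeper session
SELECT database, table, is_readonly, is_session_expired
FROM system.replicas
WHERE is_readonly OR is_session_expired;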
Disaster drills (copy/paste runbooks)
Pro tip: rehearse these in a staging cluster that mirrors prod topology.
1) Leader loss (single node)
Goal: Verify election and that replication/DDL recover.
- Stop the current leader Keeper (e.g., systemctl stop clickhouse-keeper on keeper-1).
- Watch the election stabilize: echo stat | nc keeper-2 9181 should soon report a leader/follower role on the survivors (or keep listing znodes with clickhouse-keeper-client).
- On a ClickHouse server, run:
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1; -- should succeed
SELECT database, table, is_leader, absolute_delay FROM system.replicas;
- Verify replication queues advance again.
2) Follower loss (no quorum loss)
Goal: Confirm the cluster tolerates one node down (3-node cluster).
- Stop any non-leader Keeper.
- Insert data into a replicated table; confirm new parts appear on other replicas and system.replication_queue stays small.
3) Quorum loss (majority down) — the scary one
Goal: Validate “safe stop” behavior.
- Stop 2 of 3 Keeper nodes.
- Attempt CREATE TABLE ... ON CLUSTER or insert into a ReplicatedMergeTree table. Expect errors or queuing; observe retries.
- Restore quorum; verify:
SYSTEM SYNC REPLICA db.tbl; -- wait until caught up
SELECT * FROM system.replication_queue WHERE database = 'db' AND table = 'tbl';
Docs: SYSTEM statements and the replication queue.
4) Replica rebuild (node replacement)
Goal: Practice a clean re-join.
- From a surviving replica, drop the dead replica’s metadata (the query cannot be run on the replica being removed):
SYSTEM DROP REPLICA 'replica_name' FROM TABLE db.tbl; -- removes the replica’s metadata path in Keeper
- Recreate the table on the new node with Replicated*MergeTree and the same Keeper path/macros; it will fetch parts from the surviving replicas.
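Once the replacement replica is recreated, a quick sanity check before declaring victory; a sketch (db.tbl is a placeholder):
-- Confirm the rebuilt replica is active and its queue is draining
SELECT replica_name, total_replicas, active_replicas, queue_size, absolute_delay
FROM system.replicas
WHERE database = 'db' AND table = 'tbl';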
Backup & recovery notes (Keeper itself)
- Keeper maintains snapshots and logs under the configured paths; these are not format-compatible with ZooKeeper (the clickhouse-keeper-converter utility exists for one-time ZK→Keeper migration). Ensure these directories are on reliable storage and included in backups.
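If you are migrating from ZooKeeper rather than restoring Keeper’s own files, the bundled clickhouse-keeper-converter does the one-time conversion; a sketch with placeholder paths (verify the flags against the migration docs for your version):
# Run with ZooKeeper stopped so its logs and snapshots are consistent
clickhouse-keeper-converter \
  --zookeeper-logs-dir /var/lib/zookeeper/version-2 \
  --zookeeper-snapshots-dir /var/lib/zookeeper/version-2 \
  --output-dir /var/lib/clickhouse/coordination/snapshots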
Hardening & ops checklist
- Topology: odd-node quorum; start with 3 dedicated Keeper nodes.
- Isolation: keep Keeper separate from ClickHouse servers, or at least on separate disks.
- Sizing: begin around 4 GB RAM per Keeper; monitor and scale as your cluster grows.
- Monitoring: alert on system.zookeeper_log errors, rising system.replication_queue length, or is_session_expired in system.replicas.
- DDL discipline: use ON CLUSTER and track task status; don’t hand-edit schema on a single node.
Real-world snippets you’ll actually use
Create a replicated table (macros used for unique Keeper paths):
CREATE TABLE sales.events ON CLUSTER prod_cluster
(
ts DateTime,
user_id UInt64,
amount Float32
)
ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/sales.events', -- keeper path
'{replica}' -- replica name
)
ORDER BY (ts, user_id);
Replication parameters and Keeper paths are documented; prefer the /clickhouse/tables/{shard}/... pattern.
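The {shard} and {replica} placeholders resolve from per-host macros in the server config; a minimal sketch (values are placeholders and must differ per host):
<!-- /etc/clickhouse-server/config.d/macros.xml (example values) -->
<clickhouse>
    <macros>
        <shard>01</shard>
        <replica>replica-1</replica>
    </macros>
</clickhouse>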
Distributed DDL example:
ALTER TABLE sales.events ON CLUSTER prod_cluster
ADD COLUMN payment_method LowCardinality(String);
ON CLUSTER fans the statement out to all nodes via Keeper’s DDL queue.
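To watch an ON CLUSTER statement land on each host, query system.distributed_ddl_queue; a sketch (column names are worth confirming on your version):
-- Recent distributed DDL tasks and their per-host status
SELECT entry, host, status, exception_code
FROM system.distributed_ddl_queue
ORDER BY entry DESC
LIMIT 20;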
Common pitfalls (and how to avoid them)
| Pitfall | Symptom | Fix |
|---|---|---|
| Co-locating Keeper with hot ClickHouse storage | Keeper timeouts, flapping sessions | Move Keeper to dedicated hosts or separate disks; it’s latency-sensitive. |
| Even number of Keeper nodes | No majority with half down | Use odd-node RAFT clusters (3/5). |
| Ignoring DDL orchestration | Drifted schemas, stuck queries | Always use ON CLUSTER; monitor DDL queue state. |
| No visibility into replication | Surprise lag | Watch system.replicas and system.replication_queue. |
| Not testing quorum loss | Outage during AZ failure | Run the drills above; practice SYSTEM SYNC REPLICA recovery. |
Internal link ideas (official docs only)
- ClickHouse Keeper overview & guarantees — why Keeper vs ZooKeeper.
- Replicated table engines — how Keeper stores replication metadata & paths.
- Distributed DDL — ON CLUSTER semantics and queues.
- Cluster deployment with dedicated Keeper — sizing and host separation.
- System tables — system.zookeeper, system.replicas, system.replication_queue, system.zookeeper_log.
- Keeper client utility — clickhouse-keeper-client usage.
Summary & call to action
ClickHouse performance lives or dies by coordination health. Put Keeper on dedicated nodes, use odd-node RAFT quorums, watch replication/DDL queues, and rehearse failure. Do that, and topology hiccups turn into non-events.
Try this today:
- Add dashboards for system.replicas and system.replication_queue.
- Run the leader-loss drill in staging; record recovery time.
- Audit configs to ensure Keeper is not sharing disks with ClickHouse data.
Image prompt (for DALL·E/Midjourney)
“A clean, modern architecture diagram of a ClickHouse deployment: two shards × two replicas, a 3-node ClickHouse Keeper quorum on separate hosts, distributed queries via ON CLUSTER, and arrows showing failover/election. Minimalistic, high contrast, isometric 3D.”
Tags
#ClickHouse #ClickHouseKeeper #HighAvailability #Replication #Raft #DistributedSystems #SRE #DataEngineering #OLAP #ProductionOps