ClickHouse Keeper in Production: Topologies, Failover, and Disaster Drills


Why this matters (quick story)

Your ClickHouse cluster hums—until a routine deploy spikes load, a node blips, and schema changes stall. Dashboards still read, but writes and ON CLUSTER DDL hang. The culprit isn’t storage or CPU. It’s coordination. ClickHouse Keeper is the heartbeat that makes replication and distributed DDL reliable—if you set it up like an SRE, not a sidecar. This piece shows exactly how.


Keeper in one minute

  • What it is: ClickHouse’s built-in coordination service; a ZooKeeper-compatible replacement that uses the Raft consensus algorithm. It coordinates replicated tables and distributed DDL.
  • Guarantees: linearizable writes; reads are ZooKeeper-like by default (non-linearizable).
  • How it runs: as a stand-alone service or embedded in clickhouse-server; both use the same <keeper_server> configuration core.

Production topologies that won’t page you at 2am

Baseline pattern (recommended)

  • Dedicated Keeper hosts (not co-located with ClickHouse servers) to isolate I/O and latency; the docs explicitly recommend this for production.
  • Odd-node Raft quorum (commonly 3 to start). Keeper needs a Raft majority, and ClickHouse’s reference “Replication + Scaling” example shows two shards × two replicas coordinated by a 3-node Keeper ensemble.
  • Modest sizing: Keeper nodes are lightweight; roughly 4 GB of RAM is generally enough until the cluster grows large.
  • Separate disks (or at least separate volumes) from ClickHouse data; coordination is sensitive to disk latency.

3-node vs 5-node (quick compare)

  • 3-node: tolerates the loss of 1 node. Fits most production clusters; simplest ops, lowest overhead (Raft majority of 2).
  • 5-node: tolerates the loss of 2 nodes. For stricter SLAs, multi-AZ deployments, and maintenance headroom; more operational cost (Raft majority of 3).
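
For reference, a Raft ensemble of N nodes stays writable only while a strict majority is up:

  majority(N) = floor(N / 2) + 1, so majority(3) = 2 and majority(5) = 3.

That is why 3 nodes tolerate 1 failure and 5 tolerate 2, and why a 4-node ensemble still tolerates only 1: even sizes add cost without adding fault tolerance.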

Mental model: Keeper is your control plane. Treat it like etcd in Kubernetes—small, redundant, boring, and protected.


A clean, minimal Keeper config

Keeper cluster (stand-alone), 3 nodes:

<!-- /etc/clickhouse-keeper/keeper_config.xml (example) -->
<clickhouse>
  <keeper_server>
    <tcp_port>9181</tcp_port>
    <server_id>1</server_id>         <!-- unique per node -->
    <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
    <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>

    <raft_configuration>
      <server><id>1</id><hostname>keeper-1</hostname><port>9444</port></server>
      <server><id>2</id><hostname>keeper-2</hostname><port>9444</port></server>
      <server><id>3</id><hostname>keeper-3</hostname><port>9444</port></server>
    </raft_configuration>
  </keeper_server>
</clickhouse>

Settings and <keeper_server> fundamentals are covered in the official Keeper configuration docs.
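
If you need to tune timeouts or Raft log verbosity, Keeper also accepts a <coordination_settings> block inside <keeper_server>. A minimal sketch; the values are illustrative, not recommendations, so check the defaults for your version:

<!-- optional, inside <keeper_server> (illustrative values) -->
<coordination_settings>
  <operation_timeout_ms>10000</operation_timeout_ms>   <!-- timeout for a single client operation -->
  <session_timeout_ms>30000</session_timeout_ms>       <!-- max client session timeout -->
  <raft_logs_level>information</raft_logs_level>       <!-- verbosity of Raft internals -->
</coordination_settings>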

Point ClickHouse servers at Keeper:

<!-- /etc/clickhouse-server/config.d/keeper.xml -->
<clickhouse>
  <zookeeper> <!-- Keeper is ZooKeeper-compatible -->
    <node><host>keeper-1</host><port>9181</port></node>
    <node><host>keeper-2</host><port>9181</port></node>
    <node><host>keeper-3</host><port>9181</port></node>
  </zookeeper>
</clickhouse>

That’s the pattern shown in ClickHouse’s replicated deployment examples.
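
To confirm a server actually holds a Keeper session after this change, run a quick check from clickhouse-client. system.zookeeper_connection exists on reasonably recent releases; the system.zookeeper query works anywhere a coordination service is configured:

-- Which Keeper node this server is talking to (recent versions)
SELECT name, host, port, is_expired FROM system.zookeeper_connection;

-- Fallback: any successful read proves the session is alive
SELECT name FROM system.zookeeper WHERE path = '/' LIMIT 5;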


What Keeper actually coordinates

  • Replicated*MergeTree metadata and queues (/clickhouse/tables/...). Keeper stores replica state and replication tasks.
  • Distributed DDL (ON CLUSTER): tasks are queued through Keeper so every node applies the change; the queue is inspectable with the query sketched below.
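
When an ON CLUSTER statement looks stuck, the DDL queue is visible from SQL. A sketch of a per-host status check via system.distributed_ddl_queue:

-- Recent distributed DDL tasks and their status on each host
SELECT entry, initiator_host, host, status, exception_code
FROM system.distributed_ddl_queue
ORDER BY entry DESC
LIMIT 20;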

Observability: fast checks you’ll use daily

Run these from clickhouse-client:

-- Keeper connectivity / browse
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 5;  -- exists only if ZK/Keeper is configured; a 'path' condition is required

-- Replication health per table on the local node
SELECT database, table, is_leader, is_session_expired, absolute_delay
FROM system.replicas;

-- What’s stuck in replication?
SELECT database, table, replica_name, position, source_replica, required_quorum
FROM system.replication_queue
ORDER BY position
LIMIT 20;

-- Keeper calls log (latency/errors)
SELECT * FROM system.zookeeper_log ORDER BY event_time DESC LIMIT 20;

Docs for these system tables: system.zookeeper, system.replicas, system.replication_queue, system.zookeeper_log.

CLI for Keeper:
You can also poke Keeper directly:

# List znodes at root
clickhouse-keeper-client -h keeper-1 -p 9181 --query "ls /"

# Create a test znode
clickhouse-keeper-client -h keeper-1 -p 9181 --query "create /ops_sanity 'ok'"

See the clickhouse-keeper-client utility docs.
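
Keeper also answers ZooKeeper-style four-letter-word commands on its client port (availability is governed by the four_letter_word_white_list setting), which is handy from any box with netcat:

# Liveness and role (look for the Mode: line)
echo ruok | nc keeper-1 9181
echo stat | nc keeper-1 9181

# Detailed metrics: znode count, outstanding requests, leader/follower state
echo mntr | nc keeper-1 9181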


Failover behavior (what actually happens)

  • Leader loss: Raft elects a new leader from the remaining nodes (a majority is required). Writes that rely on Keeper pause during the election; ClickHouse reads continue from local parts.
  • Keeper unavailable: replicated tables switch to read-only, metadata operations and ON CLUSTER DDL fail, and the server keeps retrying the connection until Keeper is reachable again.

Translation: Losing quorum means no changes to replicated metadata. Your queries over existing local parts still run; synchronization resumes when Keeper recovers.
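
You can see this state from the ClickHouse side during an outage. A small sketch that flags replicas whose Keeper session is gone or that have dropped into read-only mode:

-- Replicas currently cut off from Keeper
SELECT database, table, is_readonly, is_session_expired, zookeeper_exception
FROM system.replicas
WHERE is_readonly OR is_session_expired;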


Disaster drills (copy/paste runbooks)

Pro tip: rehearse these in a staging cluster that mirrors prod topology.

1) Leader loss (single node)

Goal: Verify election and that replication/DDL recover.

  1. Stop the current leader Keeper (e.g., systemctl stop clickhouse-keeper on keeper-1).
  2. Watch the election stabilize: echo stat | nc keeper-2 9181 (look for the Mode: line), or keep listing znodes with clickhouse-keeper-client.
  3. On a ClickHouse server, confirm coordination and replication recover:

     SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1;  -- should succeed
     SELECT database, table, is_leader, absolute_delay FROM system.replicas;

     Verify replication queues advance again.

2) Follower loss (no quorum loss)

Goal: Confirm the cluster tolerates one node down (3-node cluster).

  1. Stop any non-leader Keeper.
  2. Insert data into a replicated table; confirm new parts appear on the other replicas and system.replication_queue stays small (a sketch follows this list).
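
A minimal sketch of step 2, assuming the sales.events table defined later in this post (any small replicated table works):

-- On one replica: write a test row
INSERT INTO sales.events VALUES (now(), 42, 9.99);

-- On another replica: the row should arrive and the queue should drain
SELECT count() FROM sales.events WHERE user_id = 42;
SELECT count() AS pending FROM system.replication_queue WHERE table = 'events';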

3) Quorum loss (majority down) — the scary one

Goal: Validate “safe stop” behavior.

  1. Stop 2 of 3 Keeper nodes.
  2. Attempt CREATE TABLE ... ON CLUSTER or an insert into a ReplicatedMergeTree table. Expect errors (replicated tables switch to read-only); watch the server retry its Keeper connection.
  3. Restore quorum, then verify recovery:

     SYSTEM SYNC REPLICA db.tbl;  -- wait until caught up
     SELECT * FROM system.replication_queue WHERE database = 'db' AND table = 'tbl';

     Docs: SYSTEM statements and system.replication_queue.

4) Replica rebuild (node replacement)

Goal: Practice a clean re-join.

  1. From a surviving replica (not the dead node; a replica cannot drop itself this way): SYSTEM DROP REPLICA 'replica_name' FROM TABLE db.tbl; This removes the dead replica’s metadata path in Keeper.
  2. Recreate the table on the replacement node with Replicated*MergeTree and the same Keeper path/macros; it will fetch parts from the surviving replicas (a verification sketch follows).
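
A small verification sketch for the re-join, assuming the replacement node has the standard {shard}/{replica} macros in its config:

-- Confirm the macros resolve to the expected shard/replica names
SELECT * FROM system.macros;

-- After recreating the table, wait for fetches to finish and check the queue
SYSTEM SYNC REPLICA sales.events;
SELECT count() AS pending FROM system.replication_queue WHERE table = 'events';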

Backup & recovery notes (Keeper itself)

  • Keeper maintains snapshots and logs under the configured paths; these are not format-compatible with ZooKeeper (a converter exists for ZK→Keeper migration; a sketch of its invocation follows). Ensure these directories sit on reliable storage and are included in backups.
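
If you are migrating off ZooKeeper, the bundled converter turns ZooKeeper transaction logs and snapshots into a Keeper snapshot. A sketch of the documented invocation; the paths are examples and must match your layout:

# Convert ZooKeeper state (from a stopped ZK leader's data) into a Keeper snapshot
clickhouse-keeper-converter \
  --zookeeper-logs-dir /var/lib/zookeeper/version-2 \
  --zookeeper-snapshots-dir /var/lib/zookeeper/version-2 \
  --output-dir /var/lib/clickhouse/coordination/snapshots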

Hardening & ops checklist

  • Topology: odd-node quorum; start with 3 dedicated Keeper nodes.
  • Isolation: keep Keeper separate from ClickHouse servers, or at least on separate disks.
  • Sizing: begin around 4 GB RAM per Keeper node; monitor and scale as the cluster grows.
  • Monitoring: alert on system.zookeeper_log errors, rising system.replication_queue length, or is_session_expired in system.replicas (example queries below).
  • DDL discipline: use ON CLUSTER and track task status; don’t hand-edit schema on a single node.
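
A sketch of the queries behind those alerts; the thresholds are illustrative and should be tuned per cluster:

-- Keeper responses with errors in the last 10 minutes
SELECT count() AS keeper_errors
FROM system.zookeeper_log
WHERE type = 'Response' AND error IS NOT NULL AND error != 'ZOK'
  AND event_time > now() - INTERVAL 10 MINUTE;

-- Replication backlog per table
SELECT database, table, count() AS queue_length, max(num_tries) AS max_tries
FROM system.replication_queue
GROUP BY database, table
ORDER BY queue_length DESC;

-- Replicas with expired Keeper sessions or excessive delay (300 s is arbitrary)
SELECT database, table, is_session_expired, absolute_delay
FROM system.replicas
WHERE is_session_expired OR absolute_delay > 300;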

Real-world snippets you’ll actually use

Create a replicated table (macros used for unique Keeper paths):

CREATE TABLE sales.events ON CLUSTER prod_cluster
(
  ts DateTime,
  user_id UInt64,
  amount Float32
)
ENGINE = ReplicatedMergeTree(
  '/clickhouse/tables/{shard}/sales.events',  -- keeper path
  '{replica}'                                  -- replica name
)
ORDER BY (ts, user_id);

Replication parameters and Keeper paths are documented; prefer the /clickhouse/tables/{shard}/... pattern.

Distributed DDL example:

ALTER TABLE sales.events ON CLUSTER prod_cluster
ADD COLUMN payment_method LowCardinality(String);

ON CLUSTER fans this out to all nodes via Keeper’s DDL queue.


Common pitfalls (and how to avoid them)

  • Co-locating Keeper with hot ClickHouse storage. Symptom: Keeper timeouts, flapping sessions. Fix: move Keeper to dedicated hosts or separate disks; it is latency-sensitive.
  • Even number of Keeper nodes. Symptom: no majority with half the nodes down. Fix: use odd-node Raft clusters (3 or 5).
  • Ignoring DDL orchestration. Symptom: drifted schemas, stuck queries. Fix: always use ON CLUSTER; monitor DDL queue state.
  • No visibility into replication. Symptom: surprise lag. Fix: watch system.replicas and system.replication_queue.
  • Not testing quorum loss. Symptom: outage during an AZ failure. Fix: run the drills above; practice SYSTEM SYNC REPLICA recovery.

Internal link ideas (official docs only)

  • ClickHouse Keeper overview & guarantees — why Keeper vs ZooKeeper.
  • Replicated table engines — how Keeper stores replication metadata & paths.
  • Distributed DDL — ON CLUSTER semantics and queues.
  • Cluster deployment with dedicated Keeper — sizing and host separation.
  • System tables — system.zookeeper, system.replicas, system.replication_queue, system.zookeeper_log.
  • Keeper client utility — clickhouse-keeper-client usage.

Summary & call to action

ClickHouse performance lives or dies by coordination health. Put Keeper on dedicated nodes, use odd-node RAFT quorums, watch replication/DDL queues, and rehearse failure. Do that, and topology hiccups turn into non-events.

Try this today:

  • Add dashboards for system.replicas and system.replication_queue.
  • Run the leader-loss drill in staging; record recovery time.
  • Audit configs to ensure Keeper is not sharing disks with data.

Image prompt (for DALL·E/Midjourney)

“A clean, modern architecture diagram of a ClickHouse deployment: two shards × two replicas, a 3-node ClickHouse Keeper quorum on separate hosts, distributed queries via ON CLUSTER, and arrows showing failover/election. Minimalistic, high contrast, isometric 3D.”

Tags

#ClickHouse #ClickHouseKeeper #HighAvailability #Replication #Raft #DistributedSystems #SRE #DataEngineering #OLAP #ProductionOps
