Designing a Real-Time Analytics Stack: Kafka → ClickHouse with Exactly-Once Semantics
The problem (and why this matters)
Dashboards lag, incidents hide in the noise, and your OLTP store hates ad-hoc scans. You need millisecond analytics over streaming events—without duplicate rows or silent data loss when brokers hiccup. This guide shows how to wire Kafka → ClickHouse and choose the right ingestion option (and config) to hit exactly-once where it’s actually guaranteed.
The three ways to ingest Kafka into ClickHouse
| Option | Where it runs | Semantics | When to choose |
|---|---|---|---|
| ClickPipes for Kafka | ClickHouse Cloud (managed) | At-least-once | Lowest ops, cloud-native; you accept de-duping downstream. |
| Kafka Connect Sink (official) | Any Kafka Connect | Exactly-once (EOS) with built-in state store | You need EOS + rich connector controls. |
| Kafka Table Engine + Materialized Views | Inside ClickHouse | At-least-once | Self-host, simple, push & pull; pair with dedup engines if needed. |
ClickHouse’s “Integrating Kafka” page is the clearest summary: ClickPipes = at-least-once, Kafka Connect Sink = exactly-once, Kafka table engine = at-least-once. Use this to set expectations with your stakeholders.
Architecture at a glance
- Producers write to Kafka with transactions (idempotence + `transactional.id`).
- The sink consumes with `read_committed` (no uncommitted or aborted records).
- ClickHouse stores data in a MergeTree-family table (often `ReplicatedMergeTree` for HA).
Path 1 — Exactly-once with the ClickHouse Kafka Connect Sink
Why this is the straightest line to EOS: the official ClickHouse Kafka Connect Sink ships with EOS support. It uses a state store (KeeperMap) to coordinate and avoid duplicates; enable `exactlyOnce=true`.
Minimal connector config (JSON):
{
  "name": "clickhouse-connect",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "tasks.max": "1",
    "topics": "events",
    "hostname": "my-clickhouse.example",
    "database": "analytics",
    "username": "default",
    "password": "<SECRET>",
    "ssl": "true",
    "exactlyOnce": "true",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false"
  }
}
Docs note: EOS is listed as a main feature, powered by KeeperMap; you can also pick other state stores.
Producer & consumer prerequisites for EOS
- Producer: set a `transactional.id` (this enables idempotence automatically) and wrap sends in a transaction.
- Consumer (used by Connect): must read committed records only (`isolation.level=read_committed`).
Java producer snippet (transactions):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties p = new Properties();
p.put("bootstrap.servers", "broker:9092");
p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");      // adjust to your key type
p.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer"); // adjust to your value type
p.put("enable.idempotence", "true"); // implied when transactional.id is set
p.put("acks", "all");
p.put("transactional.id", "orders-producer-1");

KafkaProducer<String, byte[]> producer = new KafkaProducer<>(p);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("events", key, value)); // key/value defined elsewhere
    // ...batch more sends...
    producer.commitTransaction();
} catch (Exception e) {
    // Fatal errors (e.g. ProducerFencedException) require close() instead of abort.
    producer.abortTransaction();
}
This setup, plus the Sink’s EOS, gives you the strongest duplicate-free path end-to-end.
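If anything other than Connect reads these topics (backfill jobs, debugging consumers), it must also opt into committed-only reads, or it will see records from aborted transactions. A minimal sketch of a plain Java consumer; the group id, deserializers, and poll timeout are placeholder choices:
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties c = new Properties();
c.put("bootstrap.servers", "broker:9092");
c.put("group.id", "events-debug-reader"); // placeholder group id
c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
c.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
c.put("isolation.level", "read_committed"); // skip uncommitted/aborted records

KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(c);
consumer.subscribe(List.of("events"));
ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, byte[]> r : records) {
    // Only records from committed transactions arrive here.
    System.out.printf("offset=%d key=%s%n", r.offset(), r.key());
}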
Path 2 — ClickPipes (managed) for speed to value
If you’re on ClickHouse Cloud, ClickPipes is a few clicks: it scales, supports common formats, and ships with built-in retries, error tables, and simple controls. It’s at-least-once (design for dedup downstream if you can’t tolerate dupes).
Where it shines: lowest ops, quick wins, multi-source onboarding, Terraform/API automation. Where it doesn’t: when you need strict EOS guarantees, use the Kafka Connect Sink instead.
Path 3 — Kafka Table Engine + Materialized Views (self-hosted simple)
The Kafka table engine lets ClickHouse consume from topics; materialized views trigger on incoming batches to write into your target MergeTree table. Simple, battle-tested, but at-least-once.
DDL blueprint:
-- Kafka reader (source)
CREATE TABLE src.kafka_events
(
ts DateTime64(3),
user_id UInt64,
event_type LowCardinality(String),
payload String
)
ENGINE = Kafka
SETTINGS
kafka_broker_list = 'broker:9092',
kafka_topic_list = 'events',
kafka_group_name = 'ckh-events',
kafka_format = 'JSONEachRow',
kafka_num_consumers = 4;
-- Target (replicated for HA)
CREATE TABLE dst.events
(
ts DateTime64(3),
user_id UInt64,
event_type LowCardinality(String),
payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (user_id, ts);
-- Incremental ingestion (MV)
CREATE MATERIALIZED VIEW pipe.kafka_to_events
TO dst.events AS
SELECT ts, user_id, event_type, payload
FROM src.kafka_events;
- The Kafka engine consumes each message once and advances offsets; MVs persist the data into MergeTree.
- Use `ReplicatedMergeTree` for durability/HA in production.
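Once the Kafka engine table and the MV are in place, a quick sanity query against the target table (names from the DDL above) confirms rows are flowing; a sketch, adjust the time bucket to taste:
-- Rows landing in dst.events per minute over the last hour
SELECT
    toStartOfMinute(ts) AS minute,
    count() AS events_per_min
FROM dst.events
WHERE ts > now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute DESC;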
Want stronger de-dup on this path? Model with ReplacingMergeTree (ORDER BY your business key; optional version column). Know that dedup happens on background merges, so duplicates may persist briefly.
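A minimal sketch of that modeling, mirroring the target columns above; the `event_id` business key and `ver` version column are illustrative additions, not part of the earlier schema:
-- Background merges keep the row with the highest ver for each ORDER BY key
CREATE TABLE dst.events_dedup
(
    event_id UUID,
    ts DateTime64(3),
    user_id UInt64,
    event_type LowCardinality(String),
    payload String,
    ver UInt64
)
ENGINE = ReplacingMergeTree(ver)
ORDER BY (event_id);

-- Until merges catch up, FINAL forces dedup at read time (at extra query cost)
SELECT count() FROM dst.events_dedup FINAL;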
Exactly-once: what it really means here
- End-to-end EOS requires transactional producers, a consumer that reads only committed records, and a sink that writes idempotently and coordinates state. In practice, that means Kafka transactions + `read_committed` + the ClickHouse Kafka Connect Sink with `exactlyOnce=true`.
- ClickPipes and the Kafka table engine deliberately target at-least-once. Plan for dedup or idempotent upserts if you pick them.
Performance checklist (mid-level, pragmatic)
- Batching: Produce in batches and let Connect control insert sizing; avoid tiny messages.
- Keys & partitions: Pick a key that balances partitions and preserves query locality (e.g., user_id for per-user reads).
- ORDER BY: Align the ClickHouse `ORDER BY` with your hot filters to minimize scanned bytes.
- Backpressure: Watch Connect consumer lag and ClickHouse `system.parts` growth (see the query after this list).
- Schema: Prefer typed JSON/Avro/Protobuf; don’t parse giant blobs in SELECT.
- HA: Use `ReplicatedMergeTree` and a multi-broker Kafka cluster.
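For the backpressure item above, active part counts per table are a cheap early warning that inserts are too small or too frequent; a sketch against the built-in system.parts:
-- Tables with the most active parts; sustained growth means merges are falling behind
SELECT
    database,
    table,
    count() AS active_parts,
    sum(rows) AS total_rows
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;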
Common pitfalls (and blunt fixes)
- “We set EOS but still see dupes.” Verify that producers actually use transactions, and that the sink consumes with `read_committed` and has `exactlyOnce=true` (see the snippet after this list). Otherwise you have only at-least-once.
- Kafka engine floods CPU. Increase `kafka_num_consumers` only up to the partition/core count; do heavy transforms in the MV or downstream.
- ClickPipes surprises: it’s managed and at-least-once; design schemas and queries for dedup (business keys + `ReplacingMergeTree`).
- Single-replica writes: with non-replicated tables you risk gaps on failover. Use `ReplicatedMergeTree`.
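For the first pitfall, Kafka Connect can pin the consumer isolation level per connector via the consumer.override.* prefix, assuming the worker permits client overrides (connector.client.config.override.policy). A sketch of the keys you would merge into the connector config shown earlier:
{
  "config": {
    "exactlyOnce": "true",
    "consumer.override.isolation.level": "read_committed"
  }
}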
End-to-end example: producer → EOS sink → ClickHouse
- Producer (transactions): set `transactional.id` and `acks=all`; wrap sends in `beginTransaction`/`commitTransaction`.
- Sink (Kafka Connect): deploy the ClickHouse Kafka Connect Sink, set `exactlyOnce=true`, and point it to an existing target table.
- ClickHouse table (target):
CREATE TABLE analytics.events
(
event_id UUID,
ts DateTime64(3),
user_id UInt64,
type LowCardinality(String),
attrs JSON
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (event_id, ts);
- Query safely: your downstream readers see one row per committed event; duplicates are prevented by the sink’s EOS logic at write time.
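A quick way to verify that, and to measure dupes before/after enabling EOS, is to look for business keys that appear more than once in the target table from the DDL above:
-- Any event_id with more than one row indicates duplicates slipped through
SELECT
    event_id,
    count() AS copies
FROM analytics.events
GROUP BY event_id
HAVING count() > 1
ORDER BY copies DESC
LIMIT 20;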
Internal link ideas (official docs)
- Integrating Kafka with ClickHouse (options & semantics) — start here.
- ClickHouse Kafka Connect Sink (EOS + config) — reference & examples.
- ClickPipes for Kafka — managed ingestion in Cloud.
- Kafka table engine — settings, examples, MV wiring.
- Materialized Views — how inserts trigger view pipelines.
- ReplicatedMergeTree — replication and background fetches.
- Kafka transactions & `read_committed` — official API docs (Javadoc).
Summary & call to action
- If you need exactly-once right now, use the Kafka Connect Sink with transactions (`transactional.id`) and consumers on `read_committed`.
- If you want the lowest ops in Cloud and can live with at-least-once, use ClickPipes.
- If you’re self-hosting and want simple plumbing, use Kafka engine + MVs, but model for dedup.
Try this today: stand up the ClickHouse Kafka Connect Sink with `exactlyOnce=true` and flip your producer to transactions. Measure dupes and lag before/after.
Image prompt
“A clean architecture diagram of a Kafka→ClickHouse pipeline: transactional producers → Kafka → Kafka Connect Sink (EOS) → ReplicatedMergeTree in ClickHouse; alternative lanes show ClickPipes and Kafka table engine+MV. Minimalistic, high contrast, isometric 3D.”
Tags
#ClickHouse #Kafka #ExactlyOnce #DataEngineering #Streaming #MaterializedViews #ClickPipes #KafkaConnect #ReplicatedMergeTree #RealTimeAnalytics