Designing a Real-Time Analytics Stack: Kafka → ClickHouse with Exactly-Once Semantics

The problem (and why this matters)

Dashboards lag, incidents hide in the noise, and your OLTP store hates ad-hoc scans. You need millisecond analytics over streaming events—without duplicate rows or silent data loss when brokers hiccup. This guide shows how to wire Kafka → ClickHouse and choose the right ingestion option (and config) to hit exactly-once where it’s actually guaranteed.


The three ways to ingest Kafka into ClickHouse

Option | Where it runs | Semantics | When to choose
ClickPipes for Kafka | ClickHouse Cloud (managed) | At-least-once | Lowest ops, cloud-native; accept de-duplication downstream
Kafka Connect Sink (official) | Any Kafka Connect cluster | Exactly-once (EOS) with a built-in state store | You need EOS plus rich connector controls
Kafka table engine + materialized views | Inside ClickHouse | At-least-once | Self-hosted, simple, supports push and pull; pair with a dedup engine if needed

ClickHouse’s “Integrating Kafka” page is the clearest summary: ClickPipes = at-least-once, Kafka Connect Sink = exactly-once, Kafka table engine = at-least-once. Use this to set expectations with your stakeholders.


Architecture at a glance

  • Producers write to Kafka with transactions (idempotence + transactional.id).
  • Sink consumes with read_committed (no uncommitted/aborted records).
  • ClickHouse stores data in a MergeTree-family table (often ReplicatedMergeTree for HA).

Path 1 — Exactly-once with the ClickHouse Kafka Connect Sink

Why this is the straightest line to EOS: the official ClickHouse Kafka Connect Sink ships with EOS support. It uses a state store (KeeperMap) to coordinate delivery state and avoid duplicates; enable it with exactlyOnce=true.

Minimal connector config (JSON):

{
  "name": "clickhouse-connect",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "tasks.max": "1",
    "topics": "events",
    "hostname": "my-clickhouse.example",
    "database": "analytics",
    "username": "default",
    "password": "<SECRET>",
    "ssl": "true",
    "exactlyOnce": "true",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false"
  }
}

Docs note: EOS is listed as a main feature, powered by KeeperMap; you can also pick other state stores.

Producer & consumer prerequisites for EOS

  • Producer: set a transactional.id (this enables idempotence automatically) and wrap sends in a transaction.
  • Consumer (used by Connect): must read committed records only (isolation.level=read_committed); see the worker-config sketch below.
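
Kafka consumers default to read_uncommitted, so the isolation level has to be set on the Connect worker (or per connector). A minimal sketch, assuming a standard distributed worker; the per-connector override additionally assumes the worker permits client config overrides:

# connect-distributed.properties: Connect forwards "consumer."-prefixed settings to its sink consumers
consumer.isolation.level=read_committed

# Per-connector alternative (requires connector.client.config.override.policy=All on the worker);
# add to the connector JSON: "consumer.override.isolation.level": "read_committed"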

Java producer snippet (transactions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.ProducerFencedException;

Properties p = new Properties();
p.put("bootstrap.servers", "broker:9092");
p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
p.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
p.put("enable.idempotence", "true");        // implied when transactional.id is set
p.put("acks", "all");
p.put("transactional.id", "orders-producer-1");

KafkaProducer<String, byte[]> producer = new KafkaProducer<>(p);
producer.initTransactions();
try {
  producer.beginTransaction();
  producer.send(new ProducerRecord<>("events", key, value));  // key/value come from your event source
  // ...batch more sends...
  producer.commitTransaction();
} catch (ProducerFencedException e) {
  producer.close();            // fenced by a newer instance with the same transactional.id: close, don't abort
} catch (Exception e) {
  producer.abortTransaction(); // the whole batch is discarded; safe to retry the transaction
}

This setup, plus the Sink’s EOS, gives you the strongest duplicate-free path end-to-end.


Path 2 — ClickPipes (managed) for speed to value

If you’re on ClickHouse Cloud, ClickPipes is a few clicks: it scales, supports common formats, and has built-in retries, error tables, and simple controls. It’s at-least-once, so design for dedup downstream if you can’t tolerate duplicates.

Where it shines: lowest ops, quick wins, multi-source onboarding, Terraform/API automation. Where it doesn’t: if you need strict EOS guarantees, use the Kafka Connect Sink instead.


Path 3 — Kafka Table Engine + Materialized Views (self-hosted simple)

The Kafka table engine lets ClickHouse consume from topics; materialized views trigger on incoming batches to write into your target MergeTree table. Simple, battle-tested, but at-least-once.

DDL blueprint:

-- Kafka reader (source)
CREATE TABLE src.kafka_events
(
  ts DateTime64(3),
  user_id UInt64,
  event_type LowCardinality(String),
  payload String
)
ENGINE = Kafka
SETTINGS
  kafka_broker_list = 'broker:9092',
  kafka_topic_list  = 'events',
  kafka_group_name  = 'ckh-events',
  kafka_format      = 'JSONEachRow',
  kafka_num_consumers = 4;

-- Target (replicated for HA)
CREATE TABLE dst.events
(
  ts DateTime64(3),
  user_id UInt64,
  event_type LowCardinality(String),
  payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (user_id, ts);

-- Incremental ingestion (MV)
CREATE MATERIALIZED VIEW pipe.kafka_to_events
TO dst.events AS
SELECT ts, user_id, event_type, payload
FROM src.kafka_events;

  • The Kafka engine consumes each message once and advances offsets; the MV persists rows into the MergeTree target (you can inspect the engine’s consumers with the query below).
  • Use ReplicatedMergeTree for durability/HA in production.
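
To see what the engine’s consumers are doing (assignments, offsets, recent errors), recent ClickHouse releases expose a system table; availability depends on your version, so verify before relying on it:

-- one row per Kafka engine consumer; useful for spotting stalled or erroring consumers
SELECT * FROM system.kafka_consumers FORMAT Vertical;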

Want stronger dedup on this path? Model the target with ReplacingMergeTree (ORDER BY your business key, plus an optional version column), as in the sketch below. Know that dedup happens on background merges, so duplicates may persist briefly.
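
A minimal sketch of that dedup model, assuming event_id is the business key and using an ingestion timestamp as the version column (table and column names here are illustrative):

-- newest version of each event_id wins when background merges run
CREATE TABLE dst.events_dedup
(
  event_id    UUID,
  ts          DateTime64(3),
  user_id     UInt64,
  event_type  LowCardinality(String),
  payload     String,
  ingested_at DateTime64(3) DEFAULT now64(3)   -- version column for ReplacingMergeTree
)
ENGINE = ReplacingMergeTree(ingested_at)
ORDER BY event_id;

-- collapse not-yet-merged duplicates at read time (FINAL adds cost; use selectively)
SELECT count()
FROM dst.events_dedup FINAL
WHERE ts >= now() - INTERVAL 1 HOUR;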


Exactly-once: what it really means here

  • End-to-end EOS requires: transactional producers, a consumer reading only committed records, and a sink that writes idempotently and coordinates state. In practice, that’s Kafka transactions + read_committed + ClickHouse Kafka Connect Sink with exactlyOnce=true.
  • ClickPipes and the Kafka engine deliberately target at-least-once. Plan dedup or idempotent upserts if you pick them.

Performance checklist (mid-level, pragmatic)

  • Batching: Produce in batches and let Connect control insert sizing; avoid tiny messages.
  • Keys & partitions: Pick a key that balances partitions and preserves query locality (e.g., user_id for per-user reads).
  • ORDER BY: Align ClickHouse ORDER BY with your hot filters to minimize scanned bytes.
  • Backpressure: Watch Connect consumer lag and ClickHouse system.parts growth (a quick check follows this list).
  • Schema: Prefer typed JSON/Avro/Protobuf; don’t parse giant blobs in SELECT.
  • HA: Use ReplicatedMergeTree and a multi-broker Kafka cluster.
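
For the backpressure item, a quick part-count check; a sketch against the dst.events table from Path 3 (adjust database and table names to yours):

-- many small active parts that keep growing usually means merges can't keep up with inserts
SELECT table, partition, count() AS active_parts, sum(rows) AS total_rows
FROM system.parts
WHERE database = 'dst' AND table = 'events' AND active
GROUP BY table, partition
ORDER BY active_parts DESC
LIMIT 20;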

Common pitfalls (and blunt fixes)

  • “We set EOS but still see dupes.” Verify that producers actually use transactions, that the sink’s consumer is on read_committed, and that exactlyOnce=true is set. Otherwise you have only at-least-once.
  • Kafka engine floods CPU. Increase kafka_num_consumers only up to the topic’s partition count (and available cores); do heavy transforms in the MV or downstream.
  • ClickPipes surprises: it’s managed and at-least-once; design schemas and queries for dedup (business keys + ReplacingMergeTree).
  • Single-replica writes: with non-replicated tables you risk gaps on failover. Use ReplicatedMergeTree.

End-to-end example: producer → EOS sink → ClickHouse

  1. Producer (transactions): set transactional.id, acks=all; wrap sends in beginTransaction/commitTransaction.
  2. Sink (Kafka Connect): deploy the ClickHouse Kafka Connect Sink, set exactlyOnce=true, point to an existing target table.
  3. ClickHouse table (target):
CREATE TABLE analytics.events
(
  event_id UUID,
  ts       DateTime64(3),
  user_id  UInt64,
  type     LowCardinality(String),
  attrs    JSON
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (event_id, ts);
  4. Query safely: your downstream readers see one row per committed event; duplicates are prevented by the sink’s EOS logic at write time. The duplicate check below is a quick way to confirm.
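
To confirm the guarantee holds in practice, a simple duplicate check against the target table (a sketch; adjust for your schema):

-- should return no rows when exactly-once is working end to end
SELECT event_id, count() AS copies
FROM analytics.events
GROUP BY event_id
HAVING copies > 1
LIMIT 10;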

Internal link ideas (official docs)

  • Integrating Kafka with ClickHouse (options & semantics) — start here.
  • ClickHouse Kafka Connect Sink (EOS + config) — reference & examples.
  • ClickPipes for Kafka — managed ingestion in Cloud.
  • Kafka table engine — settings, examples, MV wiring.
  • Materialized Views — how inserts trigger view pipelines.
  • ReplicatedMergeTree — replication and background fetches.
  • Kafka transactions & read_committed — official API docs (Javadoc).

Summary & call to action

  • If you need exactly-once right now, use Kafka Connect Sink with transactions (transactional.id) and consumers on read_committed.
  • If you want lowest ops in Cloud and can live with at-least-once, use ClickPipes.
  • If you’re self-hosting and want simple plumbing, use Kafka engine + MVs, but model for dedup.

Try this today: stand up the ClickHouse Kafka Connect Sink with exactlyOnce=true and flip your producer to transactions. Measure dupes and lag before/after.


Image prompt

“A clean architecture diagram of a Kafka→ClickHouse pipeline: transactional producers → Kafka → Kafka Connect Sink (EOS) → ReplicatedMergeTree in ClickHouse; alternative lanes show ClickPipes and Kafka table engine+MV. Minimalistic, high contrast, isometric 3D.”

Tags

#ClickHouse #Kafka #ExactlyOnce #DataEngineering #Streaming #MaterializedViews #ClickPipes #KafkaConnect #ReplicatedMergeTree #RealTimeAnalytics
