Apache Pulsar for Data Engineers: When a Message Bus Becomes a Streaming Platform

Hook: Your Kafka cluster groans under “one topic = one history forever.” Cold segments clog SSDs, tenants fight for partitions, and adding real-time features means bolting on yet another component. If this sounds familiar, Apache Pulsar may be the saner architecture: a multi-tenant, geo-ready streaming platform built from day one to separate compute from storage.


Why Pulsar matters (in plain English)

Pulsar is a cloud-native streaming stack that combines:

  • A durable message bus (topics, partitions, subscriptions)
  • Long-term storage (tiered, cheap, and decoupled from brokers)
  • Lightweight compute (Functions) and connectors (Pulsar IO)

For data engineers, that translates into:

  • Operational headroom for multi-tenant teams
  • Elastic retention without melting broker disks
  • Global replication patterns that won’t wreck latency budgets
  • Clean integration paths to OLAP and lakehouse zones

Architecture at a glance

Key pieces:

  • Brokers: stateless-ish front doors handling produce/consume, authn/z, topic routing.
  • Apache BookKeeper (bookies): append-only segment storage for ledgers (the actual message data).
  • ZooKeeper / etcd: metadata and coordination (Pulsar can use different metadata stores depending on version/distribution).
  • Namespaces & Tenants: first-class multi-tenancy boundaries for quotas, auth, and isolation.
  • Tiered Storage: offload cold segments to S3/GCS/ABFS while keeping hot segments on bookies.
  • Pulsar Functions & IO: in-stream transforms and connectors (CDC, sinks, sources).
  • Geo-Replication: topic-level replication between clusters/regions.

Mental model: Unlike Kafka’s monolithic commit log per partition, Pulsar stores each topic’s log as a sequence of BookKeeper ledgers (segments). Segments are rotated and offloaded, so retention and history scale without being pinned to broker disks, and brokers scale independently of storage.


Core concepts (quick tour)

  • Topic: persistent://<tenant>/<namespace>/<topic>
  • Subscriptions (compared in the sketch after this list):
    • Exclusive: one consumer per sub.
    • Shared: competing consumers share messages (parallelism).
    • Failover: active/standby consumers with ordered failover.
    • Key_Shared: maintains key-level order while scaling consumers.
  • Schema registry: Avro/JSON/Protobuf with compatibility checks.
  • Transactions: cross-topic atomic writes/reads for exactly-once pipelines (with care).
  • Retention & offload: set by namespace; cold data moves to object storage.
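
The subscription mode is a property of the subscription, not the topic, so one topic can serve very different consumers side by side. A minimal Python sketch (broker URL, topic, and subscription names are illustrative):

import pulsar
from pulsar import ConsumerType

client = pulsar.Client("pulsar://localhost:6650")  # assumed local broker
topic = "persistent://ecom/prod/orders"

# Shared: competing consumers split the stream for parallelism (no ordering)
workers = client.subscribe(topic, "etl-workers", consumer_type=ConsumerType.Shared)

# Failover: one active consumer, the rest stand by; ordering is preserved
audit = client.subscribe(topic, "audit-writer", consumer_type=ConsumerType.Failover)

# Key_Shared: scale out while all messages for a given key hit the same consumer
billing = client.subscribe(topic, "billing", consumer_type=ConsumerType.KeyShared)

client.close()  # closes the consumers too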

Pulsar vs Kafka (no fluff)

| Capability | Pulsar | Kafka |
| --- | --- | --- |
| Storage model | Segmented ledgers via BookKeeper; decoupled compute/storage | Monolithic log per partition; broker-coupled storage |
| Tiered storage | Built-in, native policy | Add-on(s) or vendor features |
| Multi-tenancy | First-class tenants & namespaces | Possible via ACLs & naming; not first-class |
| Subscription modes | Exclusive, Shared, Failover, Key_Shared | Consumer groups (shared); no native key_shared equivalent |
| Geo-replication | Native, per-topic/namespace | MirrorMaker / other tooling |
| Functions/IO | Built-in serverless-style | ksqlDB / Kafka Connect (separate services) |
| Exactly-once | Transactions available | EOS via idempotent producers + transactions |
| Ops surface | Brokers stateless-er; storage on bookies | Brokers handle both compute + storage |

Bottom line: Pulsar shines for multi-tenant, long-retention, globally replicated workloads. Kafka shines when you want the vast ecosystem and a single-tenant, broker-centric model you already know.


Real example: Python producer/consumer with schema and Key_Shared

# pip install pulsar-client==3.*  (version may vary)
import pulsar
from pulsar import AuthenticationToken
from pulsar import ConsumerType
from pulsar import schema

client = pulsar.Client(
    service_url="pulsar://broker.pulsar.svc.cluster.local:6650",
    authentication=AuthenticationToken("YOUR_TOKEN"),
)

class Order(schema.Record):
    order_id = schema.String()
    user_id = schema.String()
    amount  = schema.Float()
    ts      = schema.Long()

topic = "persistent://ecom/prod/orders"

# Producer with schema + keying for ordered processing per user
producer = client.create_producer(
    topic,
    schema=schema.AvroSchema(Order),
)

producer.send(
    Order(order_id="o-1", user_id="u-42", amount=19.99, ts=1732540800000),
    partition_key="u-42"
)

# Key_Shared consumer keeps per-key ordering while scaling out
consumer = client.subscribe(
    topic,
    subscription_name="billing-service",
    consumer_type=ConsumerType.KeyShared,
    schema=schema.AvroSchema(Order),
)

# receive() raises an exception if no message arrives within the timeout
msg = consumer.receive(timeout_millis=3000)
try:
    record = msg.value()
    # process(record)
    consumer.acknowledge(msg)
except Exception:
    consumer.negative_acknowledge(msg)

producer.close()
consumer.close()
client.close()

Why this matters:

  • Schema gives you evolution and safety.
  • Keyed messages + Key_Shared = per-user order guarantees while scaling horizontally.
  • Ack/Nack controls at-least-once delivery; add transactions if you need end-to-end exactly-once across topics/sinks. Redelivery timing is tunable per subscription (sketched below).
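
How quickly a nacked message comes back is set when you subscribe. A small sketch; the broker URL, topic, and delay value are illustrative, and the parameter name assumes a recent Python client:

import pulsar
from pulsar import ConsumerType

client = pulsar.Client("pulsar://localhost:6650")  # assumed local broker

consumer = client.subscribe(
    "persistent://ecom/prod/orders",
    subscription_name="billing-service",
    consumer_type=ConsumerType.KeyShared,
    negative_ack_redelivery_delay_ms=30000,  # nacked messages are redelivered after ~30s
)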

Pulsar SQL (ad hoc analytics on streams)

Pulsar integrates with Presto/Trino-style SQL for querying topics:

-- Example: read from a Pulsar topic via Trino catalog
SELECT user_id,
       COUNT(*) AS orders,
       SUM(amount) AS revenue
FROM pulsar."ecom/prod".orders
WHERE ts >= 1732512000000  -- from a given epoch
GROUP BY user_id
ORDER BY revenue DESC
LIMIT 10;

Use this for exploration, anomaly checks, and small dashboards, not for heavy OLAP. For large-scale analytics, offload to object storage and query via your lakehouse (e.g., Iceberg/Delta on S3).


Best practices (what actually works)

Data modeling & topics

  • Keep topics narrow and intentional. Overloaded “kitchen sink” topics create tangled consumers.
  • Use partition keys that control hot spotting (e.g., hash(user_id), bucket(ts)); see the sketch after this list.
  • Put related topics in the same namespace to inherit sensible retention/offload policies.
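
If the natural key is too coarse (a country code, a tenant id), derive the partition key from a finer-grained field so load spreads while per-entity ordering survives. A sketch; the bucket count and field names are illustrative:

import hashlib

def order_key(country: str, user_id: str, buckets: int = 64) -> str:
    # Spread a coarse, hot key (country) across N buckets. A given user
    # always hashes to the same bucket, so per-user ordering under
    # Key_Shared is preserved.
    bucket = int(hashlib.sha1(user_id.encode()).hexdigest(), 16) % buckets
    return f"{country}-{bucket}"

# e.g. producer.send(order, partition_key=order_key("US", order.user_id))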

Schemas

  • Make schemas mandatory. Enable compatibility checks (BACKWARD or FULL).
  • Prefer Avro/Protobuf for evolution; avoid schemaless JSON in critical pipelines (evolution sketch after this list).
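
Under BACKWARD or FULL compatibility, evolution usually means adding optional fields with defaults rather than renaming or removing existing ones. A sketch extending the Order record from the example below (the currency field is hypothetical; confirm your client version supports field defaults):

from pulsar import schema

class OrderV2(schema.Record):
    order_id = schema.String()
    user_id  = schema.String()
    amount   = schema.Float()
    ts       = schema.Long()
    # New optional field with a default: existing messages still decode,
    # and the evolved schema stays backward compatible.
    currency = schema.String(default="USD", required=False)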

Throughput & ordering

  • Use Key_Shared to maintain per-key order with horizontal scale.
  • Tune producer batching: start with the defaults, then raise batching_max_messages / batching_max_allowed_size_in_bytes before cranking batching_max_publish_delay_ms (sketch below).
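
These are the batching knobs on the Python client's create_producer; the broker URL and values are illustrative starting points, not recommendations:

import pulsar

client = pulsar.Client("pulsar://localhost:6650")  # assumed local broker

producer = client.create_producer(
    "persistent://ecom/prod/orders",
    batching_enabled=True,
    batching_max_messages=1000,                     # flush after this many messages...
    batching_max_allowed_size_in_bytes=128 * 1024,  # ...or this many bytes...
    batching_max_publish_delay_ms=10,               # ...or this much added latency
)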

Retention & cost

  • Set namespace-level retention by time/size.
  • Turn on tiered storage early. Keep hot hours on bookies; offload days/weeks to S3/GCS (policy sketch after this list).
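
Retention is a namespace policy, normally set with pulsar-admin; the same can be done against the admin REST API. A rough sketch, assuming the v2 endpoint /admin/v2/namespaces/{tenant}/{namespace}/retention and token auth; verify path and payload against your version's docs:

import requests

ADMIN_URL = "http://broker.pulsar.svc.cluster.local:8080"  # hypothetical admin endpoint
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

# Keep 7 days / 512 GB per namespace; offload thresholds and tiered-storage
# policies are likewise set per namespace (via pulsar-admin or the same API surface).
requests.post(
    f"{ADMIN_URL}/admin/v2/namespaces/ecom/prod/retention",
    json={"retentionTimeInMinutes": 7 * 24 * 60, "retentionSizeInMB": 512 * 1024},
    headers=HEADERS,
    timeout=10,
).raise_for_status()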

Exactly-once

  • Start with at-least-once + idempotent sinks (upsert sketch after this list).
  • Adopt transactions when you truly need cross-topic atomicity (accept the extra coordination overhead).
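
An idempotent sink usually just means upserting on a natural key so a redelivered message overwrites rather than duplicates. A sketch using a SQLite-style upsert (Postgres is analogous; the table and db handle are assumptions):

def write_order(db, order):
    # Upsert keyed on order_id: replaying the same message is harmless,
    # which is all at-least-once delivery needs to look exactly-once downstream.
    db.execute(
        """
        INSERT INTO orders (order_id, user_id, amount, ts)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (order_id) DO UPDATE SET
            user_id = excluded.user_id,
            amount  = excluded.amount,
            ts      = excluded.ts
        """,
        (order.order_id, order.user_id, order.amount, order.ts),
    )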

Multi-tenant & security

  • One tenant per team/domain; one or more namespaces per environment (dev/stage/prod); see the sketch after this list.
  • Use role-based auth with tokens or mTLS; set quotas (rate, storage) at namespace/tenant.
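
Tenants and namespaces are admin-plane objects, usually created with pulsar-admin; against the REST API it looks roughly like this (paths, roles, and cluster names are assumptions; check your version's docs):

import requests

ADMIN_URL = "http://broker.pulsar.svc.cluster.local:8080"  # hypothetical admin endpoint
HEADERS = {"Authorization": "Bearer YOUR_ADMIN_TOKEN"}

# One tenant per team, scoped to the clusters it may use
requests.put(
    f"{ADMIN_URL}/admin/v2/tenants/ecom",
    json={"adminRoles": ["ecom-admin"], "allowedClusters": ["us-east"]},
    headers=HEADERS, timeout=10,
).raise_for_status()

# One namespace per environment under that tenant
requests.put(
    f"{ADMIN_URL}/admin/v2/namespaces/ecom/prod",
    json={},
    headers=HEADERS, timeout=10,
).raise_for_status()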

Geo-replication

  • Replicate only what you must.
  • Understand latency tax and conflict strategies if multiple regions produce to the same logical stream.

Ops

  • Treat brokers as stateless; scale them for CPU/network.
  • Size bookies for IOPS and durability. Monitor ledger replication and disk utilization.
  • Regularly validate offload jobs and cursor health (no unbounded backlogs); see the backlog sketch after this list.
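
Cursor health is visible in topic stats: every subscription reports its backlog. A sketch against the v2 admin stats endpoint (admin URL and alert threshold are illustrative):

import requests

ADMIN_URL = "http://broker.pulsar.svc.cluster.local:8080"  # hypothetical admin endpoint

def subscription_backlogs(tenant: str, namespace: str, topic: str) -> dict:
    # Per-subscription backlog taken from the topic's stats document
    stats = requests.get(
        f"{ADMIN_URL}/admin/v2/persistent/{tenant}/{namespace}/{topic}/stats",
        timeout=10,
    ).json()
    return {name: sub.get("msgBacklog", 0)
            for name, sub in stats.get("subscriptions", {}).items()}

for sub, backlog in subscription_backlogs("ecom", "prod", "orders").items():
    if backlog > 100_000:  # pick a threshold that matches your SLOs
        print(f"WARN: subscription {sub} backlog={backlog}")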

Common pitfalls (and how to avoid them)

  • “Kafka mental model” applied blindly: Pulsar’s segmented storage changes retention and compaction behavior—learn the namespace/offload knobs.
  • Hot partitions from naive keys (e.g., country code): use hashing or bucketing to spread load.
  • Skipping schemas: You’ll pay later with brittle consumers. Turn schemas on day one.
  • Storing everything hot: Bookies are not your data lake. Offload to object storage and query there.
  • Overusing Functions for heavy ETL: Great for glue logic, not for large joins/windows. Offload to Flink/Spark when complexity grows.

Real-world patterns you can ship

  • Real-time CDC → materialized views
    Debezium → Pulsar → Functions (light transforms) → Sink to Redis / MongoDB for read models (Function sketch after this list).
  • Event-driven ML features
    Pulsar → Feature computation (Flink/Spark) → Write to feature store → Online inference apps consume.
  • Global audit stream with cheap history
    Multi-tenant namespaces, strict schemas, 90 days hot, 7 years in S3 via tiered storage.
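
For the light-transform step in the CDC pattern above, a Pulsar Function stays small. A sketch with the Python Functions SDK (the enrichment logic is illustrative; input/output topics are wired at deploy time via pulsar-admin functions):

import json
from pulsar import Function

class EnrichOrder(Function):
    # Receives the raw CDC event payload, adds a derived field, returns the new payload
    def process(self, input, context):
        event = json.loads(input)
        event["amount_bucket"] = "high" if event.get("amount", 0) >= 100 else "low"
        context.get_logger().info("enriched order %s", event.get("order_id"))
        return json.dumps(event)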

Internal link ideas (add these on your site)

  • “Designing Idempotent Sinks for At-Least-Once Streams”
  • “Choosing Keys: Avoiding Hot Partitions in Event Streams”
  • “Tiered Storage Patterns for Cheap, Long Retention”
  • “Exactly-Once: When Transactions Are Worth It (and when not)”
  • “Kafka vs Pulsar: Migration Checklist and Gotchas”

Conclusion & Takeaways

Pulsar earns its keep when you need clean multi-tenancy, elastic retention, and global replication without chaining more services. The segmented storage + tiered offload model solves a real pain: keep hot data fast and cheap, keep history cheaper. Start simple—schemas on, Key_Shared where you need order, offload early—and scale brokers and bookies independently as traffic grows.

TL;DR

  • Separate compute from storage for saner ops.
  • Schemas by default; evolve safely.
  • Keyed ordering without losing parallelism.
  • Tiered storage keeps costs predictable.
  • Use transactions sparingly; prefer idempotency.

Image prompt

“A clean, modern architecture diagram of Apache Pulsar: brokers, BookKeeper bookies, tenants/namespaces, topics with partitions, geo-replication arrows, and tiered storage to S3. Minimalist, high contrast, 3D isometric style.”

Tags

#ApachePulsar #Streaming #DataEngineering #EventDriven #RealTime #Scalability #KafkaAlternative #Architecture #BookKeeper #TieredStorage