Apache Pulsar for Data Engineers: When a Message Bus Becomes a Streaming Platform

Hook: Your Kafka cluster groans under “one topic = one history forever.” Cold segments clog SSDs, tenants fight for partitions, and adding real-time features means bolting on yet another component. If this sounds familiar, Apache Pulsar may be the saner architecture: a multi-tenant, geo-ready streaming platform built from day one to separate compute from storage.


Why Pulsar matters (in plain English)

Pulsar is a cloud-native streaming stack that combines:

  • A durable message bus (topics, partitions, subscriptions)
  • Long-term storage (tiered, cheap, and decoupled from brokers)
  • Lightweight compute (Functions) and connectors (Pulsar IO)

For data engineers, that translates into:

  • Operational headroom for multi-tenant teams
  • Elastic retention without melting broker disks
  • Global replication patterns that won’t wreck latency budgets
  • Clean integration paths to OLAP and lakehouse zones

Architecture at a glance

Key pieces:

  • Brokers: stateless-ish front doors handling produce/consume, authn/z, topic routing.
  • Apache BookKeeper (bookies): append-only segment storage for ledgers (the actual message data).
  • ZooKeeper / etcd: metadata and coordination (Pulsar can use different metadata stores depending on version/distribution).
  • Namespaces & Tenants: first-class multi-tenancy boundaries for quotas, auth, and isolation.
  • Tiered Storage: offload cold segments to S3/GCS/ABFS while keeping hot segments on bookies.
  • Pulsar Functions & IO: in-stream transforms and connectors (CDC, sinks, sources).
  • Geo-Replication: topic-level replication between clusters/regions.

Mental model: Unlike Kafka’s monolithic commit log per partition, Pulsar stores each topic’s log as a sequence of BookKeeper ledgers (segments). Segments are rotated and offloaded, so retention and history scale without being pinned to broker disks, and brokers scale independently of storage.


Core concepts (quick tour)

  • Topic: persistent://<tenant>/<namespace>/<topic>
  • Subscriptions (compared in the sketch after this list):
    • Exclusive: one consumer per sub.
    • Shared: competing consumers share messages (parallelism).
    • Failover: active/standby consumers with ordered failover.
    • Key_Shared: maintains key-level order while scaling consumers.
  • Schema registry: Avro/JSON/Protobuf with compatibility checks.
  • Transactions: cross-topic atomic writes/reads for exactly-once pipelines (with care).
  • Retention & offload: set by namespace; cold data moves to object storage.
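
The subscription mode is a property of the subscription, not the topic, so one topic can serve very different consumers side by side. A minimal Python sketch (broker URL, topic, and subscription names are illustrative):

import pulsar
from pulsar import ConsumerType

client = pulsar.Client("pulsar://localhost:6650")  # assumed local broker
topic = "persistent://ecom/prod/orders"

# Shared: competing consumers split the stream for parallelism (no ordering)
workers = client.subscribe(topic, "etl-workers", consumer_type=ConsumerType.Shared)

# Failover: one active consumer, the rest stand by; ordering is preserved
audit = client.subscribe(topic, "audit-writer", consumer_type=ConsumerType.Failover)

# Key_Shared: scale out while all messages for a given key hit the same consumer
billing = client.subscribe(topic, "billing", consumer_type=ConsumerType.KeyShared)

client.close()  # closes the consumers too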

Pulsar vs Kafka (no fluff)

| Capability | Pulsar | Kafka |
| --- | --- | --- |
| Storage model | Segmented ledgers via BookKeeper; decoupled compute/storage | Monolithic log per partition; broker-coupled storage |
| Tiered storage | Built-in, native policy | Add-on(s) or vendor features |
| Multi-tenancy | First-class tenants & namespaces | Possible via ACLs & naming; not first-class |
| Subscription modes | Exclusive, Shared, Failover, Key_Shared | Consumer groups (shared); no native key_shared equivalent |
| Geo-replication | Native, per-topic/namespace | MirrorMaker / other tooling |
| Functions/IO | Built-in serverless-style | ksqlDB / Kafka Connect (separate services) |
| Exactly-once | Transactions available | EOS via idempotent producers + transactions |
| Ops surface | Brokers stateless-er; storage on bookies | Brokers handle both compute + storage |

Bottom line: Pulsar shines for multi-tenant, long-retention, globally replicated workloads. Kafka shines when you want the vast ecosystem and a single-tenant, broker-centric model you already know.


Real example: Python producer/consumer with schema and Key_Shared

# pip install pulsar-client==3.*  (version may vary)
import pulsar
from pulsar import AuthenticationToken
from pulsar import ConsumerType
from pulsar import schema

client = pulsar.Client(
    service_url="pulsar://broker.pulsar.svc.cluster.local:6650",
    authentication=AuthenticationToken("YOUR_TOKEN"),
)

class Order(schema.Record):
    order_id = schema.String()
    user_id = schema.String()
    amount  = schema.Float()
    ts      = schema.Long()

topic = "persistent://ecom/prod/orders"

# Producer with schema + keying for ordered processing per user
producer = client.create_producer(
    topic,
    schema=schema.AvroSchema(Order),
)

producer.send(
    Order(order_id="o-1", user_id="u-42", amount=19.99, ts=1732540800000),
    partition_key="u-42"
)

# Key_Shared consumer keeps per-key ordering while scaling out
consumer = client.subscribe(
    topic,
    subscription_name="billing-service",
    consumer_type=ConsumerType.KeyShared,
    schema=schema.AvroSchema(Order),
)

# receive() raises an exception if no message arrives within the timeout
msg = consumer.receive(timeout_millis=3000)
try:
    record = msg.value()
    # process(record)
    consumer.acknowledge(msg)
except Exception:
    consumer.negative_acknowledge(msg)

producer.close()
consumer.close()
client.close()

Why this matters:

  • Schema gives you evolution and safety.
  • Keyed messages + Key_Shared = per-user order guarantees while scaling horizontally.
  • Ack/Nack controls at-least-once delivery; add transactions if you need end-to-end exactly-once across topics/sinks. Redelivery timing is tunable per subscription (sketched below).
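
How quickly a nacked message comes back is set when you subscribe. A small sketch; the broker URL, topic, and delay value are illustrative, and the parameter name assumes a recent Python client:

import pulsar
from pulsar import ConsumerType

client = pulsar.Client("pulsar://localhost:6650")  # assumed local broker

consumer = client.subscribe(
    "persistent://ecom/prod/orders",
    subscription_name="billing-service",
    consumer_type=ConsumerType.KeyShared,
    negative_ack_redelivery_delay_ms=30000,  # nacked messages are redelivered after ~30s
)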

Pulsar SQL (ad hoc analytics on streams)

Pulsar integrates with Presto/Trino-style SQL for querying topics:

-- Example: read from a Pulsar topic via Trino catalog
SELECT user_id,
       COUNT(*) AS orders,
       SUM(amount) AS revenue
FROM pulsar."ecom/prod".orders
WHERE ts >= 1732512000000  -- from a given epoch
GROUP BY user_id
ORDER BY revenue DESC
LIMIT 10;

Use this for exploration, anomaly checks, and small dashboards, not for heavy OLAP. For large-scale analytics, offload to object storage and query via your lakehouse (e.g., Iceberg/Delta on S3).


Best practices (what actually works)

Data modeling & topics

  • Keep topics narrow and intentional. Overloaded “kitchen sink” topics create tangled consumers.
  • Use partition keys that control hot spotting (e.g., hash(user_id), bucket(ts)); see the sketch after this list.
  • Put related topics in the same namespace to inherit sensible retention/offload policies.
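
If the natural key is too coarse (a country code, a tenant id), derive the partition key from a finer-grained field so load spreads while per-entity ordering survives. A sketch; the bucket count and field names are illustrative:

import hashlib

def order_key(country: str, user_id: str, buckets: int = 64) -> str:
    # Spread a coarse, hot key (country) across N buckets. A given user
    # always hashes to the same bucket, so per-user ordering under
    # Key_Shared is preserved.
    bucket = int(hashlib.sha1(user_id.encode()).hexdigest(), 16) % buckets
    return f"{country}-{bucket}"

# e.g. producer.send(order, partition_key=order_key("US", order.user_id))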

Schemas

  • Make schemas mandatory. Enable compatibility checks (BACKWARD or FULL).
  • Prefer Avro/Protobuf for evolution; avoid schemaless JSON in critical pipelines (evolution sketch after this list).
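
Under BACKWARD or FULL compatibility, evolution usually means adding optional fields with defaults rather than renaming or removing existing ones. A sketch extending the Order record from the example below (the currency field is hypothetical; confirm your client version supports field defaults):

from pulsar import schema

class OrderV2(schema.Record):
    order_id = schema.String()
    user_id  = schema.String()
    amount   = schema.Float()
    ts       = schema.Long()
    # New optional field with a default: existing messages still decode,
    # and the evolved schema stays backward compatible.
    currency = schema.String(default="USD", required=False)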

Throughput & ordering

  • Use Key_Shared to maintain per-key order with horizontal scale.
  • Tune producer batching: start with the defaults, then raise batching_max_messages / batching_max_allowed_size_in_bytes before cranking batching_max_publish_delay_ms (sketch below).
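
These are the batching knobs on the Python client's create_producer; the broker URL and values are illustrative starting points, not recommendations:

import pulsar

client = pulsar.Client("pulsar://localhost:6650")  # assumed local broker

producer = client.create_producer(
    "persistent://ecom/prod/orders",
    batching_enabled=True,
    batching_max_messages=1000,                     # flush after this many messages...
    batching_max_allowed_size_in_bytes=128 * 1024,  # ...or this many bytes...
    batching_max_publish_delay_ms=10,               # ...or this much added latency
)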

Retention & cost

  • Set namespace-level retention by time/size.
  • Turn on tiered storage early. Keep hot hours on bookies; offload days/weeks to S3/GCS (policy sketch after this list).
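
Retention is a namespace policy, normally set with pulsar-admin; the same can be done against the admin REST API. A rough sketch, assuming the v2 endpoint /admin/v2/namespaces/{tenant}/{namespace}/retention and token auth; verify path and payload against your version's docs:

import requests

ADMIN_URL = "http://broker.pulsar.svc.cluster.local:8080"  # hypothetical admin endpoint
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

# Keep 7 days / 512 GB per namespace; offload thresholds and tiered-storage
# policies are likewise set per namespace (via pulsar-admin or the same API surface).
requests.post(
    f"{ADMIN_URL}/admin/v2/namespaces/ecom/prod/retention",
    json={"retentionTimeInMinutes": 7 * 24 * 60, "retentionSizeInMB": 512 * 1024},
    headers=HEADERS,
    timeout=10,
).raise_for_status()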

Exactly-once

  • Start with at-least-once + idempotent sinks (upsert sketch after this list).
  • Adopt transactions when you truly need cross-topic atomicity (accept the extra coordination overhead).
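
An idempotent sink usually just means upserting on a natural key so a redelivered message overwrites rather than duplicates. A sketch using a SQLite-style upsert (Postgres is analogous; the table and db handle are assumptions):

def write_order(db, order):
    # Upsert keyed on order_id: replaying the same message is harmless,
    # which is all at-least-once delivery needs to look exactly-once downstream.
    db.execute(
        """
        INSERT INTO orders (order_id, user_id, amount, ts)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (order_id) DO UPDATE SET
            user_id = excluded.user_id,
            amount  = excluded.amount,
            ts      = excluded.ts
        """,
        (order.order_id, order.user_id, order.amount, order.ts),
    )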

Multi-tenant & security

  • One tenant per team/domain; one or more namespaces per environment (dev/stage/prod); see the sketch after this list.
  • Use role-based auth with tokens or mTLS; set quotas (rate, storage) at namespace/tenant.
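
Tenants and namespaces are admin-plane objects, usually created with pulsar-admin; against the REST API it looks roughly like this (paths, roles, and cluster names are assumptions; check your version's docs):

import requests

ADMIN_URL = "http://broker.pulsar.svc.cluster.local:8080"  # hypothetical admin endpoint
HEADERS = {"Authorization": "Bearer YOUR_ADMIN_TOKEN"}

# One tenant per team, scoped to the clusters it may use
requests.put(
    f"{ADMIN_URL}/admin/v2/tenants/ecom",
    json={"adminRoles": ["ecom-admin"], "allowedClusters": ["us-east"]},
    headers=HEADERS, timeout=10,
).raise_for_status()

# One namespace per environment under that tenant
requests.put(
    f"{ADMIN_URL}/admin/v2/namespaces/ecom/prod",
    json={},
    headers=HEADERS, timeout=10,
).raise_for_status()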

Geo-replication

  • Replicate only what you must.
  • Understand latency tax and conflict strategies if multiple regions produce to the same logical stream.

Ops

  • Treat brokers as stateless; scale them for CPU/network.
  • Size bookies for IOPS and durability. Monitor ledger replication and disk utilization.
  • Regularly validate offload jobs and cursor health (no unbounded backlogs); see the backlog sketch after this list.
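
Cursor health is visible in topic stats: every subscription reports its backlog. A sketch against the v2 admin stats endpoint (admin URL and alert threshold are illustrative):

import requests

ADMIN_URL = "http://broker.pulsar.svc.cluster.local:8080"  # hypothetical admin endpoint

def subscription_backlogs(tenant: str, namespace: str, topic: str) -> dict:
    # Per-subscription backlog taken from the topic's stats document
    stats = requests.get(
        f"{ADMIN_URL}/admin/v2/persistent/{tenant}/{namespace}/{topic}/stats",
        timeout=10,
    ).json()
    return {name: sub.get("msgBacklog", 0)
            for name, sub in stats.get("subscriptions", {}).items()}

for sub, backlog in subscription_backlogs("ecom", "prod", "orders").items():
    if backlog > 100_000:  # pick a threshold that matches your SLOs
        print(f"WARN: subscription {sub} backlog={backlog}")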

Common pitfalls (and how to avoid them)

  • “Kafka mental model” applied blindly: Pulsar’s segmented storage changes retention and compaction behavior—learn the namespace/offload knobs.
  • Hot partitions from naive keys (e.g., country code): use hashing or bucketing to spread load.
  • Skipping schemas: You’ll pay later with brittle consumers. Turn schemas on day one.
  • Storing everything hot: Bookies are not your data lake. Offload to object storage and query there.
  • Overusing Functions for heavy ETL: Great for glue logic, not for large joins/windows. Offload to Flink/Spark when complexity grows.

Real-world patterns you can ship

  • Real-time CDC → materialized views
    Debezium → Pulsar → Functions (light transforms) → Sink to Redis / MongoDB for read models (Function sketch after this list).
  • Event-driven ML features
    Pulsar → Feature computation (Flink/Spark) → Write to feature store → Online inference apps consume.
  • Global audit stream with cheap history
    Multi-tenant namespaces, strict schemas, 90 days hot, 7 years in S3 via tiered storage.
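
For the light-transform step in the CDC pattern above, a Pulsar Function stays small. A sketch with the Python Functions SDK (the enrichment logic is illustrative; input/output topics are wired at deploy time via pulsar-admin functions):

import json
from pulsar import Function

class EnrichOrder(Function):
    # Receives the raw CDC event payload, adds a derived field, returns the new payload
    def process(self, input, context):
        event = json.loads(input)
        event["amount_bucket"] = "high" if event.get("amount", 0) >= 100 else "low"
        context.get_logger().info("enriched order %s", event.get("order_id"))
        return json.dumps(event)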

Internal link ideas (add these on your site)

  • “Designing Idempotent Sinks for At-Least-Once Streams”
  • “Choosing Keys: Avoiding Hot Partitions in Event Streams”
  • “Tiered Storage Patterns for Cheap, Long Retention”
  • “Exactly-Once: When Transactions Are Worth It (and when not)”
  • “Kafka vs Pulsar: Migration Checklist and Gotchas”

Conclusion & Takeaways

Pulsar earns its keep when you need clean multi-tenancy, elastic retention, and global replication without chaining more services. The segmented storage + tiered offload model solves a real pain: keep hot data fast and cheap, keep history cheaper. Start simple—schemas on, Key_Shared where you need order, offload early—and scale brokers and bookies independently as traffic grows.

TL;DR

  • Separate compute from storage for saner ops.
  • Schemas by default; evolve safely.
  • Keyed ordering without losing parallelism.
  • Tiered storage keeps costs predictable.
  • Use transactions sparingly; prefer idempotency.

Image prompt

“A clean, modern architecture diagram of Apache Pulsar: brokers, BookKeeper bookies, tenants/namespaces, topics with partitions, geo-replication arrows, and tiered storage to S3. Minimalist, high contrast, 3D isometric style.”

Tags

#ApachePulsar #Streaming #DataEngineering #EventDriven #RealTime #Scalability #KafkaAlternative #Architecture #BookKeeper #TieredStorage