Redis Pub/Sub vs Kafka: Which Model Fits Backend Eventing?
Backend teams eventually face the same question: should we distribute this event with Redis Pub/Sub, or should we use Kafka from day one? Both look like pub/sub systems at first glance, but their failure behavior, delivery semantics, and operational characteristics are fundamentally different. A wrong choice can feel fast in the short term, then become expensive when you need replay, auditing, or strict consistency between services.
The goal of this guide is not to declare one winner. It is to provide a technical decision framework: match each tool to the right event class by looking at delivery guarantees, ordering, latency, backpressure, replay requirements, and operational overhead together.
Problem framing: not all events are equal
In production, event streams have different risk profiles:
- Some events are real-time signals where occasional loss is acceptable (for example, transient live dashboard updates).
- Some events are business-critical facts (payment captured, invoice created, subscription renewed).
- Some consumers can be offline for minutes or hours and still need to catch up safely.
Redis Pub/Sub is excellent for ephemeral fan-out. A publisher sends a message, and subscribers connected at that moment receive it. A subscriber that is disconnected at that moment never receives the message, because the broker keeps no copy to redeliver. Kafka is built around a durable append-only log. Events are retained in topics, consumers track offsets, and late consumers can replay from earlier positions.
This leads to the first design question: what happens if a consumer is unavailable for ten minutes? If the answer is "nothing must be lost and it must catch up later," Kafka is usually the stronger default.
Architectural differences: ephemeral channel vs durable log
Redis Pub/Sub
- Messages exist in memory only while they are being delivered; nothing is persisted.
- The broker pushes messages but does not keep a durable per-subscriber history.
- Setup is simple and operational friction is low for small to medium systems.
- It is very effective for low-latency, real-time signaling patterns.
Kafka
- Events are stored durably on disk with configurable retention.
- Consumer offsets provide explicit progress tracking and recovery.
- Replay, reprocessing, and audit use cases are first-class.
- Partitioning enables high throughput and parallel consumption, but partition-key design is crucial.
Because of this, choosing between them is often less about raw throughput and more about failure semantics. Redis Pub/Sub optimizes for immediacy; Kafka optimizes for durable event history and controlled recovery.
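The difference in failure semantics can be illustrated with a toy in-memory model. This is a minimal sketch, not client code for either system: `EphemeralChannel` and `DurableLog` are hypothetical classes that only imitate the behavior described above (fan-out to connected subscribers versus an append-only log with per-group offsets).

```python
from collections import defaultdict


class EphemeralChannel:
    """Toy stand-in for Redis Pub/Sub: delivers only to connected subscribers."""

    def __init__(self):
        self.subscribers = {}  # subscriber name -> list acting as its inbox

    def subscribe(self, name):
        self.subscribers[name] = []
        return self.subscribers[name]

    def unsubscribe(self, name):
        self.subscribers.pop(name, None)

    def publish(self, message):
        # No history is kept: a subscriber that is not connected now never sees this.
        for inbox in self.subscribers.values():
            inbox.append(message)
        return len(self.subscribers)  # number of receivers at publish time


class DurableLog:
    """Toy stand-in for one Kafka partition: append-only records plus group offsets."""

    def __init__(self):
        self.records = []
        self.offsets = defaultdict(int)  # consumer group -> next offset to read

    def append(self, message):
        self.records.append(message)
        return len(self.records) - 1

    def poll(self, group):
        start = self.offsets[group]
        batch = self.records[start:]
        self.offsets[group] = len(self.records)  # commit after reading
        return batch


channel = EphemeralChannel()
log = DurableLog()

channel.subscribe("billing")
for event in ["e1", "e2"]:
    channel.publish(event)
    log.append(event)
log.poll("billing")  # consumer reads e1 and e2, then commits

channel.unsubscribe("billing")  # consumer goes offline
for event in ["e3", "e4"]:
    channel.publish(event)
    log.append(event)

inbox = channel.subscribe("billing")  # consumer comes back
channel.publish("e5")
log.append("e5")

received_from_channel = list(inbox)
received_from_log = log.poll("billing")
```

After the outage, the channel subscriber only sees what was published once it reconnected, while the log consumer resumes from its committed offset and receives everything it missed.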
Delivery guarantees, ordering, and trade-offs
Classic Redis Pub/Sub is at-most-once: a message is delivered to whoever is connected when it is published and is never retried. Network interruptions, subscriber restarts, and deployment windows therefore cause message loss. You can get durability with Redis Streams, but that is a different model from plain Pub/Sub and introduces consumer-group semantics you must operate.
Kafka is commonly run with at-least-once delivery. With deliberate producer settings (acks=all, enable.idempotence=true), producer retries do not write duplicate records to the log, and with explicit consumer commit logic the duplicates that remain (redelivery after a crash or rebalance) become manageable. "Exactly once" still requires end-to-end thinking: database writes, external API calls, and idempotency at the application boundary all matter.
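A small simulation can show why at-least-once delivery needs an idempotent consumer. The sketch below assumes events carry a unique `id` field and uses hypothetical helpers (`consume`, `IdempotentLedger`); it talks to no broker. The `PRODUCER_CONFIG` dictionary uses the standard Kafka producer setting names `acks` and `enable.idempotence`, while the broker address is a placeholder.

```python
# Producer settings named in the text; the address is a placeholder assumption.
PRODUCER_CONFIG = {
    "bootstrap.servers": "localhost:9092",
    "acks": "all",
    "enable.idempotence": True,
}


def consume(records, start_offset, handler, crash_before_commit_at=None):
    """Handle records from start_offset and return the last committed offset.

    The offset is committed only after the handler returns, so a crash between
    handling and committing leads to redelivery (at-least-once).
    """
    committed = start_offset
    for offset in range(start_offset, len(records)):
        handler(records[offset])
        if offset == crash_before_commit_at:
            return committed  # simulated crash: side effect done, offset not saved
        committed = offset + 1
    return committed


class IdempotentLedger:
    """Applies each event id at most once, however often it is delivered."""

    def __init__(self):
        self.seen_ids = set()
        self.balance = 0
        self.deliveries = 0

    def apply(self, event):
        self.deliveries += 1
        if event["id"] in self.seen_ids:
            return  # duplicate delivery, already applied
        self.seen_ids.add(event["id"])
        self.balance += event["amount"]


records = [
    {"id": "a", "amount": 10},
    {"id": "b", "amount": 5},
    {"id": "c", "amount": 7},
]
ledger = IdempotentLedger()

# First run crashes after handling "b" but before committing it.
offset = consume(records, 0, ledger.apply, crash_before_commit_at=1)
offset_after_crash = offset

# The restart resumes from the committed offset, so "b" is delivered again.
offset = consume(records, offset, ledger.apply)
```

The ledger sees four deliveries for three events, yet the balance reflects each event once. In a real service the seen-id set would have to live in the same durable store as the side effect, which is the "end-to-end thinking" the paragraph refers to.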
Ordering also has nuance:
- Kafka does not guarantee global ordering across all partitions, but it does preserve order inside each partition.
- Redis Pub/Sub ordering is hard to reason about once subscribers reconnect or run as multiple instances, because messages missed during a gap are never redelivered and there is no offset to resume from.
If your domain needs per-entity order (for example, all events of one account must be processed in sequence), Kafka with a stable partition key like accountId is typically more predictable.
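Per-entity ordering follows from routing every event with the same key to the same partition. The sketch below illustrates the idea with a CRC32 hash; this is a stand-in chosen for simplicity, not Kafka's own partitioner (the Java client's default uses a murmur2 hash for keyed records). The function name `partition_for` and the sample events are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (CRC32 used as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 4
events = [
    ("acct-1", "opened"),
    ("acct-2", "opened"),
    ("acct-1", "deposited"),
    ("acct-2", "closed"),
    ("acct-1", "closed"),
]

# Each partition is an append-only list, so order inside it is arrival order.
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for account_id, kind in events:
    partitions[partition_for(account_id, NUM_PARTITIONS)].append((account_id, kind))


def history(account_id):
    """Read back one account's events from the single partition that holds them."""
    p = partition_for(account_id, NUM_PARTITIONS)
    return [kind for key, kind in partitions[p] if key == account_id]
```

Because the mapping depends only on the key and the partition count, changing the number of partitions changes where keys land, which is one reason the partition count deserves planning up front.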
Scenario mapping: when each one fits best
Redis Pub/Sub is a strong fit for
- Real-time notifications where occasional loss is tolerable.
- WebSocket fan-out patterns that prioritize immediate delivery.
- Lightweight in-system signaling (for example, cache invalidation triggers) where fallback paths exist.
Kafka is a strong fit for
- Business-critical event pipelines that require traceability and replay.
- Multiple independent consumer groups processing the same event stream for different purposes.
- Event-driven systems where new services must bootstrap by replaying historical events.
A hybrid architecture is often pragmatic: publish source-of-truth domain events to Kafka, then derive fast transient notifications through Redis for UX responsiveness. This gives durability and speed without forcing a single tool to solve every concern.
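One way to sketch the hybrid approach is a publish function that treats the durable append as mandatory and the fast notification as best effort. Everything here is hypothetical and in-memory: `append_to_log` stands in for a Kafka produce call and `notify` for a Redis publish.

```python
def publish_domain_event(event, append_to_log, notify):
    """Append to the durable log first, then notify on a best-effort basis."""
    offset = append_to_log(event)  # must succeed; any exception propagates
    try:
        notify({"type": event["type"], "offset": offset})
    except Exception:
        # The transient notification is lost, but the log still holds the fact.
        pass
    return offset


log = []


def append_to_log(event):
    log.append(event)
    return len(log) - 1


sent = []


def flaky_notify(note):
    # Simulate the notification channel being unavailable for one event.
    if note["offset"] == 1:
        raise ConnectionError("notification channel unavailable")
    sent.append(note)


offsets = [
    publish_domain_event({"type": kind}, append_to_log, flaky_notify)
    for kind in ("invoice.created", "payment.captured", "subscription.renewed")
]
```

The failed notification costs some UX immediacy for one event, while consumers of the log still see all three facts.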
Failure modes and anti-patterns
A common anti-pattern is sending critical domain events only through Redis Pub/Sub. Everything appears fine until a restart, failover, or network flap introduces silent message loss and downstream inconsistencies.
Another anti-pattern is using Kafka with a poor partitioning strategy, such as a low-cardinality key that funnels most traffic into one hot partition. This increases consumer lag, lowers throughput, and creates uneven load.
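A hot partition can be spotted from per-partition message counts. The helper below is a minimal sketch with a hypothetical name (`partition_skew`) and made-up counts; it reports how much busier the busiest partition is than the average.

```python
def partition_skew(counts):
    """Ratio of the busiest partition's load to the mean load (1.0 means even)."""
    total = sum(counts)
    if total == 0:
        return 1.0
    mean = total / len(counts)
    return max(counts) / mean


even_load = [2500, 2500, 2500, 2500]
hot_load = [9000, 300, 400, 300]  # one key dominates traffic

even_skew = partition_skew(even_load)
hot_skew = partition_skew(hot_load)
```

A skew well above 1.0 suggests that a single consumer is carrying most of the work, regardless of how many partitions exist.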
Do not skip these controls:
- Define producer retry and timeout policies deliberately.
- Make consumers idempotent so replay and duplicate delivery are safe.
- Introduce dead-letter handling for poison messages.
- Version event schemas to avoid breaking downstream consumers.
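Dead-letter handling from the list above can be sketched as bounded retries followed by parking the message with its error. This is an illustrative in-memory version with hypothetical names (`process_with_dlq`, `handle`); a real consumer would publish the dead letter to a separate topic or queue.

```python
def process_with_dlq(messages, handler, max_attempts=3):
    """Try each message up to max_attempts times, then park it as a dead letter."""
    processed, dead_letters = [], []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                processed.append(message)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    dead_letters.append(
                        {"message": message, "error": str(exc), "attempts": attempt}
                    )
    return processed, dead_letters


def handle(message):
    # A poison message fails on every attempt, so retrying cannot fix it.
    if "amount" not in message:
        raise ValueError("missing amount")


messages = [{"id": 1, "amount": 5}, {"id": 2}, {"id": 3, "amount": 9}]
processed, dead_letters = process_with_dlq(messages, handle)
```

The poison message no longer blocks the messages behind it, and the recorded error and attempt count give operators what they need to fix and reprocess it.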
Metrics and observability
You can only validate architecture decisions with runtime evidence. Key metrics include:
- End-to-end event latency (publish to side-effect completion)
- Consumer lag growth and recovery behavior (especially for Kafka)
- Retry and failure rates per consumer group
- Dead-letter volume and reprocessing success rate
- Throughput and message-size distribution
For Redis-heavy flows, track subscriber disconnect frequency and impact. For Kafka, alert not only on absolute lag thresholds but also on lag acceleration, which often reveals developing bottlenecks earlier.
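Lag acceleration can be approximated as the second difference of evenly spaced lag samples. The sketch below is one possible rule, with hypothetical function names and thresholds: it alerts on absolute lag, or on lag growth that keeps speeding up over the last two intervals.

```python
def lag_acceleration(samples):
    """Second difference of evenly spaced lag samples."""
    velocity = [b - a for a, b in zip(samples, samples[1:])]
    return [b - a for a, b in zip(velocity, velocity[1:])]


def alert_reason(samples, lag_threshold, accel_threshold):
    """Return "lag", "acceleration", or None for the given lag samples."""
    if samples[-1] >= lag_threshold:
        return "lag"
    accel = lag_acceleration(samples)
    if len(accel) >= 2 and all(a >= accel_threshold for a in accel[-2:]):
        return "acceleration"
    return None


speeding_up = [100, 120, 160, 220, 300]   # growth per interval: 20, 40, 60, 80
steady_growth = [100, 110, 120, 130, 140]  # growth per interval: 10, 10, 10, 10
over_threshold = [900, 1000, 1100]

early_warning = alert_reason(speeding_up, lag_threshold=1000, accel_threshold=10)
no_alert = alert_reason(steady_growth, lag_threshold=1000, accel_threshold=10)
hard_alert = alert_reason(over_threshold, lag_threshold=1000, accel_threshold=10)
```

In the first series the lag is still far below the absolute threshold, but the acceleration rule fires because each interval adds more lag than the one before.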
Production readiness checklist
- Classify event types into "critical" and "transient."
- Document durability and replay requirements for each class.
- Define idempotency strategy at producer and consumer boundaries.
- Standardize topic/channel naming and ownership.
- Write an explicit backpressure policy (queue, shed, degrade).
- Set SLOs: p95 latency, acceptable data loss, maximum lag.
- Apply schema versioning and compatibility checks in CI.
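The backpressure item in the checklist can be made concrete by writing the overflow behavior into the buffer itself. The sketch below is one interpretation of "queue, shed, degrade", stated as an assumption: "queue" asks the caller to retry later, "shed" drops the new item, and "degrade" evicts the oldest item to keep the freshest data. `BoundedBuffer` and `run` are hypothetical names.

```python
from collections import deque


class BoundedBuffer:
    """Fixed-capacity buffer whose overflow behavior is an explicit policy."""

    def __init__(self, capacity, policy):
        if policy not in ("queue", "shed", "degrade"):
            raise ValueError(f"unknown policy: {policy}")
        self.capacity = capacity
        self.policy = policy
        self.items = deque()
        self.dropped = 0

    def offer(self, item):
        if len(self.items) < self.capacity:
            self.items.append(item)
            return "accepted"
        if self.policy == "queue":
            return "retry_later"  # caller must wait; nothing is lost here
        if self.policy == "shed":
            self.dropped += 1  # the new item is discarded
            return "dropped"
        self.items.popleft()  # degrade: evict the oldest, keep the freshest
        self.items.append(item)
        self.dropped += 1
        return "replaced_oldest"


def run(policy):
    """Offer three items to a buffer of capacity two and report the outcome."""
    buffer = BoundedBuffer(capacity=2, policy=policy)
    results = [buffer.offer(n) for n in (1, 2, 3)]
    return results, list(buffer.items), buffer.dropped
```

Calling `run("queue")`, `run("shed")`, and `run("degrade")` shows the three outcomes side by side. Critical events usually call for the first policy, while transient signals can often tolerate the other two.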
Redis Pub/Sub and Kafka are not interchangeable clones; they target different reliability envelopes. Redis Pub/Sub is a great fit for low-latency transient signaling. Kafka is a safer backbone for durable, replayable, auditable event streams. As systems mature, using both intentionally is often the most practical architecture.