Apache Kafka: distributed messaging for the data era

Apache Kafka is a distributed messaging system built on append-only logs, partitioning, consumer groups and configurable retention, treating data as a continuous stream.

Tags: Open Source, Networking, Apache Kafka, Messaging, Streaming, Distributed Systems

From LinkedIn to the open source world

Apache Kafka originates inside LinkedIn to solve a concrete problem: handling the data flows generated by hundreds of millions of daily events — page views, profile updates, operational metrics — and making them available in real time to dozens of different systems. Existing messaging systems such as ActiveMQ or RabbitMQ are designed for traditional message queues: point-to-point or publish-subscribe delivery with message deletion after consumption. LinkedIn needs something different: a system that treats data as a continuous stream and retains it to be re-read by multiple independent consumers.

The project, developed by Jay Kreps, Neha Narkhede and Jun Rao, is released as open source in 2011 and donated to the Apache Software Foundation, where it becomes a top-level project in 2012.

A distributed log as foundation

Kafka’s architecture rests on a simple concept: the append-only log. Every message published to a topic is appended to a log file and identified by a sequential offset. Messages are not deleted after reading: they remain available for a configurable retention period, which can span hours, days or be indefinite.
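
The idea can be sketched in a few lines of Python. This is a minimal illustration of an append-only log with sequential offsets, not Kafka's actual API; the names (`Log`, `append`, `read_from`) are invented for the example.

```python
class Log:
    """Minimal append-only log: records are appended, never mutated or removed."""

    def __init__(self):
        self._records = []

    def append(self, message):
        """Append a message and return its sequential offset."""
        self._records.append(message)
        return len(self._records) - 1

    def read_from(self, offset):
        """Read every message from a given offset onward; reading deletes nothing."""
        return self._records[offset:]


log = Log()
log.append("page_view")       # offset 0
log.append("profile_update")  # offset 1

# Two independent reads see the same data: nothing is "consumed away".
assert log.read_from(0) == ["page_view", "profile_update"]
assert log.read_from(1) == ["profile_update"]
```

Because reads are just offset lookups, any number of readers can scan the same log at different positions, which is what makes the retention model possible.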

Each topic is divided into partitions distributed across multiple cluster nodes called brokers. Partitioning enables horizontal scalability: more partitions mean more throughput, because producers and consumers can operate in parallel on different partitions. Partition replication across different brokers ensures fault tolerance.
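
How a message is routed to a partition is usually driven by its key. The sketch below mimics key-based assignment; it uses CRC32 as a stand-in hash (Kafka's Java client actually uses murmur2), and the names are illustrative.

```python
import zlib

NUM_PARTITIONS = 3  # assumption for the example; a real topic chooses its own count


def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a record key to a partition (simplified hash)."""
    return zlib.crc32(key) % num_partitions


# Records with the same key always land in the same partition,
# which preserves per-key ordering while spreading load across brokers.
assert partition_for(b"user-42") == partition_for(b"user-42")
assert 0 <= partition_for(b"user-7") < NUM_PARTITIONS
```

This is also why ordering in Kafka is guaranteed per partition, not per topic: only records sharing a key are forced onto the same log.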

Consumer groups and decoupling

Consumers organise into consumer groups. Within a group, each partition is assigned to exactly one consumer, so within the group each message is handled by a single consumer. Different groups read independently from the same topic, each maintaining its own offset. This model allows a real-time analytics system, an indexing process and a notification service to all read the same data simultaneously without interfering with one another.
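
The offset bookkeeping behind this can be sketched as follows. Each group keeps its own cursor into the same partition log; the class and method names (`GroupCursor`, `poll`) are invented for the example, not the Kafka client API.

```python
topic_partition = ["e1", "e2", "e3", "e4"]  # one partition's log of events


class GroupCursor:
    """Per-group read position into a partition (stand-in for a committed offset)."""

    def __init__(self):
        self.offset = 0

    def poll(self, log, max_records=2):
        """Fetch the next batch and advance ('commit') this group's offset."""
        batch = log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch


analytics = GroupCursor()
indexing = GroupCursor()

assert analytics.poll(topic_partition) == ["e1", "e2"]
assert analytics.poll(topic_partition) == ["e3", "e4"]
# The indexing group starts from its own offset, unaffected by analytics.
assert indexing.poll(topic_partition) == ["e1", "e2"]
```

Since the broker only stores the log and each group tracks its own position, adding a new consumer group never disturbs existing readers.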

The producer publishes messages without knowing who will consume them. The consumer reads from the log at its own pace, without the broker needing to manage delivery state. Decoupling is complete: producers and consumers do not need to be active simultaneously.

Data as a stream

Kafka changes the perspective on messaging: data are not messages to deliver and forget, but a persistent stream from which every system draws what it needs, when it needs it. For organisations generating significant volumes of events — application logs, metrics, transactions — Kafka provides an infrastructure on which to build reliable and scalable data pipelines.

Link: kafka.apache.org
