Kafka

The Apache Kafka documentation sums up what Kafka is in a single line:
Apache Kafka® is an event streaming platform.
What’s an event?
An event is simply a change in the state of something. Say you are ordering food online: confirming your order is an event, a successful payment is an event, and the food arriving at your doorstep is an event too.
A streaming platform, in turn, is one that takes this continuous flow of events, processes and interprets them, and delivers something meaningful to the end consumer.
Let’s take the simple analogy of a newspaper. A newspaper is a collection of events that have happened around the world, gathered by people in various locations and processed into a reader-friendly format. In a way, it is an indirect ancestor of Kafka.
But Kafka is not just this. It is a lot more than a collection and processing of events, which is why it is one of the most popular event streaming platforms, used by industry giants including Netflix, PayPal, Spotify, Twitter, and Airbnb.
Advantages of Kafka-based messaging and streaming systems
In a traditional messaging system, it is the producer’s responsibility to ensure a message reaches the end consumer. Not so in Kafka: producers and consumers remain anonymous to each other, and that decoupling is a huge advantage of Kafka.
Kafka has topics: producers push events to a topic, and consumers keep polling that topic for new messages.
To achieve this, Kafka introduces asynchronicity into the picture. Our application thread becomes non-blocking, which makes scaling producers and consumers, and managing resources, much simpler.
We also achieve high flexibility and reliability through the retention policies and replication options provided by Kafka.

Let’s take a deep dive into the various parts that make up Kafka.
Records
Records are the event data that go into and out of the Kafka server. Continuing with the newspaper analogy, a record is the actual piece of information you read, with a heading, a body, and some related details. Similarly, each record pushed to the Kafka server has a key, a value, and other metadata used to interpret and process it.
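As a rough sketch in Java (the topic name, key, value, and header here are made-up placeholders), a record looks like this:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordSketch {
    public static void main(String[] args) {
        // The key routes the record to a partition, the value carries the
        // event payload, and headers hold extra metadata about the event.
        ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", "order-42", "ORDER_CONFIRMED");
        record.headers().add("source", "checkout-service".getBytes(StandardCharsets.UTF_8));
        System.out.println(record);
    }
}
```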
Producers
Producers are the clients of the Kafka server that create events and push them into Kafka. They are like the various sources that feed information to a newspaper organization, each supplying material for a particular section. Similarly, a producer publishes its events to a particular topic. But what is a topic?
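Before moving on, here is a minimal producer sketch (the broker address, topic, key, and value are placeholders). Note that send() is non-blocking, which ties back to the asynchronicity mentioned earlier: it returns immediately, and a callback fires once the broker acknowledges the write.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() returns at once; the callback runs when the broker acknowledges.
            producer.send(new ProducerRecord<>("orders", "order-42", "ORDER_CONFIRMED"),
                    (metadata, exception) -> {
                        if (exception != null) exception.printStackTrace();
                        else System.out.printf("acked: partition=%d offset=%d%n",
                                metadata.partition(), metadata.offset());
                    });
        }
    }
}
```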
Topics
Topics are the foundational building blocks of Kafka’s workflow. Think of them as the different sections of a newspaper: the information related to a section is compiled and presented together under it. Likewise, every event emitted by a producer carries a topic name, and all events under that name are maintained together as that topic. Then who is going to use that information?
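First, though, a sketch of creating a topic programmatically with the Java AdminClient (the name, partition count, and replication factor are illustrative choices):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // An "orders" topic: 3 partitions, each copy kept on 2 brokers.
            admin.createTopics(Collections.singletonList(new NewTopic("orders", 3, (short) 2)))
                 .all().get(); // block until the cluster confirms creation
        }
    }
}
```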
Consumers
Consumers are the other set of clients of the Kafka server. They are like newspaper readers: a reader subscribes to a section and consumes everything published under it, and similarly a consumer keeps polling for new events on the topics it is interested in. With a large set of consumers, how can we ensure the data is delivered without failure or loss?
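Here is a matching consumer sketch, polling the same illustrative topic (the group id and poll interval are arbitrary):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-readers"); // consumers in a group share the work
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) { // keep polling for new events on the topic
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```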
Brokers
Brokers are the heart of the Kafka architecture. Each broker stores the events for the topics (more precisely, the topic partitions) assigned to it. To ensure there is no data loss and to avoid a single point of failure, the data is replicated according to a replication factor set when the topic is created, and in most deployments a cluster of multiple brokers serves the topics for exactly this reason. In the analogy, the brokers are the newspaper organization itself, collecting information from producers and publishing it to consumers.
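As a quick sanity check, the AdminClient can list the brokers in a cluster (again assuming a placeholder local broker address):

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;

public class ListBrokersSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Print every broker node currently in the cluster.
            admin.describeCluster().nodes().get().forEach(node ->
                    System.out.printf("broker %d at %s:%d%n",
                            node.id(), node.host(), node.port()));
        }
    }
}
```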
Replication factor
The replication factor denotes how many copies of each piece of information are stored across the brokers’ file systems. If a particular broker fails or goes down, another broker in the cluster that holds the replicated data steps in and serves the required event data to the consumer. It is essentially a backup feature. But do we store this data forever?
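Before that, a hedged sketch of how the replication factor is set per topic, reusing the AdminClient creation call from above (the values are illustrative; min.insync.replicas, together with a producer’s acks=all setting, controls how many replicas must confirm a write):

```java
import java.util.Map;
import org.apache.kafka.clients.admin.NewTopic;

// Replication factor 3: every partition of "payments" lives on three brokers.
// With min.insync.replicas=2 and a producer using acks=all, a write succeeds
// only after at least two replicas have stored it.
NewTopic topic = new NewTopic("payments", 3, (short) 3)
        .configs(Map.of("min.insync.replicas", "2"));
```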
Retention Policy
We do not store event data forever; there is no point in the Kafka server doing so once the data has been consumed and processed by the clients. The retention policy determines how long the data stays in the Kafka broker’s file system. By default, it is 7 days.
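Retention is also a per-topic setting. Extending the topic-creation sketch above, retention.ms holds the window in milliseconds (604,800,000 ms is the 7-day default):

```java
import java.util.Map;
import org.apache.kafka.clients.admin.NewTopic;

// Keep events on this topic for 7 days (604800000 ms), Kafka's default.
NewTopic topic = new NewTopic("orders", 3, (short) 2)
        .configs(Map.of("retention.ms", "604800000"));
```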
Partitions and Offsets
When a consumer goes down for some time and comes back, how does it know where to resume reading? Offsets are per-partition indexes (0, 1, 2, and so on) that guarantee ordering when clients read event data from the brokers. Partitions split a topic’s event data into separate buckets spread across brokers, for scalability and reliability.
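This is visible in the client API: a (partition, offset) pair pin-points a single record, so a returning consumer can seek straight to where it left off. A sketch using a consumer configured like the earlier one, but with assign() instead of subscribe() (the partition number and offset are placeholders):

```java
import java.util.Collections;
import org.apache.kafka.common.TopicPartition;

// Pin the consumer to partition 0 of "orders" and resume at offset 42:
// the next poll() starts reading from exactly that record.
TopicPartition partition = new TopicPartition("orders", 0);
consumer.assign(Collections.singletonList(partition));
consumer.seek(partition, 42L);
```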
Zookeeper
On top of all of these sits the main coordinator: ZooKeeper. It is quite hard to ensure that all the units in the Kafka architecture work well together, and ZooKeeper orchestrates them, making sure everything in the Kafka setup is fine. It receives heartbeats from the brokers at specified intervals to confirm that everything is running smoothly.
Cons of Apache Kafka
It’s not all rainbows and sunshine, though; Kafka has its drawbacks. You might already feel overwhelmed by the number of components introduced right at the start, and for a beginner this is a complex messaging system.
Going through the various components, you have also come across many configurations that can modify the flow of events: the number of brokers, the replication factor, the retention policy, and so on. That is a lot of knobs to turn to get your system functioning optimally.
Kafka’s asynchronous nature also delivers high throughput, but batching and asynchronous delivery can increase end-to-end latency. Hence there is a trade-off between throughput and latency.
Conclusion
Apache Kafka is a very popular framework used by multiple industry giants. There are many reasons to use Kafka in your project, the most important being that it is an asynchronous, pub-sub-based, event-driven model. A heads-up before including Kafka in your project: Kafka is highly effective when a large volume of events flows through the system. For relatively small projects, Kafka is overkill.