What Is Apache Kafka & What Can You Use It For?
LinkedIn, Microsoft, and Netflix all use Apache Kafka to process over 1 trillion messages each day. One-third of all Fortune 500 companies use Kafka in their tech stack, including major banks, insurers, and telecom providers. Despite such widespread usage, most people have never heard of Apache Kafka. So, what is Kafka, and what can it be used for?
In this article, we’ll explore the reasons behind Apache Kafka’s exploding popularity. We’ll also look at use cases that show why top companies, including some of our clients, are using Kafka.
Kafka = Message Queue + Pub-Sub
If you’re a developer, perhaps that section heading makes sense to you. If not, let’s break it down.
Kafka takes the best parts of two different data models – message queues and publish-subscribe. In so doing, it’s able to receive thousands of inputs simultaneously and still process them sequentially.
1. Message Queue
A message queue is a data structure that allows you to store information to be picked up and used sequentially later. It follows a first-in, first-out structure, much like the queues we’re used to at the ticket booth or the fast-food drive-through. The first person in line is the first to place their order and the first to receive their item. The same is true of a message queue. The first message in the queue gets processed first.
Here, “processed” could mean any number of things; it doesn’t really matter what you’ll be doing with the data. But receiving simultaneous inputs and ensuring that they get processed sequentially is a non-trivial technical task. Queueing also allows you to spread tasks across multiple processors (aka workers) while still handling them in order. When a worker finishes, it returns to the queue to receive its next task.
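The first-in, first-out behavior described above can be sketched in a few lines of plain Python (no Kafka involved); the queue, message, and worker names here are made up for illustration:

```python
from collections import deque

# A minimal FIFO sketch: messages go in at the back and come out at the
# front in arrival order, no matter which worker pulls the next one.
queue = deque()
for msg in ["order-1", "order-2", "order-3", "order-4"]:
    queue.append(msg)  # enqueue at the back

processed = []
workers = ["worker-A", "worker-B"]
turn = 0
while queue:
    msg = queue.popleft()  # dequeue from the front: first in, first out
    processed.append((workers[turn % len(workers)], msg))
    turn += 1

# Each worker receives the next task in strict arrival order:
# worker-A gets order-1, worker-B gets order-2, and so on.
```

Even with two workers alternating, the tasks themselves are handed out in exactly the order they arrived, which is the property the article is describing.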
This is the challenge, though: a queue is typically only able to support a single subscriber. We can only issue one task to one worker at a time.
2. Publish-Subscribe
Another important data model is publish-subscribe. In this model, a single publisher can send the same message to many subscribers. All a subscriber has to do is join a publishing stream, and they’ll receive everything that stream emits.
Furthermore, a subscriber can join multiple publisher streams. And a publisher can send messages to many subscribers simultaneously.
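A bare-bones publish-subscribe model can be sketched in plain Python as well; the `Publisher` class and variable names here are hypothetical, not part of any real library:

```python
# A minimal pub-sub sketch: one publisher, many subscribers, and every
# subscriber receives every message the stream emits.
class Publisher:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        # Broadcast: the same message goes to all subscribers.
        for callback in self.subscribers:
            callback(message)

received_a, received_b = [], []
stream = Publisher()
stream.subscribe(received_a.append)
stream.subscribe(received_b.append)
stream.publish("price-update")  # both subscribers get the same message
```

Note the contrast with the queue sketch earlier: there, each message went to exactly one worker; here, each message goes to everyone.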
However, we can’t use publish-subscribe to divide tasks across multiple worker processes, because every message goes to every subscriber. So, how can we receive thousands of requests simultaneously, queue them, and then efficiently divide them across many worker processes?
How Kafka Combines Both Models
Kafka solves the problem of assigning tasks on high-throughput message streams. (We’ll look at practical applications in a second.)
To do so, Kafka keeps a log of incoming messages, ordered sequentially. Kafka then partitions the log in such a way that it can send a segment of the message load to each subscriber. As a result, there can be multiple subscribers (i.e. workers) to a topic, and each will receive its own segment (partition) of the message stream.
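The partitioning idea can be illustrated with a toy sketch in plain Python. This is not Kafka’s actual implementation (Kafka’s default partitioner hashes the message key with murmur2); the hash function, partition count, and keys below are simplified assumptions for illustration:

```python
# Toy sketch of key-based partitioning: messages with the same key always
# land in the same partition, so one consumer owns each partition and
# sees that key's messages in order.
NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Simplified deterministic hash for illustration only; real Kafka
    # uses murmur2 on the serialized key.
    return sum(key.encode()) % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}
events = [("user-1", "click"), ("user-2", "view"), ("user-1", "buy")]
for key, value in events:
    partitions[partition_for(key)].append((key, value))

# All of user-1's events end up in a single partition, preserving their
# relative order, while user-2's events can go to a different consumer.
```

This is how Kafka gets queue-like load sharing (each partition has one consumer in a group) while still broadcasting topics pub-sub style to multiple consumer groups.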
This offers massive scalability for applications with millions, billions, or trillions of messages per day.
Kafka’s Practical Applications
Kafka is often the first piece of infrastructure a company puts in place when it wants to capture and analyze streaming, real-time data. For instance, a company might want to track user behavior on its website. Certain actions on the site could emit a message that Kafka picks up, queues, partitions, and sends to a worker for processing in real time.
Any real-time analytics or dashboard for a large application with many users and quickly changing data could use Kafka to get up-to-the-second status updates. Kafka is also very good at receiving multiple types of messages and sorting them into topics. As a result, you can send your user data, error logs, database profiling data, and anything else to Kafka and let it sort them, so you can determine what needs to be processed now, in real time, and what can wait to run in the background later.
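The sorting-into-topics idea can be sketched as simple routing logic in plain Python. The routing function and message shapes below are hypothetical (in real Kafka, producers name the destination topic explicitly when sending):

```python
# Hypothetical sketch: route mixed incoming messages to named topics so
# urgent streams (user events, errors) can be consumed in real time while
# everything else waits in a catch-all topic for background processing.
def topic_for(message: dict) -> str:
    routes = {"click": "user-events", "error": "error-logs"}
    return routes.get(message["kind"], "misc")

topics = {}
incoming = [{"kind": "click"}, {"kind": "error"}, {"kind": "slow-query"}]
for msg in incoming:
    topics.setdefault(topic_for(msg), []).append(msg)

# user events and error logs now sit in their own topics; the unmatched
# "slow-query" message falls through to "misc" for later processing.
```

In a real deployment, separate consumer groups would then read each topic at their own pace, which is what lets real-time and background workloads share one pipeline.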
Kafka is usually a piece of infrastructure that’s part of a larger data pipeline. Kafka alone won’t produce insights, but it will enable you to structure and process data to derive your own insights. While it’s not a good fit for every company, it can be a lifesaver in situations where there’s a lot of incoming data all the time.
Founded in 1991, Intertech delivers software development consulting and IT training to Fortune 500, Government and Leading Technology institutions. Learn more about us. Whether you are a developer interested in working for a company that invests in its employees or a company looking to partner with a team of technology leaders who provide solutions, mentor staff and add true business value, we’d like to meet you.