22 March 2019
Part I
Back in January 2019, I presented an introduction to Kafka basics and spring-kafka at a South Bay JVM User Group meetup. I gave a bird’s-eye view of what Kafka offers as a distributed streaming platform. We explored a few key concepts and dove into an example of configuring spring-kafka to be a producer/consumer client. In this two-part blog series, I’ll further discuss this topic and how we’re developing with Kafka at mobileforming.
Apache Kafka was created at LinkedIn, and what was initially conceived as a messaging queue became the backbone to move data from one system to another. Kafka is suitable for building real-time data pipelines and streaming applications. Some use cases that come to my mind are messaging, website activity tracking, metrics, log aggregation, stream processing, event sourcing, and commit logs.
Now that we know what Kafka is, it’s time to ask: why Kafka? Is it because it’s new and shiny? I’d argue it’s because Kafka is fast, reliable, and scalable, among other things.
Let’s start with producers. Producers write data to brokers and are responsible for load balancing, since they choose which partition each record goes to. Next come consumers. Consumers request a range of messages from a broker and are responsible for tracking their own state, that is, how far they have read. A message is a record that consists of a key, a value, and a timestamp.
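To make the producer's load-balancing role concrete, here is a minimal sketch of key-based partition selection. It is not Kafka's code: `choose_partition` is a hypothetical helper, `zlib.crc32` stands in for the hash (Kafka's default partitioner uses murmur2), and the round-robin handling of keyless records is an assumption made for the sketch.

```python
import zlib


def choose_partition(key, num_partitions, counter=0):
    """Pick a partition for a record (illustrative only).

    Assumption: zlib.crc32 stands in for the hash function; Kafka's
    default partitioner uses a different one (murmur2).
    """
    if key is None:
        # No key: spread records across partitions in round-robin order,
        # driven by a counter the caller increments per record.
        return counter % num_partitions
    # Same key -> same hash -> same partition, so per-key order is kept.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    for key in ["user-1", "user-2", "user-1"]:
        print(key, "->", choose_partition(key, num_partitions=3))
```

The point of hashing the key is that every record for a given key lands in the same partition, which is what lets a consumer see those records in order.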
It’s also important to note that data is stored as a stream of records in a topic. Topics are multi-subscriber. Each topic is split into partitions, and partitions are then replicated.
Each partition is an ordered, immutable sequence of messages that is continually appended to. The records in each partition are assigned a sequential ID number called the offset. All published records are durably persisted by the Kafka cluster and retained based on a configurable retention period. The number of partitions determines the maximum parallelism within a consumer group.
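The append-only log, the offsets, and time-based retention can be sketched with a small in-memory model. `PartitionLog` and `Record` are hypothetical names for illustration; a real broker persists this to disk, while the position a consumer has read up to lives on the consumer side.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Record:
    offset: int
    key: str
    value: str
    timestamp: float


@dataclass
class PartitionLog:
    """Hypothetical in-memory stand-in for a single partition."""
    records: list = field(default_factory=list)
    next_offset: int = 0

    def append(self, key, value, timestamp=None):
        """Append a record at the end and return the offset it was given."""
        ts = time.time() if timestamp is None else timestamp
        self.records.append(Record(self.next_offset, key, value, ts))
        self.next_offset += 1
        return self.next_offset - 1

    def read(self, start_offset, max_records):
        """Return up to max_records records with offset >= start_offset."""
        matching = [r for r in self.records if r.offset >= start_offset]
        return matching[:max_records]

    def expire(self, now, retention_seconds):
        """Drop records older than the retention window.

        Offsets of dropped records are never reused.
        """
        self.records = [
            r for r in self.records if now - r.timestamp <= retention_seconds
        ]


if __name__ == "__main__":
    log = PartitionLog()
    for i in range(5):
        log.append("k", f"event-{i}", timestamp=float(i))
    position = 0  # consumer-owned state: the next offset to read
    batch = log.read(position, max_records=2)
    position = batch[-1].offset + 1
    print([r.value for r in batch], "next position:", position)
```

Reading does not remove anything from the log; only the retention step does, which is why several consumers can read the same partition at their own pace.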
Consumer groups come to consensus via ZooKeeper and broker leaders. Partitions are distributed evenly among the consumers in a group, i.e., two consumers in the same group never share a partition. If a group has more consumers than there are partitions, the extra consumers are not assigned a partition.
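The effect of having more consumers than partitions is easy to see in a toy assignment function. This is a simplified round-robin sketch with a hypothetical `assign_partitions` helper; Kafka's actual assignment strategies are configurable and more involved.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over the consumers of one group.

    Each partition goes to exactly one consumer; consumers beyond the
    number of partitions end up with an empty list (idle).
    """
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    ordered = sorted(consumers)
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    print(assign_partitions([0, 1, 2], ["c1", "c2"]))
    print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"]))
```

With three partitions and four consumers, the fourth consumer receives nothing, which is why adding consumers beyond the partition count does not add throughput.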
Replicas are backups of a partition that exist solely to prevent data loss. Clients never read from or write to follower replicas directly; all reads and writes go through the partition’s leader. Replicas therefore do not increase producer or consumer parallelism.
Brokers receive messages from producers (push) and deliver messages to consumers (pull). They’re responsible for some partitions, and they also keep copies of other partitions. They communicate with producers and consumers over a language-agnostic TCP protocol. Brokers are typically run as a cluster on one or more servers that can span multiple data centers. Partitions are distributed and replicated over the brokers/servers in the cluster.
Each server acts as a leader for some of its partitions and a follower for others, so the load is well balanced within a cluster. If the leader fails, one of the followers will automatically become the new leader.
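The leader/follower arrangement and failover can be illustrated with a short simulation. The function names and the placement rule are assumptions for the sketch; in particular, it picks the first surviving replica as leader and ignores details of how a real cluster decides which followers are eligible.

```python
def place_replicas(num_partitions, brokers, replication_factor):
    """Give each partition a replica list; the first entry is its leader."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds number of brokers")
    placement = {}
    for partition in range(num_partitions):
        # Shift the starting broker per partition so leadership is spread out.
        placement[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return placement


def elect_leaders(placement, live_brokers):
    """Leader = first replica still alive, or None if all replicas are down."""
    leaders = {}
    for partition, replicas in placement.items():
        leaders[partition] = next(
            (broker for broker in replicas if broker in live_brokers), None
        )
    return leaders


if __name__ == "__main__":
    brokers = ["b0", "b1", "b2"]
    placement = place_replicas(3, brokers, replication_factor=2)
    print("all up:   ", elect_leaders(placement, set(brokers)))
    print("b0 failed:", elect_leaders(placement, {"b1", "b2"}))
```

In the healthy case each broker leads one partition; once `b0` is gone, its partition is taken over by the follower that held the copy.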
ZooKeeper is required for Kafka cluster operations. It has the following responsibilities:
- Membership of the cluster: list of all the brokers that are functioning at any given moment and are part of the cluster
- Controller election: if the controller node shuts down, a new controller is elected, and ZooKeeper ensures that at any given time there is only one controller and that all the follower nodes agree on it
- Configuration of topics: the configuration of all the topics, including the list of existing topics, the number of partitions for each topic, the location of all the replicas, per-topic configuration overrides, and which node is the preferred leader
- Access control lists for all the topics
As for performance, reported benchmark figures include:
- Up to 2 million writes/sec with 3 producers and 3x asynchronous replication
- About 2.5 million reads/sec with 3 parallel consumers reading a topic
- End-to-end latency: 2 ms (median), 3 ms (99th percentile), and 14 ms (99.9th percentile)
[Figure: throughput vs. size]
To sum up this first part of getting to know Kafka: it’s fast, thanks to sequential reads and writes, use of the page cache, its lightweight design, and its simple protocols. It’s scalable: cluster management and partitioned, distributed queues make it easy to spin up new brokers and support a very large number of producers and consumers. Kafka is reliable thanks to its data replication and fault tolerance. It’s durable, as it persists messages to disk and retains them even after consumption.
The built-in Streams API is a further advantage, since it lets custom business logic drive events. However, Kafka is not without its faults. The most common issues are consumer lag and consumer rebalancing. Clients tend to be heavy because most of the message-handling logic is the client’s own responsibility. Without a good client, message consumption will have issues.
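Consumer lag is simply the gap between the end of a partition's log and the offset a consumer group has committed. A minimal sketch of that arithmetic follows; the `consumer_lag` helper is hypothetical, and treating a partition with no committed offset as starting from 0 is an assumption.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = log end offset - committed offset.

    Assumption: a partition with no committed offset counts from 0.
    """
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


if __name__ == "__main__":
    ends = {0: 100, 1: 50}
    committed = {0: 90}
    lag = consumer_lag(ends, committed)
    print(lag, "total:", sum(lag.values()))
```

A lag that keeps growing means the group is consuming more slowly than producers are writing, which is usually the first thing worth monitoring.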
Keep an eye out for Part II in which I’ll discuss how we integrate with Kafka.