Drawbacks of using Kafka as a task processing queue
Kafka is great for a lot of things, but using it as a task-processing queue is not where it shines, and I’ll tell you a few reasons why.
Glossary
cluster
= A Kafka cluster is usually made up of a few brokers that use a predefined protocol to keep each other updated about their status and the status of the topics and partitions assigned to them.
broker
= The Kafka core system. The broker is where the data lives; it manages the flow of messages between itself and the consumers and is responsible for storing, replicating, and purging the data in topics. Brokers are usually run with ephemeral disks to cope with the high I/O demands.
topic
= The Kafka way of describing an independent stream of data.
partition
= A subset of a topic containing a unique set of messages. A message belongs to exactly one partition, decided by the message's partition key.
message
= The Kafka concept of an "atomic grouping of data". This is what publishers send to Kafka and what consumers receive.
job
= A predefined task to be executed in a non-interactive way. Jobs are usually used for long-running or costly functions that are not suitable for running in an on-demand, interactive way.
Unequal submission versus processing cost
In Kafka land, submitting a message to the broker is a very cheap and fast operation. However, for most tasks, processing the job is neither fast nor cheap - thus creating an imbalance in the system and putting pressure on the consumers to keep up with the production rate.
In addition, Kafka has quite tight deadlines on the consumers' keepalive settings. It is not unusual for jobs to run for more than 60s - which is how long a Kafka consumer can be out of action before Kafka marks it as dead. There are workarounds for this - but that's what they are - workarounds!
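To make that concrete, here is a minimal sketch (Python, using the kafka-python client) of the kind of workaround meant here: stretching the poll and session timeouts so a consumer running a long job is not marked dead. The topic, group, and run_job handler are hypothetical placeholders, not part of the original setup.

```python
from kafka import KafkaConsumer

# Hypothetical topic/group names; the broker address is a placeholder.
consumer = KafkaConsumer(
    "jobs",
    bootstrap_servers="localhost:9092",
    group_id="job-workers",
    enable_auto_commit=False,      # only advance the offset once the job succeeds
    max_poll_interval_ms=600_000,  # allow up to 10 minutes between polls
    session_timeout_ms=60_000,     # heartbeat window before the consumer is marked dead
)

for record in consumer:
    run_job(record.value)          # hypothetical long-running job handler
    consumer.commit()              # commit only after successful processing
```

Stretching these timeouts keeps the consumer alive during long jobs, but it also delays how quickly Kafka notices a genuinely dead worker - which is exactly why they remain workarounds rather than a solution.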
Kafka does not auto-scale
Kafka topics have a preset number of partitions, which sets the maximum concurrency that can be achieved when reading from that topic. If any work is done before ACK'ing a message and advancing the offset, how long that work takes determines the available throughput for the topic.
For example, if processing a message involves an update that takes 100ms on average, then one partition can deliver roughly ten messages per second. So if one needs to process 1000 messages per second, one will need 100 partitions.
If faced with a high-lag situation because messages are processed more slowly than they arrive, one can increase the number of partitions - but this only affects future messages. The messages already sitting in the existing partitions still need to be processed from those partitions, as Kafka will not move them to the new ones.
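For reference, a partition increase is an explicit administrative operation rather than anything automatic. A minimal sketch with kafka-python's admin client (topic name and count are illustrative) might look like this:

```python
from kafka.admin import KafkaAdminClient, NewPartitions

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Raise the partition count of the (hypothetical) "jobs" topic to 100.
# This only affects how future messages are spread; existing messages
# stay on the partitions they were originally written to.
admin.create_partitions({"jobs": NewPartitions(total_count=100)})
```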
Further details of why a high number of partitions can cause problems are described in this Confluent article.
Kafka does not provide a native retry mechanism
Natively, Kafka does not deal with failure scenarios in consumption. If a consumer crashes while processing a message, it will not advance the offset of that partition and will repeatedly try - and fail - to process the same message, causing the lag to grow as new messages keep being added to the topic. Failed jobs therefore have to be dealt with, which is commonly done by either:
- Manual intervention to move the offset for the partition forward.
- Patching the code so it no longer crashes but ignores the unprocessable message.
This means that if jobs can fail (not an unusual case), one needs to come up with a method of either:
- Putting the message back on the topic, or
- Creating another topic for messages to be retried later.
Eventually, consistently failing messages will end up on a DLQ
(dead
letter queue), which must be manually set up and monitored.
A paper from Uber details their approach to retries with exponential backoff: link.
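To show how much bookkeeping this pattern requires, here is a rough sketch of the retry-topic approach described above - again using kafka-python, with hypothetical topic names ("jobs", "jobs-retry", "jobs-dlq") and a hypothetical run_job handler. None of this comes with Kafka out of the box:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

MAX_ATTEMPTS = 3

consumer = KafkaConsumer(
    "jobs",                                   # hypothetical main topic
    bootstrap_servers="localhost:9092",
    group_id="job-workers",
    enable_auto_commit=False,
)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for record in consumer:
    payload = json.loads(record.value)
    try:
        run_job(payload)                      # hypothetical job handler
    except Exception:
        attempts = payload.get("attempts", 0) + 1
        payload["attempts"] = attempts
        # Re-publish to a retry topic, or to the DLQ once the retry budget runs out.
        target = "jobs-retry" if attempts < MAX_ATTEMPTS else "jobs-dlq"
        producer.send(target, json.dumps(payload).encode("utf-8"))
        producer.flush()                      # ensure delivery before committing the offset
    finally:
        consumer.commit()                     # always advance so the partition is not blocked
```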
Kafka has high overheads for high-lag topics
Because Kafka brokers are set up with an ephemeral storage model, topics with a consistently high amount of lag put pressure on the broker's storage. If a topic is misconfigured, this can lead to the broker running out of space and crashing.
When a broker crashes, Kafka will have to move all of the existing data from the partitions assigned to that broker to the newly spawned broker. If there was a lot of data in the partitions (due to high lag), moving that data from its replicated copies will take a long time and will put further stress on the other brokers due to the increased required network throughput.
Depending on the publishing settings of the Kafka clients, there can be periods during this time when messages cannot be published to the affected partitions and topics.
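As an illustration of the kind of topic configuration that keeps lag from exhausting broker storage, here is a sketch that creates a (hypothetical) jobs topic with bounded retention using kafka-python's admin client; the limits shown are placeholders and depend entirely on your disk sizing:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Cap how much data a lagging topic can pile up on the brokers.
topic = NewTopic(
    name="jobs",               # hypothetical topic name
    num_partitions=100,
    replication_factor=3,
    topic_configs={
        "retention.ms": str(24 * 60 * 60 * 1000),  # keep messages for at most 1 day
        "retention.bytes": str(10 * 1024**3),      # roughly 10 GiB per partition
    },
)
admin.create_topics([topic])
```

Note that aggressive retention limits cut both ways: they protect the broker's disks, but they can also delete messages that lagging consumers have not yet processed - another trade-off of treating Kafka as a job queue.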
Rebalancing consumers involves a stop-the-world moment
If a topic has a large number of consumers (e.g., 40+), then whenever a consumer dies (because its pod was shuffled, it lost its connection, etc.), the Kafka broker has to perform a rebalance operation. This is an expensive operation, because what happens is:
- The broker signals to all consumers to stop consuming.
- Consumption is stopped.
- The broker decides the best way to split the existing partitions among the current consumers. If there are many partitions and consumers, this can take time, as the broker will try to assign equal work to all consumers.
- Once all partitions have been assigned, consumers will need to ACK their assignments.
- Once all consumers have ACK’ed their partitions, consumption may restart.
During the rebalance, the publishing of messages is not blocked - but the message consumption is stopped. This leads to a period where the topic will accumulate lag, so having headroom in your processing calculations is a must.
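These pauses are easy to observe from the client side. A small sketch using kafka-python's ConsumerRebalanceListener (topic, group, and run_job are illustrative placeholders) that simply logs the revoke/assign phases looks like this:

```python
import logging
from kafka import KafkaConsumer, ConsumerRebalanceListener

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rebalance")

class LogRebalances(ConsumerRebalanceListener):
    # Called while consumption is paused: partitions are taken away first...
    def on_partitions_revoked(self, revoked):
        log.info("revoked: %s", revoked)

    # ...and only handed back once the broker has computed the new assignments.
    def on_partitions_assigned(self, assigned):
        log.info("assigned: %s", assigned)

consumer = KafkaConsumer(bootstrap_servers="localhost:9092", group_id="job-workers")
consumer.subscribe(["jobs"], listener=LogRebalances())

for record in consumer:
    run_job(record.value)   # hypothetical job handler
```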
The more consumers you have, the more likely you are to suffer from these rebalances. In the k8s world, a pod dies - triggering a rebalance - and a new pod starts - triggering another rebalance, further exacerbating the issue.
Rebalancing has other effects too: it can put pods under more CPU/memory pressure, as most consumers are configured around an expected consumption rate from a set number of partitions.
Conclusion
Kafka can be great as a way to get messages from system A to system B, but being used as a job-processing queue is not one of its strong points.
Of course, anything is possible, and there are plenty of use cases in which Kafka can be tuned to perform exceptionally well, even considering all of the above points; however, the question should be - is that the best way of achieving those results?
Could there be a simpler, more robust, purpose-built architecture rather than pushing Kafka into spaces it was not designed for?
An example of a more flexible approach to a task-processing queue would be using Redis with a mature library like 1, 2, 3, 4, 5. Most cloud providers offer a managed version of Redis, and the throughput Redis can provide is enough to sustain very demanding job-processing applications. A demonstration of AWS ElastiCache shows 200k+ TPS - links.
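As a concrete (if minimal) illustration - and not necessarily one of the libraries linked above - here is what enqueuing a job looks like with RQ, a Redis-backed job queue for Python. The module, function, and arguments are hypothetical; the point is that retries, timeouts, and a failed-job registry come with the library rather than having to be built on top:

```python
from redis import Redis
from rq import Queue, Retry

# tasks.py would define the actual job function, e.g. def resize_image(path): ...
from tasks import resize_image   # hypothetical module and job function

queue = Queue("jobs", connection=Redis())   # hypothetical queue name

# Enqueueing is cheap; a separate worker process does the heavy lifting,
# and the library handles retries and the failed-job registry for us.
job = queue.enqueue(
    resize_image,
    "/data/images/example.png",                  # hypothetical argument
    retry=Retry(max=3, interval=[10, 60, 300]),  # retry with increasing delays
    job_timeout=600,                             # long-running work, no keepalive tricks
)
print(job.id)
```

Workers are then started with the `rq worker jobs` command, and scaling out is simply a matter of starting more worker processes - no partition counts or rebalances involved.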