Managing Kafka with Strimzi: Part 1

Ken Wagatsuma
Published in Source Diving
5 min read · Jun 29, 2021

Cookpad Logo, https://www.cookpadteam.com/

Background

Cookpad is a global recipe sharing platform with users in over 70 countries across the world. At Cookpad we use Apache Kafka, a distributed event streaming platform, to make real-time data available to the various backend services that power our platform.

The other day we completed a platform migration project for Apache Kafka that we’ve been working on for the past six months. Previously we used a managed service called Confluent Cloud, but we switched to using Strimzi to deploy Kafka on a Kubernetes cluster built on Amazon EKS.

It was a huge project that involved many engineers from different teams working on the Cookpad platform. As a member of our SRE team I was in charge of the process. This involved design, PoC implementation, setting up monitoring and alerting, and the actual migration work along with two other SRE members. It was a very emotional moment when the last production application was switched over. We experienced some challenges due to our lack of experience in Kafka operations, but it was a great experience for us.

After the migration project was completed, I had a great time celebrating with the other members of my team at the local pub, wearing our project T-shirts. There was a lot behind the decision to switch from a managed service to a self hosted solution, which I hope to share in this blog post.

We’ll have to wait and see what happens next, but for now, let’s take a look at Strimzi.

What is Strimzi?

Strimzi is a “Kubernetes Operator for running Apache Kafka”. The software related to Strimzi is developed under the github.com/strimzi organization, and the Operator itself is developed at github.com/strimzi/strimzi-kafka-operator.

A Kubernetes Operator is an architectural pattern for deploying and operating arbitrary applications (in this case, Apache Kafka and Apache Zookeeper) using Custom Resource Definitions (CRDs). A CRD extends the Kubernetes API with a new resource type; the Operator watches resources of that type and works to bring the running applications in line with what those resources declare.
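To make the pattern a little more concrete, here is a toy sketch of the reconcile idea: compare a declared state with an observed state and work out what needs to change. This is not Strimzi's code (the real Operator is written in Java and is far more involved); the function name, the state model, and the component names are all hypothetical.

```python
# Toy illustration of the reconcile idea behind an Operator.
# NOT Strimzi's implementation: names and the state model are hypothetical.

def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to move `actual` towards `desired`.

    Both dicts map a component name to its replica count.
    """
    actions = []
    for name, replicas in desired.items():
        current = actual.get(name, 0)
        if current < replicas:
            actions.append(f"scale up {name}: {current} -> {replicas}")
        elif current > replicas:
            actions.append(f"scale down {name}: {current} -> {replicas}")
    # Anything running that is no longer declared gets removed.
    for name in actual:
        if name not in desired:
            actions.append(f"delete {name}")
    return actions


desired_state = {"kafka": 3, "zookeeper": 3}                      # what the resource declares
actual_state = {"kafka": 2, "zookeeper": 3, "old-connect": 1}     # what is running

plan = reconcile(desired_state, actual_state)
for action in plan:
    print(action)
```

An operator runs this kind of comparison repeatedly, so that a change to the declared state (or a failure in the running system) eventually results in corrective actions.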

Strimzi was adopted as a CNCF Sandbox Project on 28/08/2019, and core contributors and maintainers from Red Hat are prominent in the project.

According to the blog post “Insights from the first Strimzi survey” published on 29/09/2020, more than 40% of its users are using it in production, 25% of which are running Kafka clusters on Amazon EKS just like us.

If we wanted to deploy Apache Kafka on Kubernetes by ourselves, we would need to deploy not only the Kafka cluster but also various subcomponents:

  • Apache Kafka
  • Apache Zookeeper
  • Kafka Connect Cluster (if required)
  • Kafka Mirror Maker (if required)
  • Kafka Bridge (if required)
  • RBAC (Role, ClusterRole, binding resources, etc.)
  • Load Balancers
  • components/configuration for Monitoring (e.g. Prometheus JMX exporter)
  • components/configuration for ACL (e.g. users)

Writing a full YAML file from scratch to deploy all these applications would be really hard unless you were already an experienced Kafka operator with plenty of Kubernetes experience. There is a lot more to do, such as setting up the network between containers, distributing mTLS and client certificates, setting up Persistent Volumes (PVs), setting up Topics, and so on.

By using the API provided by the operator (expressed in the form of CRD schema), Strimzi makes it much simpler to operate a Kafka cluster.
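As a rough illustration of what that API looks like, the sketch below builds a minimal `Kafka` custom resource as a Python dictionary and prints it as JSON. The field layout follows Strimzi's published example resources, but treat the details as assumptions: the `kafka.strimzi.io/v1beta2` API version depends on the Strimzi release you run, and the cluster name, replica counts, listener, and storage size are made-up values, not our production settings. The helper `kafka_manifest` is hypothetical.

```python
import json


def kafka_manifest(name: str, replicas: int, storage_size: str) -> dict:
    """Build a minimal Strimzi `Kafka` custom resource (illustrative values)."""
    # Persistent storage backed by a PersistentVolumeClaim per broker/node.
    storage = {"type": "persistent-claim", "size": storage_size, "deleteClaim": False}
    return {
        # Assumption: the API version varies between Strimzi releases.
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "Kafka",
        "metadata": {"name": name},
        "spec": {
            "kafka": {
                "replicas": replicas,
                "listeners": [
                    # One TLS-encrypted listener reachable inside the cluster.
                    {"name": "tls", "port": 9093, "type": "internal", "tls": True},
                ],
                "storage": storage,
            },
            "zookeeper": {"replicas": 3, "storage": dict(storage)},
            # Empty objects enable the Topic and User Operators with defaults.
            "entityOperator": {"topicOperator": {}, "userOperator": {}},
        },
    }


manifest = kafka_manifest("example-cluster", replicas=3, storage_size="100Gi")
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed output could be piped to `kubectl apply -f -`. The point is the size of the resource: a few dozen lines describe brokers, Zookeeper, storage, and a listener, and the Operator creates the underlying Kubernetes resources from it.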

Of course, abstraction has some drawbacks.

  • Not all of Kafka’s features are supported by Strimzi.
  • New releases of Kafka are not typically supported until about a month after release.
  • Often to understand fully how Strimzi is performing operations, we have to go and read the source code of the Operator, which is written in Java.

Ultimately, we decided to use Strimzi because it is more reliable than managing all subcomponents in YAML by ourselves, it has been adopted by CNCF and is used in production by other companies, and it is promising and maintainable.

Why Strimzi?

The basic question we had to decide was whether we should use a managed service or run a Kafka cluster by ourselves.

If you want to use a managed service on AWS, you could choose Confluent Cloud or Amazon MSK. First of all, Confluent Cloud is a good service. If you choose to continue using a managed service, Confluent Cloud should remain your first choice. Confluent employs most of the core Apache Kafka developers, so they know how to operate Kafka reliably at scale, and have excellent support.

The main reason we switched from Confluent Cloud to Strimzi is that we made a company-wide decision to run our own Kafka clusters. As the core of our main application gradually starts to use Kafka, and as Kafka becomes more and more important, we need flexibility in data backup, in configuring Kafka clusters and collecting metrics, and in the architecture around networking. We are motivated to accelerate the event-driven architecture around Kafka clusters even at the expense of maintenance and implementation costs.

Having made the decision to run a Kafka cluster by ourselves, it was natural to choose to run it on Kubernetes, because we have many members with Kubernetes experience and skills, and our mid- to long-term strategy for our platform is to use Kubernetes company-wide.

We decided to use Strimzi because, as mentioned earlier, the benefits of using an Operator over creating our own Kubernetes Manifests outweighed any drawbacks.

We also considered Confluent Operator, but didn’t look at it closely, because it requires an Enterprise Subscription to the Confluent Platform.

Conclusion

This is my introduction to Strimzi. In the rest of this series I will cover the implementation process and some interesting things we learnt about Kafka and Kubernetes.

We found that Strimzi was a great fit for us, because we wanted more control over and flexibility with Kafka than we were offered by a managed service, but were not ready to develop tools to manage a complex distributed system. Throughout this project we gained lots of technical knowledge about how Kafka works and are much more confident about the operation of this core component in our platform.

If you are just getting started with Kafka, like we were a couple of years ago, a managed service like Confluent Cloud is a great fit and will save you a lot of time. If you have a team who can invest the time in picking up the knowledge and skills to operate Kafka in house, then Strimzi is a useful tool that can streamline operating your own Kafka clusters.

This project not only gave me technical insight into Kafka and Strimzi, but also involved working closely with different engineers across Cookpad, whose applications rely on a reliable Kafka service.

We’re hiring!

Credit: this blog post was written by Kenju Wagatsuma and reviewed by Ed.
