Managing Kafka with Strimzi: Part 2 — Implementation

Ken Wagatsuma · Published in Source Diving · Aug 23, 2021

In a previous blog post, I introduced Strimzi, a Kubernetes Operator for Apache Kafka, and discussed why we decided to manage our own Kafka cluster.

In this post, let’s see how we built our cluster with Strimzi.

Design Phase & Proof of Concept 📝

The first step was to write a DesignDoc. It describes the main principles and goals of the migration project, comparisons with alternative architectures, target SLIs/SLOs, and a cost estimate. Writing the DesignDoc aligned us on why we were working on this project and what its goals were.

DesignDoc

While writing and getting reviews on the DesignDoc, we embarked on deploying a Proof of Concept (PoC) on a sandbox cluster. Deploying the Kafka cluster was very easy thanks to the Strimzi Custom Resource Definitions (CRDs).
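To give an idea of what this looks like, here is a minimal sketch of a Kafka custom resource; the cluster name, version, and sizes are illustrative placeholders, not our production values:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster            # illustrative name
spec:
  kafka:
    version: 2.8.0            # illustrative Kafka version
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      offsets.topic.replication.factor: 3
      default.replication.factor: 3
      min.insync.replicas: 2
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi         # illustrative size
          deleteClaim: false
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}
```

Applying this single resource is enough for the operator to create and manage all the underlying Kubernetes objects for the brokers and ZooKeeper ensemble.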

We also deployed Kafka MirrorMaker 2.0 (MM2) to bring in data from the old cluster. Deploying MM2 was easy too, thanks to the support Strimzi added for it last year.
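Again, the deployment boils down to a single custom resource. A minimal sketch, with illustrative cluster aliases and bootstrap addresses:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: mirror-maker-2
spec:
  version: 2.8.0
  replicas: 1
  connectCluster: "new-cluster"   # alias of the target cluster below
  clusters:
    - alias: "old-cluster"        # illustrative alias for the source cluster
      bootstrapServers: old-cluster-kafka-bootstrap:9092
    - alias: "new-cluster"
      bootstrapServers: my-cluster-kafka-bootstrap:9092
  mirrors:
    - sourceCluster: "old-cluster"
      targetCluster: "new-cluster"
      sourceConnector:
        config:
          replication.factor: 3
      topicsPattern: ".*"         # mirror every topic
      groupsPattern: ".*"         # mirror every consumer group
```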

Observability 🔍

As for metrics, we deployed JMX exporters and Prometheus servers in the same cluster, in separate namespaces. The Prometheus servers are configured using prometheus-operator, and we used the kube-prometheus Jsonnet library to generate the operator's configuration. We made these metrics available in Grafana, which is an integral part of our observability stack.
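Strimzi can run the JMX Prometheus exporter inside the broker pods, and prometheus-operator can then scrape them with a PodMonitor. A rough sketch of the wiring, following the patterns from Strimzi's examples (treat the names as illustrative):

```yaml
# Excerpt of the Kafka resource: expose JMX metrics via the bundled
# Prometheus JMX exporter, with relabelling rules read from a ConfigMap.
spec:
  kafka:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
---
# PodMonitor so that prometheus-operator scrapes the Kafka pods.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-metrics
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
  podMetricsEndpoints:
    - path: /metrics
      port: tcp-prometheus
```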

We have already been managing another EKS cluster dedicated to monitoring for a while, where we run Thanos, Alertmanager, Grafana, and other observability services. There are also processes such as errm/alertdog that monitor Prometheus and Alertmanager themselves.

As for logging, we administer an in-house logging infrastructure, which had already been built last year with Kinesis Data Streams, Kinesis Data Firehose, Lambda, and Amazon Elasticsearch Service (AES).

Fluentd processes are deployed as a DaemonSet on all our EKS clusters and use awslabs/aws-fluent-plugin-kinesis to send PutRecords requests to Kinesis Data Streams. Logs are viewable in Kibana/Elasticsearch. We used fabric8io/fluent-plugin-kubernetes_metadata_filter to add Kubernetes metadata to improve the log search experience.
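The relevant part of the Fluentd configuration looks roughly like this, shipped to the DaemonSet as a ConfigMap (the stream name and region are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    # Enrich container logs with Kubernetes metadata
    # (fabric8io/fluent-plugin-kubernetes_metadata_filter).
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>

    # Ship logs to Kinesis Data Streams via PutRecords
    # (awslabs/aws-fluent-plugin-kinesis).
    <match kubernetes.**>
      @type kinesis_streams
      stream_name logging-stream    # illustrative stream name
      region ap-northeast-1         # illustrative region
    </match>
```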

Kinesis Data Streams scales up its shards automatically based on the incoming/outgoing workload, which is achieved with a combination of CloudWatch Metrics and Lambda.

Logging Architecture

Performance Test 🔨

At the end of the PoC implementation, we also set up a simple performance-testing mechanism based on the kafka-consumer-perf-test and kafka-producer-perf-test tools that ship with Kafka. With these tools, we defined the performance tests in YAML and then deployed the test pods to the cluster.
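For example, a producer-side test can be expressed as a plain Kubernetes Job that runs the perf-test script from a Kafka image; the topic, record counts, and image tag below are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: producer-perf-test
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: perf-test
          image: quay.io/strimzi/kafka:0.24.0-kafka-2.8.0  # illustrative tag
          command:
            - bin/kafka-producer-perf-test.sh
            - --topic
            - perf-test                # illustrative topic
            - --num-records
            - "1000000"
            - --record-size
            - "1024"
            - --throughput
            - "-1"                     # -1 means no throttling
            - --producer-props
            - bootstrap.servers=my-cluster-kafka-bootstrap:9092
```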

The execution results are written to stdout and converted into metrics using google/mtail. These metrics are available on the Grafana dashboard I mentioned in the previous section.

We executed these tests on the sandbox cluster and confirmed that our configuration could handle the expected throughput of production workloads.

PoC to Production 🚀

In the end, a configuration similar to the one built in the sandbox environment was deployed in the production (and pre-production) clusters.

The EKS clusters themselves are managed by cookpad/terraform-aws-eks. It provides a rather opinionated API focused on our use case, which allows us to deploy new clusters with minimal configuration.

As for CI/CD, we're currently working on improving our CI/CD around Kubernetes, but at the time of writing we're using Jenkins/CodeBuild. It's simple, but it works as expected. CI runs "kubectl apply --dry-run --validate" and "kustomize build && kubectl diff", and "kubectl apply" is executed on CD when the changes are merged into the master branch.

As for the medium-term plan, we're planning to replace this setup with Flux2, and right now we're gradually introducing it to some of our EKS clusters. I can't wait to see it happen.
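With Flux2, the same Git-driven flow becomes declarative: a GitRepository source plus a Kustomization that applies a path from it. A minimal sketch, with an illustrative repository URL and path:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: k8s-config                    # illustrative name
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/example/k8s-config  # illustrative URL
  ref:
    branch: master
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: kafka-cluster
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/kafka              # illustrative path
  prune: true                         # delete resources removed from Git
  sourceRef:
    kind: GitRepository
    name: k8s-config
```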

Conclusion

Deploying Kafka is easy thanks to Strimzi. Observability is also important for keeping a Kafka cluster stable. Finally, it's all about being Cloud Native: improved scalability and reliability.

In the next post, I’ll talk about how we set up Networking.

Credit: this blog post was written by Kenju Wagatsuma and reviewed by Ed.
