Why should you use a Canary in your Apache Kafka coal mine?

The Strimzi Cluster Operator is great for deploying your Kafka cluster on Kubernetes and handling everything related to it. It helps with brokers configuration, managing the security aspects, doing the needed upgrades, scaling up and down and much more. But … what if you want to have more insight on Kafka cluster behavior? What if you want to know if the Kafka brokers are operating correctly so that clients can exchange messages?

This is where a new component in the Strimzi family comes into the picture … please welcome the Canary!

How can a canary help in a coal mine?

When a Kafka cluster is successfully deployed using the Strimzi operator, the next step is to use it with your applications or, depending on your use case, allowing external users to integrate with your solution and use it. Monitoring and alerting are other aspects to consider if you want to know if the Kafka cluster is working correctly, however much you are using it. It’s really useful to have information about producing and consuming latency, connection times, errors on exchanging messages, and so on. If your Kafka cluster is exposed to external users, you might want to know if they are able to connect with and use the cluster. The Canary simulate activity to identify problems from a user perspective, even when the cluster seems to be operating correctly. Another use case could be when you are handling a fleet of Kafka clusters that you want to update to a newer version. By rolling with Canary and updating a few clusters only, you can check that everything is fine before continuing with the rest of the fleet.

These are the main problems that the Canary tries to help with. Canary is not a replacement for monitoring solutions that use Kafka cluster metrics. For example, with Prometheus and Grafana set up, you can use example Grafana dashboards provided by Strimzi for monitoring.

Instead, like the miners who used canaries (sentinel species), as an early warning system of toxic gases in mines, the Strimzi Canary can be used to raise alerts when a Kafka cluster is not operating as expected.

What does the Canary provide?

The Canary tool comprises several internal services. Each service has a different purpose:

A topic service creates and handles a dedicated topic, which is named __strimzi_canary by default. The topic service is used by the Canary producer and consumer to exchange messages.
A producer service sends messages to the Canary topic with a payload containing an identification number and a timestamp. The service evaluates the end-to-end latency on the consumer side.
A consumer service receives the messages sent by the producer through the Canary topic.
A connection service checks the connection to the brokers periodically. The service opens and then immediately closes a connection to each broker just to validate that it’s reachable.
A status service that aggregates some of the producer and consumer metrics exposing them on a dedicated HTTP endpoint in a JSON format.

When the Canary starts for the first time, the topic service creates the dedicated topic with a number of partitions equals to the number of brokers, so that for each broker there is one partition. If the Kafka cluster is scaled up, the service adds more partitions to the topic assigning them in order to have the new partitions on the new brokers. The objective is always having one partition per broker. When the producer and consumer exchange messages through the topic, in effect, they are “testing” all the brokers in the cluster using all partitions. If the Kafka cluster is scaled down, because it’s not possible to remove partitions from a topic, the Canary will just avoid using excess partitions (reusing them if the cluster is scaled up again).

While producer and consumer services exchange messages through the dedicated topic, different metrics are gathered and exposed via the HTTP /metrics endpoint in Prometheus format. This endpoint can be scraped by a Prometheus server and the metric values can be used for defining alerts as well as being showed on Grafana dashboards. Information like the number of produced and consumed records, errors on sending or receiving records, producing and end to end latency are just a few of them.

...
# HELP strimzi_canary_records_produced_total The total number of records produced
# TYPE strimzi_canary_records_produced_total counter
strimzi_canary_records_produced_total{clientid="strimzi-canary-client",partition="0"} 1479
strimzi_canary_records_produced_total{clientid="strimzi-canary-client",partition="1"} 1479
strimzi_canary_records_produced_total{clientid="strimzi-canary-client",partition="2"} 1479
# HELP strimzi_canary_records_consumed_total The total number of records consumed
# TYPE strimzi_canary_records_consumed_total counter
strimzi_canary_records_consumed_total{clientid="strimzi-canary-client",partition="0"} 1479
strimzi_canary_records_consumed_total{clientid="strimzi-canary-client",partition="1"} 1479
strimzi_canary_records_consumed_total{clientid="strimzi-canary-client",partition="2"} 1479
...

Other interesting metrics are those providing latency histogram buckets in order to be able to process corresponding quantiles and define alerts on them. With Canary, you can configure the buckets for the Kafka cluster based on resource usage that might impact latency, such as CPU or memory. The Canary repository includes a Python script that can be used to calculate the average producer/consumer/connection latencies from Canary logs. This is enabled through the VERBOSITY_LOG_LEVEL environment variable. You can determine if the histogram buckets for the latency metrics are set optimally or extract those which are useful for your own Kafka cluster from the log.

...
I1107 15:13:26.820577       1 producer.go:86] Sending message: value={"producerId":"strimzi-canary-client","messageId":6718,"timestamp":1636298006820} on partition=0
I1107 15:13:28.042882       1 producer.go:100] Message sent: partition=0, offset=2252, duration=152 ms
I1107 15:13:28.042908       1 producer.go:86] Sending message: value={"producerId":"strimzi-canary-client","messageId":6719,"timestamp":1636298008042} on partition=1
I1107 15:13:28.043382       1 consumer.go:193] Message received: value={ProducerID:strimzi-canary-client, MessageID:6718, Timestamp:1636298006820}, partition=0, offset=2252, duration=223 ms
I1107 15:13:28.167955       1 producer.go:100] Message sent: partition=1, offset=2252, duration=125 ms
I1107 15:13:28.168024       1 producer.go:86] Sending message: value={"producerId":"strimzi-canary-client","messageId":6720,"timestamp":1636298008168} on partition=2
I1107 15:13:28.168741       1 consumer.go:193] Message received: value={ProducerID:strimzi-canary-client, MessageID:6719, Timestamp:1636298008042}, partition=1, offset=2252, duration=126 ms
I1107 15:13:28.421214       1 producer.go:100] Message sent: partition=2, offset=2252, duration=253 ms
...

As well as the useful /metrics endpoint, you can use the /status endpoint to expose aggregated metrics data in JSON format. This can be useful for automation where you don’t have the ability to scrape and decode the text Prometheus format, but your tool can easily understand JSON text. Right now the only data provided through this endpoint is the percentage of messages that are produced and consumed successfully by the Canary over a specific time window.

{
  "Consuming": {
    "TimeWindow": 300000,
    "Percentage": 100
  }
}

Everything described above is totally configurable using several environment variables, from the period at which messages are exchanged to the periodic connection check, from the histogram latency buckets to the time window for status data. Other than that, the Canary supports configuration for connecting to a cluster where TLS is enabled as well as different SASL authentication mechanisms like PLAIN and SCRAM-SHA. You can get more information in the “Configuration” section of the project landing page.

Allow the canary to fly in the mine

The Canary tool is not integrated within the Strimzi operator (yet); it could be a future improvement on the project in order to deploy it via a Kafka custom resource configuration. Right now the release-related installation files provide all you need to run the Canary on a Kubernetes cluster alongside your Kafka cluster. A Deployment file is available with a basic configuration to connect to a Kafka cluster without TLS support and exchange messages every 10 seconds. You’ll need to specify the bootstrap server address.

env:
  - name: KAFKA_BOOTSTRAP_SERVERS
    value: my-cluster-kafka-bootstrap:9092
  - name: RECONCILE_INTERVAL_MS
    value: "10000"
  - name: TLS_ENABLED
    value: "false"

Together with the deployment, the installation provides a Service for exposing the 8080 port that is used for reaching the HTTP endpoints corresponding to metrics, status, liveness, and readiness.

Conclusion

Monitoring a Kafka cluster by verifying that it’s behaving and working properly is really simple thanks to the Canary tool. Even when your cluster is not heavily used, it helps to check that everything is working fine for the time when the real applications start to use it. It is also useful for generating traffic for validating A/B deployments and upgrades.

The Canary has just hatched with its first release, so it is important for us to hear from the community about questions, issues, and any suggestion for improvements!

Any feedback and contribution are welcome, as always :)