When running applications on OpenShift or Kubernetes, different pods get colocated on the same nodes. That is very useful, because it helps to decrease costs by better utilizing the available hardware. But it can also cause some issues. One of them is the noisy neighbour phenomenon. This is a situation where multiple applications that make intensive use of the same resource - for example network bandwidth or disk I/O - are scheduled on the same node. How do you avoid such problems when running Apache Kafka on OpenShift or Kubernetes with Strimzi? And how do you make sure your Apache Kafka cluster delivers the best performance?
Most Apache Kafka users want the best possible performance from their clusters. Good performance does not only mean that most messages are delivered with very low latency or very high throughput. It also means that the performance of the cluster is consistent over time, without any significant spikes. Noisy neighbours are just one possible cause of performance issues.
There are several ways of getting the best out of Apache Kafka:
- Make sure Kafka pods are not scheduled on the same node as other performance intensive applications
- Make sure Kafka pods are scheduled on the nodes with the most suitable hardware
- Use nodes which are dedicated to Kafka only
Since the 0.5.0 release, Strimzi supports all of these. This blog post will show you how to do this.
Pod anti-affinity
Pod affinity and anti-affinity is a Kubernetes feature which allows you to specify constraints for pod scheduling. It lets users specify whether a pod should be scheduled on the same node as another pod (affinity) or on a different node (anti-affinity). The pods which should be included in the scheduling constraint are specified using a label selector. The scheduling can be either preferred or required. Preferred scheduling defines a soft constraint - it will give you only a best-effort guarantee. When there is no way to schedule the pod according to the constraint, the pod will be scheduled in a way which doesn’t match the constraint. Required scheduling is a hard constraint. If the constraint cannot be met, the pod will not be scheduled at all and your cluster will be missing some pods.
Strimzi supports pod affinity for Kafka, Zookeeper, Kafka Connect and Topic Operator.
You can use it to specify the pods which should never run on the same node as Kafka pods.
Affinity can be specified in the Custom Resources (CR) under the affinity property. The typical workloads which you want to avoid sharing a node with are other applications which are network or disk I/O intensive, such as databases. The example below shows a Kafka resource with pod anti-affinity specified. It will require Kubernetes to schedule the Kafka pods on nodes where there are no other pods with the labels application=postgresql or application=mongodb. The rule can be easily adapted to include other applications or different labels. The topologyKey field specifies that the pods matching the selector should not share the same hostname - which means they will be scheduled on different nodes.
apiVersion: kafka.strimzi.io/v1alpha1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    ...
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: application
                  operator: In
                  values:
                    - postgresql
                    - mongodb
            topologyKey: "kubernetes.io/hostname"
    ...
  zookeeper:
    ...
Since the example uses the requiredDuringSchedulingIgnoredDuringExecution constraint (required scheduling), you have to make sure that your cluster has enough different nodes to accommodate all pods with different scheduling requirements. Alternatively, you can switch to the preferredDuringSchedulingIgnoredDuringExecution constraint, which allows Kubernetes more flexibility when scheduling the pods.
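For illustration, the preferred (soft) variant of the same rule could look roughly like the snippet below, nested under spec.kafka just like in the example above. This is only a sketch - the weight is an arbitrary example value (Kubernetes accepts 1 to 100):
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: application
                operator: In
                values:
                  - postgresql
                  - mongodb
          topologyKey: "kubernetes.io/hostname"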
Node affinity
Not all nodes are born equal. It is quite common that a big Kubernetes or OpenShift cluster consists of many different types of nodes. Some are optimized for CPU heavy workloads, some for memory, while others might be optimized for storage (fast local SSDs) or network. Using different nodes helps to optimize both costs and performance. But as a user of such a heterogeneous cluster you need to be able to schedule your workloads onto the right nodes.
To schedule workloads onto specific nodes, Kubernetes has a feature called node affinity. Node affinity allows you to create a scheduling constraint for the node on which the pod will be scheduled. The constraint is once again specified as a label selector. You can either use built-in node labels such as beta.kubernetes.io/instance-type, or you can use your own labels to select the right node.
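For instance, a node affinity term which selects only a particular instance type could look roughly like this standalone fragment - m5.2xlarge is just an illustrative instance type, not something taken from the cluster used later in this post:
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: beta.kubernetes.io/instance-type
            operator: In
            values:
              - m5.2xlarge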
Node affinity can also be specified in the affinity field of our custom resources for Kafka, Zookeeper and Kafka Connect. The example below shows a configuration which will ensure that Kafka pods are scheduled only on nodes whose node-type label is equal to fast-network.
apiVersion: kafka.strimzi.io/v1alpha1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    ...
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-type
                  operator: In
                  values:
                    - fast-network
    ...
  zookeeper:
    ...
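If your nodes do not have such a label yet, you can add it yourself with kubectl - the node name and label value here are just examples:
$ kubectl label nodes ip-10-0-1-115.ec2.internal node-type=fast-network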
When needed, you can also combine node affinity together with pod affinity:
apiVersion: kafka.strimzi.io/v1alpha1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    ...
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-type
                  operator: In
                  values:
                    - fast-network
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: application
                  operator: In
                  values:
                    - postgresql
                    - mongodb
            topologyKey: "kubernetes.io/hostname"
    ...
  zookeeper:
    ...
Dedicated nodes
Pod affinity is a great way to influence pod scheduling, but it is not enough to entirely prevent the noisy neighbour problem. With dedicated nodes you can make sure that only Kafka pods and system services such as log collectors or software defined networks will be sharing a node. There will be no other pods scheduled on such a machine which could affect or disturb the performance of the Kafka brokers.
To create a dedicated node, you have to first taint it.
Taints are a feature of nodes which can be used to repel pods.
Only pods which tolerate a given taint can be scheduled on such nodes.
The nodes can be tainted using the kubectl tool:
$ kubectl taint nodes ip-10-0-0-124.ec2.internal dedicated=Kafka:NoSchedule
node "ip-10-0-0-124.ec2.internal" tainted
A taint always has a key, a value and an effect. In the example above, the key is dedicated, the value is Kafka and the effect is NoSchedule. Setting this taint will make sure that regular pods will not be scheduled on the tainted node.
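As an aside, should you ever need to undo this, the taint can be removed again by running the same kubectl taint command with a trailing minus sign:
$ kubectl taint nodes ip-10-0-0-124.ec2.internal dedicated=Kafka:NoSchedule-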
If you have any cluster-wide services running as pods (such as Fluentd DaemonSets for collecting logs) you have to make sure that they will tolerate the taint as well - see the sketch below. Otherwise you will have no logs from your Kafka pods.
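A minimal sketch of what such a toleration could look like in the pod template of a hypothetical logging DaemonSet (the DaemonSet itself is not part of Strimzi; only the tolerations section matters here):
spec:
  template:
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "Kafka"
          effect: "NoSchedule"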
The toleration for the Kafka pods is configured in the custom resource under the tolerations property. It is supported for Kafka, Zookeeper and Kafka Connect.
But setting the toleration is not enough. A toleration tells Kubernetes that the pod can be scheduled on the tainted node. It does not prevent the pod from being scheduled anywhere else. So we need to combine the toleration with node affinity to make sure that the pods will be scheduled only on the dedicated nodes. To do that we need to label the dedicated node as well, so that we can use the label in the node affinity selector:
$ kubectl label nodes ip-10-0-0-124.ec2.internal dedicated=Kafka
node "ip-10-0-0-124.ec2.internal" labeled
Once the taint and the label are set, we can deploy the Kafka cluster. The example below shows a Kafka custom resource which configures the matching toleration and node affinity.
apiVersion: kafka.strimzi.io/v1alpha1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    ...
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "Kafka"
        effect: "NoSchedule"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: dedicated
                  operator: In
                  values:
                    - Kafka
    ...
  zookeeper:
    ...
This will make sure that only the Kafka pods will be running on the dedicated nodes.
Practical example
To show in detail how the dedicated nodes work, I deployed a 3-node Kafka cluster into my Kubernetes cluster running in AWS. My Kubernetes cluster had 1 master node and 6 worker nodes:
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-0-124.ec2.internal Ready <none> 7m v1.10.5 34.238.153.57 CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://18.6.0
ip-10-0-0-236.ec2.internal Ready <none> 33s v1.10.5 54.152.210.78 CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://18.6.0
ip-10-0-0-60.ec2.internal Ready master 7m v1.10.5 35.171.124.109 CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://18.6.0
ip-10-0-1-115.ec2.internal Ready <none> 7m v1.10.5 107.23.251.223 CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://18.6.0
ip-10-0-1-18.ec2.internal Ready <none> 35s v1.10.5 54.152.8.252 CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://18.6.0
ip-10-0-2-31.ec2.internal Ready <none> 7m v1.10.5 184.72.149.131 CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://18.6.0
ip-10-0-2-61.ec2.internal Ready <none> 55s v1.10.5 54.209.246.85 CentOS Linux 7 (Core) 3.10.0-862.3.2.el7.x86_64 docker://18.6.0
I picked 3 nodes on which I set the taints and labels:
$ kubectl taint nodes ip-10-0-0-124.ec2.internal dedicated=Kafka:NoSchedule
node "ip-10-0-0-124.ec2.internal" tainted
$ kubectl taint nodes ip-10-0-1-115.ec2.internal dedicated=Kafka:NoSchedule
node "ip-10-0-1-115.ec2.internal" tainted
$ kubectl taint nodes ip-10-0-2-31.ec2.internal dedicated=Kafka:NoSchedule
node "ip-10-0-2-31.ec2.internal" tainted
$ kubectl label nodes ip-10-0-0-124.ec2.internal dedicated=Kafka
node "ip-10-0-0-124.ec2.internal" labeled
$ kubectl label nodes ip-10-0-1-115.ec2.internal dedicated=Kafka
node "ip-10-0-1-115.ec2.internal" labeled
$ kubectl label nodes ip-10-0-2-31.ec2.internal dedicated=Kafka
node "ip-10-0-2-31.ec2.internal" labeled
As a result, the cluster has 3 dedicated nodes for Kafka and 3 nodes for Zookeeper, the Strimzi operators and Kafka clients. Now we can deploy the Strimzi Cluster Operator, which gets scheduled on one of the nodes which are not dedicated to Kafka:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
strimzi-cluster-operator-586d499cd7-bzqsg 1/1 Running 0 1m 192.168.29.1 ip-10-0-2-61.ec2.internal
With the Cluster Operator running, Kafka can be deployed using the following resource:
apiVersion: kafka.strimzi.io/v1alpha1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    readinessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    livenessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "Kafka"
        effect: "NoSchedule"
    storage:
      type: ephemeral
    rack:
      topologyKey: "failure-domain.beta.kubernetes.io/zone"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: dedicated
                  operator: In
                  values:
                    - Kafka
  zookeeper:
    replicas: 3
    readinessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    livenessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
    storage:
      type: ephemeral
  topicOperator: {}
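Assuming the resource above is saved to a file - kafka-dedicated.yaml is just an example name - it can be applied with kubectl:
$ kubectl apply -f kafka-dedicated.yaml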
The Cluster Operator will deploy the StatefulSets for Zookeeper and Kafka as well as the Topic Operator. But only the Kafka pods will be scheduled on the dedicated nodes:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
my-cluster-kafka-0 2/2 Running 0 2m 192.168.158.65 ip-10-0-1-115.ec2.internal
my-cluster-kafka-1 2/2 Running 0 2m 192.168.221.65 ip-10-0-2-31.ec2.internal
my-cluster-kafka-2 2/2 Running 0 2m 192.168.226.67 ip-10-0-0-124.ec2.internal
my-cluster-topic-operator-fb6cb47d-qqjrz 2/2 Running 0 1m 192.168.180.194 ip-10-0-1-18.ec2.internal
my-cluster-zookeeper-0 2/2 Running 0 2m 192.168.180.193 ip-10-0-1-18.ec2.internal
my-cluster-zookeeper-1 2/2 Running 0 2m 192.168.17.65 ip-10-0-0-236.ec2.internal
my-cluster-zookeeper-2 2/2 Running 0 2m 192.168.29.2 ip-10-0-2-61.ec2.internal
strimzi-cluster-operator-586d499cd7-bzqsg 1/1 Running 0 9m 192.168.29.1 ip-10-0-2-61.ec2.internal
Conclusion
In this blog post we’ve explained what the noisy neighbour problem is and how the pod affinity features of Kubernetes and OpenShift can be used to mitigate its effects. Pod affinity alone is not enough to eliminate noisy neighbours. To do that you have to use taints and tolerations to create a dedicated set of nodes for running your workload. Since Strimzi 0.5.0 it’s possible to use these features with your Kafka and Kafka Connect clusters to get the best performance from the hardware available in your Kubernetes or OpenShift cluster.