Troubleshooting Kafka cluster and brokers health¶

Why do that¶

If Kafka monitoring reports a Red status and processing components fail with kafka errors (visible in the worker logs), follow this procedure.

What to do¶

1) Check the operating system health on all kafka servers¶

Use your supervision system (such as Nagios), the Kibana instance dedicated to monitoring, or run the following commands directly on each server:

Free disk space (on kafka brokers machines)

df -h /data/kafka

Used Memory

free -m

IO wait

top (then press 1)

Look at the wa (IO wait) value on each %Cpu line.

At this step, every indicator must be Green, with no large variation over the last 24 hours and no high load average.

If a partition is full or memory is lacking, ask the Infra support team to check capacity planning and fix the issue.
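
The checks above can be scripted. The following is a minimal sketch, not a punchplatform tool: the function names and the 85 % limit used in the example are assumptions chosen for illustration, and the parsers expect the output of df -P and free -m on standard input.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not part of punchplatform).

# Reads `df -P <path>` output on stdin, prints the used percentage.
disk_used_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Reads `free -m` output on stdin, prints used memory as a percentage of total.
mem_used_pct() {
    awk '/^Mem:/ { printf "%d\n", $3 * 100 / $2 }'
}

# report <name> <value> <limit>: prints ALERT when value >= limit, else OK.
report() {
    if [ "$2" -ge "$3" ]; then
        echo "ALERT $1=$2% (limit $3%)"
    else
        echo "OK $1=$2%"
    fi
}
```

For example, report disk "$(df -P /data/kafka | disk_used_pct)" 85 prints an ALERT line once the kafka data partition is at least 85 % full. Pick limits that match your own capacity planning.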

2) Check the Cluster health and brokers health/replication status¶

From a topic availability point of view¶

punchplatform-kafka-topics.sh --describe [--kafkaCluster <cluster_name>]

The output must look like:

kafka topic for kafka cluster 'local'...
Topic:mytenant_apache PartitionCount:1    ReplicationFactor:2 Configs:
  Topic: mytenant_apache  Partition: 0    Leader: 1   Replicas: 1,2   Isr: 2,1
Topic:mytenant_ufw    PartitionCount:1    ReplicationFactor:2 Configs:
  Topic: mytenant_ufw Partition: 0    Leader: 2   Replicas: 2,3   Isr: 2,3

Explanations:

There are 2 topics: mytenant_apache and mytenant_ufw

The single partition of mytenant_apache has one replica on broker 1 and another on broker 2; its leader is broker 1. The single partition of mytenant_ufw has replicas on brokers 2 and 3; its leader is broker 2. In both cases the Isr (in-sync replicas) list contains every replica, so replication is healthy.

Shortcut for finding unhealthy partitions/topics

To see only unhealthy partitions/topics, add one of the following parameters to the punchplatform-kafka-topics.sh --kafkaCluster <myCluster> --describe command:

--under-replicated-partitions shows only partitions that are available for reading (and for writing, without full replication) but are not replicated as the configuration requires, meaning that a broker is unhealthy.
--unavailable-partitions shows only partitions that are not available for reading or writing, because no leader is elected among the available brokers, which probably means that one or more brokers are unavailable.

Frequently encountered error situations:

error 1

Output:

kafka topic for kafka cluster 'local'...
Topic:mytenant_apache PartitionCount:1    ReplicationFactor:2 Configs:
  Topic: mytenant_apache  Partition: 0    Leader: 1   Replicas: 1,2   Isr: 1

Error:

The topic is available but not replicated: broker 2 is missing from the Isr list. Workaround: check the status of broker 2.

error 2

Output:

kafka topic for kafka cluster 'local'...
Topic:mytenant_apache PartitionCount:1    ReplicationFactor:2 Configs:
  Topic: mytenant_apache  Partition: 0    Leader: -1  Replicas: 1,2   Isr:

Error:

The topic is not available: no leader is elected and the Isr list is empty. This is an incident. Workaround: check the status of brokers 1 and 2.
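
The two error situations above can be spotted mechanically by comparing the Replicas and Isr columns and by looking for a leader of -1. The sketch below is only an illustration: check_partitions is a hypothetical helper name, and it assumes the --describe output format shown on this page, read from standard input.

```shell
#!/bin/sh
# Hypothetical helper: reads `--describe` output on stdin and prints one
# line per unhealthy partition. Healthy partitions produce no output.
check_partitions() {
    awk '
    / Partition: / {
        topic = ""; part = ""; leader = ""; replicas = ""; isr = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "Topic:")     topic    = $(i + 1)
            if ($i == "Partition:") part     = $(i + 1)
            if ($i == "Leader:")    leader   = $(i + 1)
            if ($i == "Replicas:")  replicas = $(i + 1)
            if ($i == "Isr:")       isr      = $(i + 1)
        }
        n_rep = split(replicas, r, ",")   # number of configured replicas
        n_isr = split(isr, s, ",")        # number of in-sync replicas (0 if empty)
        if (leader == "-1")
            state = "UNAVAILABLE"         # no leader elected (error 2)
        else if (n_isr < n_rep)
            state = "UNDER_REPLICATED"    # a replica is out of sync (error 1)
        else
            next
        printf "%s %s partition=%s leader=%s replicas=%s isr=%s\n", \
               state, topic, part, leader, replicas, isr
    }'
}
```

It can be fed directly from the command, for example punchplatform-kafka-topics.sh --describe | check_partitions. An empty result means that every partition has a leader and a complete Isr list.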

From a leader point of view (viewing live brokers, and how to contact them)¶

The cluster management information is stored in zookeeper.

Using the zookeeper console tool, you can check which brokers declare themselves alive, and which network address/interface can be used to connect to them, through the brokers/ids zookeeper node:

loic@server4:~/pp-conf-livedemo$ punchplatform-zookeeper-console.sh
Connecting to server4:2181
2018-03-05 15:11:28,556 [myid:] - INFO  [main:Environment@100] - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-03-05 15:11:28,563 [myid:] - INFO  [main:Environment@100] - Client environment:host.name=server4
2018-03-05 15:11:28,564 [myid:] - INFO  [main:Environment@100] - Client environment:java.version=1.8.0_151
2018-03-05 15:11:28,567 [myid:] - INFO  [main:Environment@100] - Client environment:java.vendor=Oracle Corporation
2018-03-05 15:11:28,568 [myid:] - INFO  [main:Environment@100] - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-03-05 15:11:28,568 [myid:] - INFO  [main:Environment@100] - Client environment:java.class.path=/data/opt/zookeeper-3.4.10/bin/../build/classes:/data/opt/zookeeper-3.4.10/bin/../build/lib/*.jar:/data/opt/zookeeper-3.4.10/bin/../lib/slf4j-log4j12-1.6.1.jar:/data/opt/zookeeper-3.4.10/bin/../lib/slf4j-api-1.6.1.jar:/data/opt/zookeeper-3.4.10/bin/../lib/netty-3.10.5.Final.jar:/data/opt/zookeeper-3.4.10/bin/../lib/log4j-1.2.16.jar:/data/opt/zookeeper-3.4.10/bin/../lib/jline-0.9.94.jar:/data/opt/zookeeper-3.4.10/bin/../zookeeper-3.4.10.jar:/data/opt/zookeeper-3.4.10/bin/../src/java/lib/*.jar:/data/opt/zookeeper-3.4.10/bin/../conf:
2018-03-05 15:11:28,568 [myid:] - INFO  [main:Environment@100] - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-03-05 15:11:28,569 [myid:] - INFO  [main:Environment@100] - Client environment:java.io.tmpdir=/tmp
2018-03-05 15:11:28,569 [myid:] - INFO  [main:Environment@100] - Client environment:java.compiler=<NA>
2018-03-05 15:11:28,569 [myid:] - INFO  [main:Environment@100] - Client environment:os.name=Linux
2018-03-05 15:11:28,570 [myid:] - INFO  [main:Environment@100] - Client environment:os.arch=amd64
2018-03-05 15:11:28,570 [myid:] - INFO  [main:Environment@100] - Client environment:os.version=4.4.0-101-generic
2018-03-05 15:11:28,570 [myid:] - INFO  [main:Environment@100] - Client environment:user.name=loic
2018-03-05 15:11:28,570 [myid:] - INFO  [main:Environment@100] - Client environment:user.home=/home/loic
2018-03-05 15:11:28,571 [myid:] - INFO  [main:Environment@100] - Client environment:user.dir=/home/loic/pp-conf-livedemo
2018-03-05 15:11:28,574 [myid:] - INFO  [main:ZooKeeper@438] - Initiating client connection, connectString=server4:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@5387f9e0
Welcome to ZooKeeper!
2018-03-05 15:11:28,616 [myid:] - INFO  [main-SendThread(server4:2181):ClientCnxn$SendThread@1032] - Opening socket connection to server server4/217.182.66.64:2181. Will not attempt to authenticate using SASL (unknown error)
JLine support is enabled
2018-03-05 15:11:28,784 [myid:] - INFO  [main-SendThread(server4:2181):ClientCnxn$SendThread@876] - Socket connection established to server4/217.182.66.64:2181, initiating session
2018-03-05 15:11:28,802 [myid:] - INFO  [main-SendThread(server4:2181):ClientCnxn$SendThread@1299] - Session establishment complete on server server4/217.182.66.64:2181, sessionid = 0x160317a132800b9, negotiated timeout = 30000  

WATCHER::  

WatchedEvent state:SyncConnected type:None path:null
[zk: server4:2181(CONNECTED) 0] ls /punchplatform-l  

punchplatform-log-injector   punchplatform-livedemo
[zk: server4:2181(CONNECTED) 0] ls /punchplatform-livedemo/  

spark-2.3.0-main   kafka-consumer     storm-1.1.1-main   admin              shiva              kafka-local        plan
[zk: server4:2181(CONNECTED) 0] ls /punchplatform-livedemo/kafka-  

kafka-consumer   kafka-local
[zk: server4:2181(CONNECTED) 0] ls /punchplatform-livedemo/kafka-local/  

cluster                    controller_epoch           controller                 brokers                    admin                      isr_change_notification
consumers                  latest_producer_id_block   config
[zk: server4:2181(CONNECTED) 0] ls /punchplatform-livedemo/kafka-local/brokers/ids
[1, 2, 3]

The last line is the important one.

Here 1, 2 and 3 are ephemeral zookeeper nodes written by the live brokers. If a broker is down or disconnected from the zookeeper cluster, its node disappears. The node name is the 'broker id' configured by the deployment settings.

Explanations:

The kafka cluster contains 3 nodes. All are present in the cluster.

Frequent error:

[zk: server4:2181(CONNECTED) 0] ls /punchplatform-livedemo/kafka-local/brokers/ids
[1, 3]

Here broker 2 is missing from the list.

Collect the logs in /var/log/punchplatform/kafka/ (to send them to the punchplatform support N3 team after the incident).

Then restart broker 2.

To find out which server is broker 2, check punchplatform.properties (kafka section). The list is an ordered list of servers.

Do not restart a broker if you are not sure about the root cause: you could lose logs (possibly many).
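
Comparing the expected broker ids with the list returned under brokers/ids can also be scripted. This is a minimal sketch: missing_brokers is a hypothetical helper name, the expected ids are supplied by you (from your deployment settings), and the live list is passed exactly as the zookeeper console prints it.

```shell
#!/bin/sh
# Hypothetical helper.
# missing_brokers "<expected ids, space separated>" "<live list, e.g. [1, 3]>"
# Prints each expected broker id that is absent from the live list.
missing_brokers() {
    expected=$1
    # Turn "[1, 3]" into " 1  3 " by replacing every non-digit with a space.
    live=$(printf '%s' "$2" | sed 's/[^0-9]/ /g')
    for id in $expected; do
        case " $live " in
            *" $id "*) ;;                  # broker is registered in zookeeper
            *) printf '%s\n' "$id" ;;      # broker is down or disconnected
        esac
    done
}
```

With the error output shown above, missing_brokers "1 2 3" "[1, 3]" prints 2, which is the broker to investigate.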

3) Troubleshooting problems to reach live brokers¶

Sometimes a node declares itself alive, but some client applications fail to reach the broker, which makes kafka unavailable from the client application or punchline point of view.

The data published by a broker can be read from the zookeeper console:

get /punchplatform-training-central/kafka-front/brokers/ids/1
{"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},"endpoints":["PLAINTEXT://tpfrkaf01:9092"],"jmx_port":-1,"host":"tpfrkaf01","timestamp":"1604774368206","port":9092,"version":4}

This gives the address and port used to reach the broker from other servers (here, the short name 'tpfrkaf01'). If the name and port published here are not reachable from another broker or from a kafka client node, a firewall rule or routing issue may exist at the network level, or the name may not be resolvable from the client machine.
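
To test this from a client machine, the published host and port can be extracted and probed. The sketch below is an illustration under assumptions: both function names are hypothetical, the sed expression relies on "host" appearing before "port" as in the example above, and the probe assumes that getent and a netcat supporting -z and -w are installed.

```shell
#!/bin/sh
# Hypothetical helpers, to be run on the client machine that cannot reach kafka.

# Reads the broker znode JSON on stdin, prints "<host> <port>".
broker_endpoint() {
    sed -n 's/.*"host":"\([^"]*\)".*"port":\([0-9][0-9]*\).*/\1 \2/p'
}

# check_endpoint <host> <port>: checks name resolution, then the TCP port.
check_endpoint() {
    if ! getent hosts "$1" >/dev/null; then
        echo "name not resolved: $1"
        return 1
    fi
    if ! nc -z -w 3 "$1" "$2"; then
        echo "port unreachable: $1:$2"
        return 1
    fi
    echo "reachable: $1:$2"
}
```

Paste the JSON returned by the get command into broker_endpoint, then call check_endpoint with the two values it prints. A "name not resolved" result points to DNS or /etc/hosts on the client; a "port unreachable" result points to a firewall or routing problem.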

4) Finally:¶

Send the logs and a short explanation of your incident to the punchplatform support N3 team. We will be happy to help with your incident, and to improve the platform documentation or tooling if needed.