
Troubleshooting Kafka cluster

Why do this

If Kafka monitoring reports a red status and processing components fail with Kafka errors (visible in the worker logs), follow this procedure.

What to do

1) Check the operating system health on all Kafka servers

Use your supervision system (such as Nagios), the Kibana instance dedicated to monitoring, or run the following commands directly on each server:

  • Free disk space
$ df -h
  • Used Memory
$ free -m
  • IO wait
$ top (then press 1)

In the top output, look at the per-CPU lines: the wa value is the share of time spent waiting for I/O.

At this step, everything must be in green status, with no large variation over the last 24 hours and no high load average.

If a partition is full or memory is running short, ask the Infra support team to check the capacity planning and fix the issue.
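
When many servers have to be checked, the disk check can be scripted. The following is only a minimal sketch, not part of the punchplatform tooling: the function name full_partitions and the 90% threshold are assumptions chosen for illustration. It reads `df -P` output (POSIX format, one file system per line) on standard input and prints the mount points whose usage exceeds the threshold.

```shell
# full_partitions MAX_PCT
# Hypothetical helper: reads `df -P` output on stdin and prints
# "<mount point> <usage>" for each file system above MAX_PCT percent.
# Assumption: mount points do not contain spaces.
full_partitions() {
    awk -v max="$1" '
    NR > 1 {                      # skip the df header line
        use = $5
        sub(/%/, "", use)         # "95%" -> "95"
        if (use + 0 > max + 0)
            print $6, $5
    }'
}
```

A possible call on a broker would be `df -P | full_partitions 90`; an empty output means no partition is above the threshold.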

2) Check the cluster health

From a topic availability point of view

$ punchplatform-kafka-topics.sh --describe [--kafkaCluster <cluster_name>]

The output must look like:

kafka topic for kafka cluster 'local'...
Topic:mytenant_apache PartitionCount:1    ReplicationFactor:2 Configs:
  Topic: mytenant_apache  Partition: 0    Leader: 1   Replicas: 1,2   Isr: 2,1
Topic:mytenant_ufw    PartitionCount:1    ReplicationFactor:2 Configs:
  Topic: mytenant_ufw Partition: 0    Leader: 2   Replicas: 2,3   Isr: 2,3

Explanations:

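There are 2 topics: mytenant_apache and mytenant_ufw.

The topic mytenant_apache has a single partition, with one replica on node 1 and another on node 2. The leader of the partition is node 1. The topic mytenant_ufw has a single partition, with one replica on node 2 and another on node 3. The leader of the partition is node 2. In both cases the Isr column (in-sync replicas) lists every replica, so the partitions are fully replicated.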

Common errors:

error 1

Output:

kafka topic for kafka cluster 'local'...
Topic:mytenant_apache PartitionCount:1    ReplicationFactor:2 Configs:
  Topic: mytenant_apache  Partition: 0    Leader: 1   Replicas: 1,2   Isr: 1

Error:

The topic is available but no longer replicated: only broker 1 is in the Isr list. Workaround: check the status of broker 2.

error 2

Output:

kafka topic for kafka cluster 'local'...
Topic:mytenant_apache PartitionCount:1    ReplicationFactor:2 Configs:
  Topic: mytenant_apache  Partition: 0    Leader: -1  Replicas: 1,2   Isr:

Error:

The topic is not available: the partition has no leader and no in-sync replica. This is an incident. Workaround: check the status of brokers 1 and 2.
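
The two error patterns above can be detected automatically from the describe output. The sketch below is an illustration only: the function name check_partitions and the status labels are assumptions, not part of the punchplatform tooling. It assumes the output layout shown above, where each partition line carries the Leader, Replicas and Isr fields.

```shell
# check_partitions
# Hypothetical helper: reads the describe output on stdin and prints
# "<topic> <partition> <status>" for each partition line.
# Status labels (OK, UNDER_REPLICATED, UNAVAILABLE) are illustrative.
check_partitions() {
    awk '
    $1 == "Topic:" && $3 == "Partition:" {
        leader = ""; replicas = ""; isr = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "Leader:")   leader   = $(i + 1)
            if ($i == "Replicas:") replicas = $(i + 1)
            if ($i == "Isr:")      isr      = $(i + 1)   # empty when no replica is in sync
        }
        n_rep = split(replicas, r, ",")
        n_isr = split(isr, s, ",")
        if (leader == "-1" || n_isr == 0)
            status = "UNAVAILABLE"          # error 2: no leader, empty Isr
        else if (n_isr < n_rep)
            status = "UNDER_REPLICATED"     # error 1: Isr shorter than Replicas
        else
            status = "OK"
        print $2, $4, status
    }'
}
```

It could be fed with the command from this step, for example `punchplatform-kafka-topics.sh --describe | check_partitions`. Any line that is not OK points to the brokers listed in Replicas but missing from Isr.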

From a leader point of view

loic@server4:~/pp-conf-livedemo$ punchplatform-zookeeper-console.sh
Connecting to server4:2181
2018-03-05 15:11:28,556 [myid:] - INFO  [main:Environment@100] - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-03-05 15:11:28,563 [myid:] - INFO  [main:Environment@100] - Client environment:host.name=server4
2018-03-05 15:11:28,564 [myid:] - INFO  [main:Environment@100] - Client environment:java.version=1.8.0_151
2018-03-05 15:11:28,567 [myid:] - INFO  [main:Environment@100] - Client environment:java.vendor=Oracle Corporation
2018-03-05 15:11:28,568 [myid:] - INFO  [main:Environment@100] - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-03-05 15:11:28,568 [myid:] - INFO  [main:Environment@100] - Client environment:java.class.path=/data/opt/zookeeper-3.4.10/bin/../build/classes:/data/opt/zookeeper-3.4.10/bin/../build/lib/*.jar:/data/opt/zookeeper-3.4.10/bin/../lib/slf4j-log4j12-1.6.1.jar:/data/opt/zookeeper-3.4.10/bin/../lib/slf4j-api-1.6.1.jar:/data/opt/zookeeper-3.4.10/bin/../lib/netty-3.10.5.Final.jar:/data/opt/zookeeper-3.4.10/bin/../lib/log4j-1.2.16.jar:/data/opt/zookeeper-3.4.10/bin/../lib/jline-0.9.94.jar:/data/opt/zookeeper-3.4.10/bin/../zookeeper-3.4.10.jar:/data/opt/zookeeper-3.4.10/bin/../src/java/lib/*.jar:/data/opt/zookeeper-3.4.10/bin/../conf:
2018-03-05 15:11:28,568 [myid:] - INFO  [main:Environment@100] - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-03-05 15:11:28,569 [myid:] - INFO  [main:Environment@100] - Client environment:java.io.tmpdir=/tmp
2018-03-05 15:11:28,569 [myid:] - INFO  [main:Environment@100] - Client environment:java.compiler=<NA>
2018-03-05 15:11:28,569 [myid:] - INFO  [main:Environment@100] - Client environment:os.name=Linux
2018-03-05 15:11:28,570 [myid:] - INFO  [main:Environment@100] - Client environment:os.arch=amd64
2018-03-05 15:11:28,570 [myid:] - INFO  [main:Environment@100] - Client environment:os.version=4.4.0-101-generic
2018-03-05 15:11:28,570 [myid:] - INFO  [main:Environment@100] - Client environment:user.name=loic
2018-03-05 15:11:28,570 [myid:] - INFO  [main:Environment@100] - Client environment:user.home=/home/loic
2018-03-05 15:11:28,571 [myid:] - INFO  [main:Environment@100] - Client environment:user.dir=/home/loic/pp-conf-livedemo
2018-03-05 15:11:28,574 [myid:] - INFO  [main:ZooKeeper@438] - Initiating client connection, connectString=server4:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@5387f9e0
Welcome to ZooKeeper!
2018-03-05 15:11:28,616 [myid:] - INFO  [main-SendThread(server4:2181):ClientCnxn$SendThread@1032] - Opening socket connection to server server4/217.182.66.64:2181. Will not attempt to authenticate using SASL (unknown error)
JLine support is enabled
2018-03-05 15:11:28,784 [myid:] - INFO  [main-SendThread(server4:2181):ClientCnxn$SendThread@876] - Socket connection established to server4/217.182.66.64:2181, initiating session
2018-03-05 15:11:28,802 [myid:] - INFO  [main-SendThread(server4:2181):ClientCnxn$SendThread@1299] - Session establishment complete on server server4/217.182.66.64:2181, sessionid = 0x160317a132800b9, negotiated timeout = 30000  

WATCHER::  

WatchedEvent state:SyncConnected type:None path:null
[zk: server4:2181(CONNECTED) 0] ls /punchplatform-l  

punchplatform-log-injector   punchplatform-livedemo
[zk: server4:2181(CONNECTED) 0] ls /punchplatform-livedemo/  

spark-2.2.0-main   kafka-consumer     storm-1.1.1-main   admin              shiva              kafka-local        plan
[zk: server4:2181(CONNECTED) 0] ls /punchplatform-livedemo/kafka-  

kafka-consumer   kafka-local
[zk: server4:2181(CONNECTED) 0] ls /punchplatform-livedemo/kafka-local/  

cluster                    controller_epoch           controller                 brokers                    admin                      isr_change_notification
consumers                  latest_producer_id_block   config
[zk: server4:2181(CONNECTED) 0] ls /punchplatform-livedemo/kafka-local/brokers/ids
[1, 2, 3]

The last line is the important one.

Explanations:

The Kafka cluster contains 3 nodes, and all of them are present in the cluster.

Common errors:

[zk: server4:2181(CONNECTED) 0] ls /punchplatform-livedemo/kafka-local/brokers/ids
[1, 3]

Here broker 2 is no longer registered in the cluster. Collect the logs from /var/log/punchplatform/kafka/ so that you can send them to the punchplatform N3 support team after the incident.
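
Archiving the log directory before any restart keeps the evidence intact. This is a minimal sketch under assumptions: the function name archive_kafka_logs, the archive naming scheme and the destination directory are all hypothetical, and only the source path comes from this procedure.

```shell
# archive_kafka_logs SRC_DIR DEST_DIR
# Hypothetical helper: packs SRC_DIR into a timestamped tar.gz under
# DEST_DIR and prints the archive path on success.
archive_kafka_logs() {
    src=$1
    dest=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest/kafka-logs-$(uname -n)-$stamp.tar.gz"
    # -C keeps paths inside the archive relative to the parent directory
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" \
        && printf '%s\n' "$archive"
}
```

For example, `archive_kafka_logs /var/log/punchplatform/kafka /tmp` would write the archive under /tmp, assuming that directory has enough free space.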

Then restart broker 2.

To find out which server is broker 2, check the kafka section of punchplatform.properties: the servers are listed in order.

Do not restart a broker if you are not sure about the root cause: you could lose logs, possibly many.
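
The comparison between the expected brokers and the ids listed in ZooKeeper can also be scripted. The sketch below is illustrative: the function name missing_brokers is an assumption, and the list of expected ids has to be supplied by the operator. It takes the bracketed list exactly as printed by the `ls .../brokers/ids` command.

```shell
# missing_brokers "EXPECTED_IDS" "ZK_LIST"
# Hypothetical helper: EXPECTED_IDS is a space-separated list such as "1 2 3",
# ZK_LIST is the bracketed list printed by ZooKeeper, such as "[1, 3]".
# Prints one line per expected broker id that is absent from ZK_LIST.
missing_brokers() {
    expected=$1
    # turn "[1, 3]" into " 1  3 " so ids can be matched as whole words
    actual=$(printf '%s' "$2" | sed 's/[][,]/ /g')
    for id in $expected; do
        case " $actual " in
            *" $id "*) ;;                   # broker is registered
            *) printf '%s\n' "$id" ;;       # broker is missing
        esac
    done
}
```

With the error output above, `missing_brokers "1 2 3" "[1, 3]"` prints 2, which is the broker to investigate.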

3) Finally:

Send the logs and a short explanation of your incident to the punchplatform N3 support team. We will be happy to improve the platform configuration.