HOW TO Check or troubleshoot zookeeper cluster and servers status¶

Why do that¶

After deployment, when monitoring reports a non-nominal zookeeper status, or when zookeeper-related errors appear in the logs of another component of the platform (kafka logs, storm logs, storm worker logs, shiva logs), you may want:

to check if the zookeeper cluster is available
to check the status of individual zookeeper servers
For server-level checks of the zookeeper daemon service, you need sudoer access on the zookeeper server.
For cluster-level troubleshooting, you need access to the Linux command line of a punchplatform operator account.

Prerequisites¶

Have access to a Linux server that is allowed to open network flows toward the zookeeper server ports (usually 2181, but check your punchplatform.properties). If possible, run the checks from one of the zookeeper nodes, so that you also verify that it can reach the other nodes of the cluster.

What to do¶

To remotely check the active status of an individual zookeeper server¶

Send it the built-in 'ruok' health command:

  # In this example, we suppose the zookeeper node is 'server1' and is supposed to be listening on port 2181.

  ( echo ruok  >&5 ; cat <&5 ; echo )  5<>/dev/tcp/server1/2181

This command will answer:

Nothing if the daemon accepts the connection but is not running in a non-error state.
'imok' if the daemon is active. (This does not mean that the overall cluster is available.)
'connection refused' if the machine/address is known and routing is correct, but either the port is wrong, or the daemon is not active, or some network equipment prevents the connection.
'invalid argument' if the machine name cannot be resolved (naming problem).
'no route to host' if the resolved address associated with the server1 name is not reachable through the configured network routes.
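
The check above can be wrapped in a small script when several servers have to be probed. This is a minimal sketch, not part of the punchplatform tooling: the function names 'zk_ruok' and 'zk_ruok_verdict' are hypothetical, the 5-second limit is an arbitrary choice, and it assumes bash (for /dev/tcp) and the GNU coreutils 'timeout' command are available.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not provided by zookeeper or punchplatform).

# Send 'ruok' to host $1, port $2 (default 2181) and print the raw reply.
# 'timeout 5' (arbitrary value) avoids hanging on a silent server.
zk_ruok() {
  local host="$1" port="${2:-2181}"
  timeout 5 bash -c '
    exec 5<>"/dev/tcp/$0/$1" || exit 1   # open the TCP connection
    printf ruok >&5                      # send the four-letter command
    cat <&5                              # print whatever the server answers
  ' "$host" "$port" 2>/dev/null
}

# Turn a raw reply into a short verdict.
zk_ruok_verdict() {
  case "$1" in
    imok) echo "active" ;;      # daemon is up (cluster may still be down)
    "")   echo "no-answer" ;;   # refused, unreachable, or daemon in error
    *)    echo "unexpected" ;;  # something else is listening on that port
  esac
}
```

For example, zk_ruok_verdict "$(zk_ruok server1 2181)" should print 'active' for a healthy server. Because error messages are discarded, a 'no-answer' verdict has to be investigated with the manual command to tell the failure cases apart.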

To check cluster status¶

Two ways to do that:

using the punchplatform-zookeeper-console.sh command from the punchplatform command-line operator environment (possibly with --cluster <myClusterId> if multiple clusters are defined in the platform deployment).

It will display 'connected' in prompt if the cluster is up and reachable.
using the built-in "four-letter word" commands:

Once you have found an active server (see previous part), send it the built-in 'stat' status command:
```
  # In this example, we assume the zookeeper node is 'server1' and is expected to listen on port 2181.
  # 'stat' is the zookeeper four-letter status command.

  ( echo stat  >&5 ; cat <&5 ; echo )  5<>/dev/tcp/server1/2181
```
If the cluster is not available (or cannot reach the 'server1' server because of a network connectivity problem), you will get: "This ZooKeeper instance is not currently serving requests".

If the cluster is available, you will get an answer like this:
```
    Zookeeper version: 3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
    Clients:
     /192.168.56.1:36964[0](queued=0,recved=1,sent=0)

    Latency min/avg/max: 0/0/0
    Received: 2
    Sent: 1
    Connections: 1
    Outstanding: 0
    Zxid: 0x100000008
    Mode: follower
    Node count: 95
```
The "Mode" line indicates 'leader' if you have queried the server that has been elected leader of the cluster.

It indicates 'standalone' if the server configuration specifies single-node mode (no servers list).
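
When the answer has to be read by a script, the "Mode" line can be extracted with awk. The following is a minimal sketch; the function name 'zk_mode_from_stat' is hypothetical, and it only assumes the answer layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a status answer on stdin and print the value of
# its "Mode:" line (leader, follower or standalone).
# Returns a non-zero code when no "Mode:" line is found, which is the case
# for the "not currently serving requests" answer.
zk_mode_from_stat() {
  awk -F': ' '
    /^[[:space:]]*Mode:/ { print $2; found = 1; exit }
    END { if (!found) exit 1 }
  '
}
```

It is meant to be fed with the output of the status command, for instance by piping the command shown earlier into zk_mode_from_stat.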

Why is my server not live ?¶

Check /var/log/punchplatform/zookeeper/zookeeper.log.

Most frequent reasons are :

service is not started (check with 'sudo systemctl status zookeeper', then start it with 'sudo systemctl start zookeeper').
zookeeper server has no 'myid' file in its data directory (e.g. /data/zookeeper/myid), or this file is empty. This file is supposed to contain the server id in the cluster (a number, starting at 1, without new line character in the file). The server id must match the number in the list of servers in the 'zoo.cfg' configuration of the zookeeper.
hostname of the server is inconsistent with the server name used in the zookeeper section of punchplatform.properties (a common source of deployment errors). For proper deployment, the names in the 'hosts' array of the zookeeper cluster in punchplatform.properties must EXACTLY match the result of executing the hostname command on the corresponding servers.
data partition is not available or is full (check the 'dataDir' value in the zoo.cfg file to find it).
not enough memory available to start zookeeper (check /var/log/syslog, and the result of 'free -h')
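
The 'myid' checks from the list above can be scripted. This is a minimal sketch under stated assumptions: the function name 'check_myid' is hypothetical, the paths passed to it are whatever your deployment uses, and it relies on the standard 'server.N=host:port:port' entry format of zoo.cfg.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate a myid file, optionally against a zoo.cfg.
#   $1 = path to the myid file (e.g. /data/zookeeper/myid)
#   $2 = optional path to zoo.cfg
# Prints "ok:<id>" on success, or a short reason and returns 1 on failure.
check_myid() {
  local myid_file="$1" zoo_cfg="${2:-}" id
  # -s is true only if the file exists and is not empty
  [ -s "$myid_file" ] || { echo "missing-or-empty"; return 1; }
  # Ignore surrounding whitespace, then require digits only
  id="$(tr -d '[:space:]' < "$myid_file")"
  case "$id" in
    ''|*[!0-9]*) echo "not-a-number"; return 1 ;;
  esac
  # The id must appear as a 'server.<id>=' entry in zoo.cfg
  if [ -n "$zoo_cfg" ] && ! grep -q "^server\.${id}=" "$zoo_cfg"; then
    echo "id-not-in-zoo-cfg"; return 1
  fi
  echo "ok:$id"
}
```

A typical call would be check_myid /data/zookeeper/myid /data/opt/zookeeper-3.4.10/conf/zoo.cfg, using the example paths of this page. The hostname consistency check is not covered here, because it depends on the content of punchplatform.properties.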

Why is my cluster not available ?¶

Most frequent reasons are :

not enough servers are alive: you need an absolute majority of live servers, counted against the list of servers in the configuration file of each of them. The configuration file is called 'zoo.cfg' and is located in the 'conf' subdirectory of the zookeeper setup, e.g. /data/opt/zookeeper-3.4.10/conf/zoo.cfg
servers cannot talk to each other:

Each server has the list of ALL servers of the cluster in its zoo.cfg file. It tries to reach the other servers using the address/host and ports given in each server entry of the list. If name resolution, address routing or firewalling issues prevent communication to these ports, the cluster will have trouble reaching nominal status.

To detect this, use the 'ruok' method described earlier in this document, from each of the servers toward each of the other servers.
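
The majority rule and the list of servers to probe can both be derived from zoo.cfg. The following is a minimal sketch: the three function names are hypothetical, and it assumes every 'server.N=' entry is a voting member (observers are not handled).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; every 'server.N=' entry is counted as a voter.

# Print the number of 'server.N=...' entries found in the zoo.cfg file $1.
zk_count_servers() {
  grep -c '^server\.[0-9][0-9]*=' "$1"
}

# Print the minimum number of live servers that forms an absolute majority
# of a cluster of $1 servers (integer division, then +1).
zk_quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Print the host part of each 'server.N=host:port:port' entry, one per line.
zk_list_hosts() {
  sed -n 's/^server\.[0-9][0-9]*=\([^:]*\):.*/\1/p' "$1"
}
```

For a cluster of 3 servers, zk_quorum_size 3 prints 2: the cluster stays available with one server down, but not with two. The hosts printed by zk_list_hosts are the ones to probe with the 'ruok' method from each node. The ports in these entries are the inter-server ports, not the client port used by 'ruok'.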