Troubleshooting unavailability of a whole Elasticsearch cluster

Why do that

The PunchPlatform admin GUI reports a RED status for an Elasticsearch cluster and displays one or more alerts indicating that the cluster (or its REST API) is unavailable.

What to do

If Nagios reports system or service failures on the cluster, investigate from there (overall physical failure? storage failure?).
If Nagios shows no failure on the cluster nodes or their network interfaces, the problem may lie at the load-balancer or network level:
- Connect through SSH to any node of the cluster
- Run curl <production interface IP>:9200/_cat/nodes. The response should list at least the local node as a cluster member, and all the other nodes if the cluster is in fact working
  - If only one node is reported, use the network diagnosis procedure to check network flow between the production interfaces of the ES cluster nodes
  - If all nodes are reported, the problem is at the load-balancer level (either the load balancer is down, or the network is cut above or below the load balancers)
  - If curl gets no response, check the status of the elasticsearch system service (sudo supervisorctl status), check system resources (mount points, disk space), and try restarting the service or the node. Finally, check the cluster logs, located on the server at /var/log/punchplatform/elasticsearch/<cluster_name>.log, for further details
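
The decision steps above can be sketched as a small shell helper. This is only an illustration, not part of PunchPlatform: the function name diagnose_cat_nodes, the ES_IP variable and the expected node count argument are hypothetical, and the "partial" branch covers a case the procedure above does not describe. It relies on _cat/nodes printing one line per cluster member.

```shell
#!/bin/sh
# Hypothetical helper: map the output of `_cat/nodes` to the next
# troubleshooting step described in this procedure.
#   $1: raw response body from curl (empty if curl got no response)
#   $2: number of nodes the cluster is expected to have (assumption:
#       the operator knows this value)
diagnose_cat_nodes() {
    response=$1
    expected=$2
    # Count non-blank lines; `|| true` because grep exits 1 on zero matches.
    seen=$(printf '%s\n' "$response" | grep -c '[^[:space:]]' || true)

    if [ "$seen" -eq 0 ]; then
        echo "no-response: check the elasticsearch service, system resources and logs"
    elif [ "$seen" -ge "$expected" ]; then
        echo "all-nodes: suspect the load balancer or the network around it"
    elif [ "$seen" -eq 1 ]; then
        echo "single-node: check network flow between production interfaces"
    else
        # Not covered by the procedure above; reported as-is for the operator.
        echo "partial: $seen of $expected nodes reported"
    fi
}

# Example call, assuming ES_IP holds the production interface IP
# and the cluster has 3 nodes:
#   diagnose_cat_nodes "$(curl -s --max-time 5 "$ES_IP:9200/_cat/nodes")" 3
```

The helper only classifies the response; the follow-up checks (network diagnosis, supervisorctl, logs) remain manual steps.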