Troubleshooting under-replication in an Elasticsearch cluster
Why do that
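Because either the PunchPlatform has reported a health condition on an online index, or the Elasticsearch REST API has been queried directly and has reported a YELLOW health for an Elasticsearch index. YELLOW means that all primary shards are allocated but at least one replica shard is not, so the index is under-replicated.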
What to do
- Check the Nagios status of the other servers in the same cluster (one of them may be down, in which case the problem is found).
- If all servers of the cluster are UP, check the dashboards of the servers in this ES group to detect past downtime or a CPU/memory overload or failure.
- [ Check in the PP HMI which indices are in danger ]
- If needed, use the ES REST API to get the status of the servers in the cluster from the ES point of view.
- For nodes absent from the cluster, check the Grafana dashboards and the Nagios status to determine whether a resource is critical (disk?).
- If all nodes are present in the cluster and nothing seems wrong at system level, this may be a partial storage failure (check the disk array stats for failed disks and/or failure indicators).
- (Independently of the failure drill-down) Use the ES REST API to track the recovery process of the under-replicated indices, and ensure that in the end all indices are back to GREEN status.
- After the missing nodes have been repaired/restarted, check that they are back in the Elasticsearch cluster, using the Elasticsearch REST API (see HOWTO_check_nodes_list_and_nodes_status_in_an_elasticsearch_cluster).
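
The REST API checks mentioned in the steps above could be scripted roughly as follows. This is only a minimal sketch, not an official PunchPlatform tool: the `ES_URL` address and the helper names (`es_get`, `under_replicated`, `active_recoveries`, `report`) are assumptions made for illustration, and it assumes an Elasticsearch version whose `_cat` endpoints accept `format=json` and a cluster reachable without authentication.

```python
import json
import urllib.request

# Assumption: address of any reachable node of the cluster; adjust to your deployment.
ES_URL = "http://localhost:9200"


def es_get(path, base_url=ES_URL, timeout=10):
    """GET an Elasticsearch REST endpoint and return the decoded JSON body."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return json.load(resp)


def under_replicated(indices):
    """Return the sorted names of the indices that are not green (_cat/indices rows)."""
    return sorted(row["index"] for row in indices if row["health"] != "green")


def active_recoveries(recoveries):
    """Return the shard recoveries that are not finished yet (_cat/recovery rows)."""
    return [row for row in recoveries if row["stage"] != "done"]


def report(base_url=ES_URL):
    """Print the cluster health, the nodes seen by ES, and the indices still recovering."""
    health = es_get("/_cluster/health", base_url)
    print("cluster status:", health["status"],
          "| nodes:", health["number_of_nodes"],
          "| unassigned shards:", health["unassigned_shards"])

    # Nodes currently part of the cluster, from the ES point of view.
    for node in es_get("/_cat/nodes?format=json&h=name,ip", base_url):
        print("node in cluster:", node["name"], node["ip"])

    # Indices that are still YELLOW or RED.
    indices = es_get("/_cat/indices?format=json&h=index,health", base_url)
    print("indices not green:", under_replicated(indices))

    # Shard recoveries still in progress.
    recoveries = es_get("/_cat/recovery?format=json&h=index,shard,stage", base_url)
    for rec in active_recoveries(recoveries):
        print("recovering:", rec["index"], "shard", rec["shard"], "stage", rec["stage"])
```

Calling `report()` repeatedly while the cluster recovers should show the list of non-green indices shrinking until it is empty; a node that was repaired should reappear in the node list.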