Troubleshooting under-replication in an Elasticsearch cluster

Why do that

Because either the PunchPlatform has reported a health condition for an online index, or the Elasticsearch REST API has been queried directly and reported a YELLOW health for an Elasticsearch index.

What to do

Check the Nagios status of the other servers of the same cluster (one of them may be down, in which case the problem is found).
If all servers of the cluster are UP, check the dashboard of the servers in this ES group to detect past downtime or CPU/memory overload or breakage.
[ Check in the PP HMI which indices are in danger ]
If needed, use the ES REST API to get the status of the servers in the cluster from the ES point of view.
For nodes absent from the cluster, check the Grafana dashboard and the Nagios status to determine whether a resource is critical (disk?).
If all nodes are present in the cluster and nothing seems wrong at system level, this may be a partial storage failure: check the disk array stats for failed disks and/or failure indicators.
Independently of the failure drill-down, use the ES REST API to track the recovery process of under-replicated indices, and ensure that, in the end, all indices are back to GREEN status.
After the missing nodes have been repaired/restarted, check that they are back in the Elasticsearch cluster, using the Elasticsearch REST API (see HOWTO_check_nodes_list_and_nodes_status_in_an_elasticsearch_cluster).
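
The REST API checks in the steps above can be sketched in Python. This is a minimal sketch, not the PunchPlatform tooling: it assumes that one node of the cluster is reachable over plain HTTP at http://localhost:9200 without authentication, and the names ES_URL, es_get, node_names, degraded_indices and report are illustrative only. It relies on two standard Elasticsearch endpoints, _cluster/health (with level=indices) and _cat/nodes.

```python
import json
from urllib.request import urlopen

# Assumption: address of any reachable node of the cluster; adjust to your deployment.
ES_URL = "http://localhost:9200"


def es_get(path, base_url=ES_URL):
    """Run a GET on the Elasticsearch REST API and decode the JSON body."""
    with urlopen(base_url + path, timeout=10) as response:
        return json.load(response)


def node_names(cat_nodes):
    """Sorted names of the nodes that Elasticsearch currently sees in the cluster."""
    return sorted(node["name"] for node in cat_nodes)


def degraded_indices(health):
    """Map each non-green index to (status, number of unassigned shards)."""
    return {
        name: (info["status"], info["unassigned_shards"])
        for name, info in health.get("indices", {}).items()
        if info["status"] != "green"
    }


def report(base_url=ES_URL):
    """Print the cluster status, the nodes present and the degraded indices."""
    health = es_get("/_cluster/health?level=indices", base_url)
    nodes = es_get("/_cat/nodes?format=json&h=name,ip", base_url)
    print("cluster status:", health["status"])
    print("nodes in cluster:", node_names(nodes))
    for name, (status, unassigned) in sorted(degraded_indices(health).items()):
        print(f"{name}: {status}, {unassigned} unassigned shard(s)")
```

Calling report() before and after repairing a node shows whether the node has rejoined the cluster and which indices are still YELLOW or RED. Recovery can be considered finished when the cluster status is green and no degraded index is listed.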