
Troubleshooting under-replication in an Elasticsearch cluster

Why do that

Because either PunchPlatform has reported a health issue for an online index, or the Elasticsearch REST API has been queried directly and reported a YELLOW health for an Elasticsearch index.

What to do

  • Check the Nagios status of the other servers in the same cluster (one of them may be down, in which case the problem is found)
  • If all servers of the cluster are UP, check the dashboard of the servers in this ES group (to detect past downtime or CPU/memory overload or failure)
  • [ Check the PP HMI to see which indices are in danger ]
  • If needed, use the ES REST API to get the status of the servers in the cluster from the ES point of view
  • For nodes absent from the cluster, check the Grafana dashboard and the Nagios status to determine whether a resource is critical (disk?)
  • If all nodes are present in the cluster and nothing seems wrong at system level, this may be a partial storage failure (check the disk array statistics for failed disks and/or failure indicators)
  • (Independently of the failure drill-down) Use the ES REST API to track the recovery process of under-replicated indices, and ensure that in the end all indices are back to GREEN status
  • After the missing nodes have been repaired/restarted, check that these nodes are now back in the Elasticsearch cluster, using the Elasticsearch REST API (see HOWTO_check_nodes_list_and_nodes_status_in_an_elasticsearch_cluster)
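
The REST API checks in the steps above can be scripted. The following is a minimal Python sketch, not an official PunchPlatform tool: it assumes an Elasticsearch node is reachable over plain HTTP at the hypothetical address in ES_URL (no authentication, no TLS), and the helper names es_get, not_green_indices and report are illustrative only. It relies on two standard Elasticsearch endpoints, GET /_nodes and GET /_cluster/health?level=indices.

```python
import json
import urllib.request

# Assumption: replace with the address of one reachable node of your cluster.
ES_URL = "http://localhost:9200"


def es_get(path, base_url=ES_URL, timeout=10):
    """Send a GET request to the Elasticsearch REST API and decode the JSON body."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return json.load(resp)


def not_green_indices(health):
    """Return {index_name: status} for every index that is not green.

    `health` is the decoded body of GET /_cluster/health?level=indices.
    """
    indices = health.get("indices", {})
    return {
        name: info["status"]
        for name, info in indices.items()
        if info["status"] != "green"
    }


def report(base_url=ES_URL):
    """Print the nodes known to the cluster and the indices still under-replicated."""
    # Nodes as seen by Elasticsearch itself: compare with the expected server list.
    nodes = es_get("/_nodes", base_url)["nodes"]
    print("nodes in cluster:", sorted(n["name"] for n in nodes.values()))

    # Cluster-wide status plus the per-index detail.
    health = es_get("/_cluster/health?level=indices", base_url)
    print("cluster status:", health["status"])
    print("unassigned shards:", health["unassigned_shards"])
    for name, status in sorted(not_green_indices(health).items()):
        print("  ", status.upper(), name)
```

Calling report() repeatedly while shards are being reassigned shows the list of non-green indices shrinking; recovery is complete when the list is empty and the cluster status is green. For shard-level progress, the GET /_cat/recovery endpoint can be queried the same way (it returns plain text by default).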