
PunchPlatform Supervision (by Nagios)

This chapter explains:

  • the available metrics to monitor the system performance and capacity
  • how to monitor a platform from an external supervision system, such as Nagios

Supervision

This part defines the resources to monitor from the supervision system.

Overview of resources

To ensure that the PunchPlatform system is working, supervision must at least target:

  • The running status of a number of key services (from a Linux point of view)
  • The system-level resources consumption of all PunchPlatform servers (disk space, CPU/RAM usage; see the sketch just after this list)
  • The health status indicators published by the PunchPlatform admin service
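
As an illustration, here is a minimal sketch of such system-level checks using standard Linux commands; in practice you would rather rely on your supervision tool's standard probes (for instance the Nagios check_disk and check_load plugins), and the 80% disk usage threshold below is an arbitrary example:

$ df -hP | awk 'NR > 1 && int($5) > 80 {print "WARNING: " $6 " is " $5 " full"}'
$ free -m | awk '/^Mem:/ {print "available memory (MB): " $7}'
$ uptime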

Optionally, supervision should also target:

  • Backlog levels from Elasticsearch (Admin)
  • Supervisord error status
  • Pacemaker error status
  • Elasticsearch nodes count through the REST API (see the example below)
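
For instance, the Elasticsearch nodes count can be fetched from the standard cluster health endpoint (the host and port below are examples, adapt them to your platform):

$ curl -sS 'http://localhost:9200/_cluster/health?pretty' | grep number_of_nodes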

Key Services to watch

On the following PunchPlatform servers, supervision must ensure that the corresponding supervisord-managed service is running:

  • Elasticsearch servers
$ supervisorctl status elasticsearch | grep RUNNING
  • Storm slaves of all clusters (LTR, LMR, STO)
$ supervisorctl status storm-supervisor | grep RUNNING
  • Storm masters of all clusters (usually on LMC and LTRs)
$ supervisorctl status storm-nimbus | grep RUNNING
$ supervisorctl status storm-ui | grep RUNNING
  • Zookeeper servers (usually on KAF and LTRs)
$ supervisorctl status zookeeper | grep RUNNING
  • Kafka servers
$ supervisorctl status kafka-<cluster_name> | grep RUNNING

Note

Take a look at punchplatform.properties or at the architecture documents. Usual cluster names are "front" and "back". You can also list the configured Kafka services directly, as shown below.
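
If you are unsure of the cluster names configured on a given server, you can simply list the Kafka programs known to supervisord:

$ supervisorctl status | grep kafka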

  • Kibana servers
$ supervisorctl status | grep kibana | grep -v RUNNING

This command should print nothing: any line it outputs is a Kibana instance that is not in the RUNNING state.
  • Ceph servers
$ sudo systemctl | grep 'ceph-osd-main@' | grep -v running
$ sudo systemctl | grep 'ceph-osd-mon@' | grep -v running

As for Kibana, these commands should print nothing: any line they output is a Ceph daemon that is not in the running state.

On all cluster servers that use a virtual IP (LTR nodes, KIB nodes, LMC admin nodes, Grafana nodes), supervision must ensure that the corosync and pacemaker services are active:

$ sudo service corosync status
$ sudo service pacemaker status
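
Beyond the service status, you may also want to check that the cluster resources themselves (in particular the virtual IP) are properly started. As an additional, hedged example, the standard pacemaker crm_mon tool can report this in one shot (the resource names it displays depend on your platform configuration):

$ sudo crm_mon -1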

PunchPlatform Health indicator API

To monitor the platform health using a dedicated tool (e.g. Nagios, Centreon, Zabbix, ...), the PunchPlatform exposes a JSON API: an Elasticsearch resource is kept updated with the latest platform health state.

This resource is located at /platform-health-*/_search?sort=@timestamp:desc&size=1. For example, you can fetch it with curl:

$ curl -sS -X GET 'http://localhost:9200/platform-health-*/_search?sort=@timestamp:desc&size=1'

The returned document will look like this one:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1519,
    "max_score": null,
    "hits": [
      {
        "_index": "platform-health-2019.06.10",
        "_type": "_doc",
        "_id": "0Bh5QWsBbYsFzAVtaYls",
        "_score": null,
        "_source": {
          "@timestamp": "2019-06-10T13:00:04.300Z",
          "storm": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "main": {
                "nimbus": {
                  "hosts": {
                    "punch-elitebook": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                },
                "health_code": 1,
                "health_name": "green",
                "supervisor": {
                  "hosts": {
                    "punch-elitebook": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                }
              }
            }
          },
          "elasticsearch": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "es_search": {
                "health_code": 1,
                "health_name": "green"
              }
            }
          },
          "zookeeper": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "common": {
                "health_code": 1,
                "hosts": {
                  "localhost": {
                    "health_code": 1,
                    "health_name": "green"
                  },
                  "punch-elitebook": {
                    "health_code": 1,
                    "health_name": "green"
                  }
                },
                "health_name": "green"
              }
            }
          },
          "spark": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "spark_main": {
                "health_code": 1,
                "health_name": "green",
                "worker": {
                  "hosts": {
                    "localhost": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                },
                "master": {
                  "hosts": {
                    "localhost": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                }
              }
            }
          },
          "kafka": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "local": {
                "brokers": {
                  "0": {
                    "health_code": 1,
                    "health_name": "green"
                  }
                },
                "health_code": 1,
                "health_name": "green"
              }
            }
          },
          "shiva": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "common": {
                "health_code": 1,
                "health_name": "green"
              }
            }
          },
          "platform": {
            "health_code": 1,
            "health_name": "green"
          }
        },
        "sort": [
          1560171604300
        ]
      }
    ]
  }
}

The section you want to keep is under hits > the first hit > _source. For example, using the excellent jq utility, you can get it with this command:

$ curl -sS -X GET '...' | jq '.hits.hits[0]._source'

At the top level, @timestamp is the last update time in ISO format. The other fields follow the same structure as punchplatform.properties, so you will find Kafka, Storm, Elasticsearch sections and so on.

The "health" keys can take these values:

  • (0) unknown - the status cannot be determined (monitoring API down)
  • (1) green - everything is OK
  • (2) yellow - non-nominal mode: a configuration problem has been detected or some nodes are down, but the service is still available
  • (3) red - critical failure, the service is down

This document represents the complete platform health. If you only need a subsection of it (say, to only monitor Elasticsearch), feel free to parse it; for example, curl works well together with jq. Depending on your monitoring tool, you can fetch either a string or a numeric value:

$ curl -sS -X GET '...' | jq -rc '.hits.hits[0]._source.elasticsearch.health_name'
yellow

$ curl -sS -X GET '...' | jq -rc '.hits.hits[0]._source.elasticsearch.health_code'
2
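
If your supervision tool expects Nagios-style plugin exit codes, the health_code values can be mapped onto them. Here is a minimal, hedged sketch of such a check script (the Elasticsearch URL is an example, and the mapping from health codes 0/1/2/3 to Nagios UNKNOWN/OK/WARNING/CRITICAL is a suggested convention, not a PunchPlatform-provided plugin):

#!/bin/bash
# Sketch of a Nagios-style check: maps the overall platform health_code to a Nagios exit code.
ES_URL="http://localhost:9200"    # example URL, adapt to your metrics backend
CODE=$(curl -sS "$ES_URL/platform-health-*/_search?sort=@timestamp:desc&size=1" \
  | jq -r '.hits.hits[0]._source.platform.health_code')

case "$CODE" in
  1) echo "OK - platform is green"; exit 0 ;;                # green  -> OK
  2) echo "WARNING - platform is yellow"; exit 1 ;;          # yellow -> WARNING
  3) echo "CRITICAL - platform is red"; exit 2 ;;            # red    -> CRITICAL
  *) echo "UNKNOWN - platform health unavailable"; exit 3 ;; # unknown or fetch error
esac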

Note

To learn more about jq, a lightweight and flexible command-line JSON processor, refer to its official documentation.

(Optional) Backlog Metrics Supervision

The "backlog" is the amount of messages that are stored in a Kafka topic, but have not yet be processed by the consuming layer(s) of PunchPlatform channels.

Because the PunchPlatform built-in log channels health monitoring is based on latencies measured inside the log channels, a rising backlog is automatically monitored for channels whose latency control paths are configured to encompass the Kafka buffering (i.e. with at least one latency control path with a start point at a spout somewhere ABOVE the Kafka layer, and an end point somewhere AFTER the Kafka layer). This configuration SHOULD be done for each channel, in order to automatically detect an unusual backlog (which might indicate insufficient processing capacity configured for this channel, or a flood of incoming messages on this specific channel).

Nevertheless, it is additionally possible to configure a separate (not PunchPlatform-provided) alerting tool to raise alerts in case of a high message backlog: by sending REST requests to the Elasticsearch API of the metrics backend (through its virtual IP), the supervision system can retrieve mean values of the backlog of any channel that includes a Kafka spout. A hedged example is sketched below.
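
As a sketch only: the metrics index pattern and the field names below are placeholders to be replaced with the actual names used by your platform's metrics documents. The query averages a backlog value over the last five minutes using a standard Elasticsearch aggregation:

$ curl -sS -X GET 'http://localhost:9200/<metrics-index-pattern>/_search?size=0' \
    -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "<metric.name.field>": "<backlog.metric.name>" } },
        { "range": { "@timestamp": { "gte": "now-5m" } } }
      ]
    }
  },
  "aggs": {
    "mean_backlog": { "avg": { "field": "<backlog.value.field>" } }
  }
}'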