
PunchPlatform Supervision (by Nagios)

This chapter explains:

  • the available metrics to monitor the system performance and capacity
  • how to monitor a platform from an external supervision system, such as Nagios

Supervision

This part defines the resources that the supervision system must monitor.

Overview of resources

To ensure that the PunchPlatform system is working, supervision must at least target:

  • The running status of a number of key services (from a Linux point of view)
  • The system-level resource consumption of all PunchPlatform servers (disk space, CPU/RAM usage)
  • The health status indicators published by the PunchPlatform admin service

Optionally, supervision should also target:

  • Backlog levels from Elasticsearch (Admin)
  • Supervisord error status
  • Pacemaker error status
  • Elasticsearch nodes count through REST API

Key Services to watch

On CentOS servers, the supervisord system is not used, so all monitoring checks below have to be transformed into systemctl checks on the same services.
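
For instance, a supervisorctl check such as the Elasticsearch one listed below could be transformed as follows on CentOS. The unit name "elasticsearch" is an assumption here; use the systemd unit names actually deployed on your servers:

# Exits with a non-zero code unless the systemd unit is active
$ sudo systemctl is-active elasticsearch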

On Ubuntu PunchPlatform servers, supervision must ensure that the supervisor systemd service is running:

sudo systemctl | grep supervisor | grep -i running
  • Elasticsearch servers
sudo supervisorctl status elasticsearch | grep RUNNING
  • Storm slaves of all clusters (LTR, LMR, STO)
sudo supervisorctl status storm-supervisor | grep RUNNING
  • Storm masters of all clusters (usually on LMC and LTRs)
$ supervisorctl status storm-nimbus | grep RUNNING
$ supervisorctl status storm-ui | grep RUNNING
  • Zookeeper servers (usually on KAF and LTRs)
$ supervisorctl status zookeeper | grep RUNNING
  • Kafka servers
$ supervisorctl status kafka-<cluster_name> | grep RUNNING

Note

Take a look at punchplatform.properties or at the architecture documents. Usual cluster names are "front" and "back".
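
For example, assuming a Kafka cluster named "front" (adapt this to your own cluster names):

$ supervisorctl status kafka-front | grep RUNNING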

  • Kibana servers
$ ! supervisorctl status | grep kibana | grep -v RUNNING
  • Ceph servers
$ ! sudo systemctl | grep 'ceph-osd-main@' | grep -vi RUNNING
$ ! sudo systemctl | grep 'ceph-osd-mon@' | grep -vi RUNNING

On all cluster servers that use a Virtual IP (LTR nodes, KIB nodes, LMC admin nodes, Grafana nodes), supervision must ensure that the pacemaker and corosync services are active:

$ sudo service corosync status
$ sudo service pacemaker status
  • Shiva servers

On all members of the Shiva clusters, check that the shiva-runner service is active:

$ sudo systemctl | grep shiva-runner | grep -i running

PunchPlatform Health indicator API

To monitor the platform health using a dedicated tool (e.g. Nagios, Centreon, Zabbix), the PunchPlatform exposes a JSON API: an Elasticsearch resource is kept updated with the latest platform health state.

This resource is located at /platform-health-*/_search?sort=@timestamp:desc&size=1. For example, using curl, you can fetch it with:

# The following request returns the LAST monitoring health document for a given platform (Back-Office or LTR).
# You can adapt the "platform.id" filter to target the platform whose health information you want.
# Here, any monitoring document older than 15 minutes is ignored. So if your monitoring query
# does not return the expected "health" field, the platform monitoring is not working
# (a CRITICAL status should then be assigned to the monitoring check).

$ curl -sS 'http://localhost:9200/platform-health-*/_search?sort=@timestamp:desc&size=1&q=platform.id:mytenant-ltr-a%20AND%20platform.health_code:>0%20AND%20@timestamp:>now-15m'

The returned document will look like this one:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1519,
    "max_score": null,
    "hits": [
      {
        "_index": "platform-health-2019.06.10",
        "_type": "_doc",
        "_id": "0Bh5QWsBbYsFzAVtaYls",
        "_score": null,
        "_source": {
          "@timestamp": "2019-06-10T13:00:04.300Z",
          "storm": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "main": {
                "nimbus": {
                  "hosts": {
                    "punch-elitebook": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                },
                "health_code": 1,
                "health_name": "green",
                "supervisor": {
                  "hosts": {
                    "punch-elitebook": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                }
              }
            }
          },
          "elasticsearch": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "es_search": {
                "health_code": 1,
                "health_name": "green"
              }
            }
          },
          "zookeeper": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "common": {
                "health_code": 1,
                "hosts": {
                  "localhost": {
                    "health_code": 1,
                    "health_name": "green"
                  },
                  "punch-elitebook": {
                    "health_code": 1,
                    "health_name": "green"
                  }
                },
                "health_name": "green"
              }
            }
          },
          "spark": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "spark_main": {
                "health_code": 1,
                "health_name": "green",
                "worker": {
                  "hosts": {
                    "localhost": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                },
                "master": {
                  "hosts": {
                    "localhost": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                }
              }
            }
          },
          "kafka": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "local": {
                "brokers": {
                  "0": {
                    "health_code": 1,
                    "health_name": "green"
                  }
                },
                "health_code": 1,
                "health_name": "green"
              }
            }
          },
          "shiva": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "common": {
                "health_code": 1,
                "health_name": "green"
              }
            }
          },
          "platform": {
            "health_code": 1,
            "health_name": "green"
          }
        },
        "sort": [
          1560171604300
        ]
      }
    ]
  }
}

If the platform monitoring is working, then the section you want to examine is under hits > the first hit > _source. For example, using the excellent Jq utility, you could get it with this command:

$ curl -sS -X GET '...' | jq '.hits.hits[0]._source'

At its top level, @timestamp is the last update time in ISO format. The other fields follow the same structure as the punchplatform.properties file, so you will find Kafka, Storm, Elasticsearch sections and so on.
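
For instance, to check the freshness of the monitoring data from your supervision tool, you can extract this timestamp (using the same search request as above):

$ curl -sS -X GET '...' | jq -r '.hits.hits[0]._source["@timestamp"]'
2019-06-10T13:00:04.300Z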

The "health" keys can take these values:

  • (0) unknown - the status cannot be determined (monitoring API down). This is more critical than a red (3) status, because nothing is known about the platform!
  • (1) green - everything is OK
  • (2) yellow - non-nominal mode: a configuration problem is detected or some nodes are down, but the service is still available
  • (3) red - critical failure, the service is down
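
As an illustration, a minimal Nagios-style check could map these codes onto the standard plugin exit statuses (0=OK, 1=WARNING, 2=CRITICAL). The sketch below is not a punchplatform-provided plugin: it reuses the search request shown above, and the code-to-status mapping is only an assumption to adapt to your own alerting policy.

#!/bin/bash
# Hypothetical Nagios plugin sketch: fetch the latest platform health code
# and translate it into a plugin exit status.
ES_QUERY='http://localhost:9200/platform-health-*/_search?sort=@timestamp:desc&size=1&q=platform.id:mytenant-ltr-a%20AND%20platform.health_code:>0%20AND%20@timestamp:>now-15m'
code=$(curl -sS "$ES_QUERY" | jq -r '.hits.hits[0]._source.platform.health_code')
case "$code" in
  1) echo "OK - platform health is green" ; exit 0 ;;
  2) echo "WARNING - platform health is yellow" ; exit 1 ;;
  3) echo "CRITICAL - platform health is red" ; exit 2 ;;
  *) echo "CRITICAL - platform health is unknown (no recent monitoring document)" ; exit 2 ;;
esac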

This document represents the complete platform health. If you only need a subsection of it (say, to only monitor Elasticsearch), feel free to parse it. For example, curl works pretty well with Jq. Depending on your monitoring tool, you can fetch either a string or a numeric value:

$ curl -sS -X GET '...' | jq -rc '.hits.hits[0]._source.elasticsearch.health_name'
yellow

$ curl -sS -X GET '...' | jq -rc '.hits.hits[0]._source.elasticsearch.health_code'
2

The following command-line sample displays all the computed high-level monitoring indicators:

$ curl -sS -X GET 'http://localhost:9200/platform-health-*/_search?sort=@timestamp:desc&size=1&q=platform.id:mytenant-ltr-a%20AND%20platform.health_code:>0%20AND%20@timestamp:>now-15m'| jq -c '.hits.hits[]._source | keys[] as $key | select(.[$key]|type=="object") | {"key":$key,"health_code":.[$key].health_code}'

{"key":"kafka","health_code":1}
{"key":"platform","health_code":1}
{"key":"shiva","health_code":1}
{"key":"storm","health_code":1}
{"key":"zookeeper","health_code":1}

Note

To learn more about Jq, a lightweight and flexible command-line JSON processor, refer to the official documentation.

Platform health computation and forwarding

  • The platform health documents are computed by the platform monitoring tasks running in a channel of one of the platform tenants (the "platform" tenant on a Back-Office platform).
  • On LTRs, the platform health documents are forwarded to the Back-Office through a metrics-forwarding channel (the same channel as the one used for channel metrics or platform metrics).

For information about running the platform health monitoring service, please refer to Platform monitoring task setup.

(Optional) Backlog Metrics Supervision

The "backlog" is the amount of messages that are stored in a Kafka topic, but have not yet be processed by the consuming layer(s) of PunchPlatform channels.

Because the PunchPlatform built-in log channels health monitoring system is based on monitoring latencies inside the log channels, backlog growth is automatically monitored for channels that have autotest latency control paths configured to encompass the Kafka buffering (i.e. with at least one latency control path whose start point is a spout somewhere ABOVE the Kafka layer, and whose end point is somewhere AFTER the Kafka layer). This configuration SHOULD be done for each channel, in order to have automatic monitoring of unusual backlog (which might mean insufficient configured processing capability for this channel, or an incoming messages flood on this specific channel).

Nevertheless, it is additionally possible to configure a separate (not punchplatform-provided) alerting tool to raise alerts in case of a high messages backlog: by sending REST requests to the Elasticsearch REST API (using the metrics backend Virtual IP), Supervision can retrieve mean values of the backlogs of any channel that includes a Kafka spout.
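
As a sketch, such a tool could use an Elasticsearch aggregation to compute the mean backlog over a recent time window. The index pattern (mytenant-metrics-*) and the metric field name (kafka.spout.fetch.backlog) below are placeholders, not guaranteed names: adapt them to the metrics actually published by your channels, and replace <metrics_virtual_ip> with the metrics backend Virtual IP.

# Hypothetical query: average backlog over the last 15 minutes
$ curl -sS -H 'Content-Type: application/json' \
    'http://<metrics_virtual_ip>:9200/mytenant-metrics-*/_search?size=0' -d '
{
  "query": { "range": { "@timestamp": { "gte": "now-15m" } } },
  "aggs": { "mean_backlog": { "avg": { "field": "kafka.spout.fetch.backlog" } } }
}'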