Monitoring Guide

This chapter explains:

  • the available metrics to monitor system performance and capacity
  • how to monitor a platform from an external supervision system, such as Nagios

Supervision

This part describes the resources that the supervision system must monitor.

Overview of resources

To ensure that the PunchPlatform system is working, supervision must at least target:

  • The running status of a number of key Systemd services
  • The system-level resource consumption of all PunchPlatform servers (disk space, CPU/RAM usage, ...)
  • The health status indicators published by PunchPlatform

Optionally, supervision should also target:

  • Backlog levels from Elasticsearch (Admin)
  • Systemd error status
  • Pacemaker error status
  • Elasticsearch node count through the REST API (see the example after this list)
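
For instance, the Elasticsearch node count can be read from the standard cluster health endpoint (adapt the host, or use the metrics backend Virtual IP of your deployment):

# Number of nodes currently part of the Elasticsearch cluster
curl -sS 'http://localhost:9200/_cluster/health' | jq '.number_of_nodes'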

Key services to supervise

On PunchPlatform servers, supervision relies on Systemd to check the status of the key services:

sudo systemctl | grep Punchplatform

elasticsearch.service                                 loaded active running   Punchplatform Elasticsearch                                                  
kafka-local.service                                   loaded active running   Punchplatform Kafka-local                                                    
storm-supervisor.service                              loaded active running   Punchplatform Storm-supervisor                                               
zookeeper.service                                     loaded active running   Punchplatform Zookeeper    
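
For a scripted check, the same verification can be expressed with systemctl is-active; here is a minimal sketch, assuming the unit names shown above (adapt the list to the services actually deployed on each server):

# Report any key PunchPlatform unit that is not in the "active" state
for unit in elasticsearch kafka-local storm-supervisor zookeeper; do
    sudo systemctl is-active --quiet "$unit" || echo "CRITICAL: $unit is not active"
done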

  • On Ceph servers, check that all OSD and MON daemons are running (each check succeeds, with exit code 0, only when no matching unit is in a non-running state):

! sudo systemctl | grep 'ceph-osd-main@' | grep -vi RUNNING
! sudo systemctl | grep 'ceph-osd-mon@' | grep -vi RUNNING

On all cluster servers that use a Virtual IP (LTR nodes, KIB nodes, LMC admin nodes, Grafana nodes), supervision must ensure that the pacemaker and corosync services are active:

sudo service corosync status
sudo service pacemaker status
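
Optionally, the Pacemaker cluster and resource status (including the Virtual IP resource) can be checked in one shot with crm_mon, which ships with Pacemaker, if it is available on the node:

# Display the Pacemaker cluster status (nodes and resources) once and exit
sudo crm_mon -1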

PunchPlatform Health indicator API

To monitor the platform health using a dedicated tool (Nagios, Centreon, Zabbix, ...), PunchPlatform exposes a JSON API: an Elasticsearch resource is kept up to date with the latest platform health state.

This resource is located at /platform-health-*/_search?sort=@timestamp:desc&size=1. For example, using curl, you can fetch it with:

# The following request returns the LAST monitoring health document for a given platform (Back-Office or LTR).
# Adapt the "platform.id" filter to target the platform you want health information about.
# Here, we ignore any monitoring document older than 15 minutes. So if your monitoring query
# does not return the expected "health" field, the platform monitoring is not working (a CRITICAL status should then be assigned to the monitoring check).

curl -sS 'http://localhost:9200/platform-health-*/_search?sort=@timestamp:desc&size=1&q=platform.id:mytenant-ltr-a%20AND%20platform.health_code:>0%20AND%20@timestamp:>now-15m' 

The returned document will look like this one:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1519,
    "max_score": null,
    "hits": [
      {
        "_index": "platform-health-2019.06.10",
        "_type": "_doc",
        "_id": "0Bh5QWsBbYsFzAVtaYls",
        "_score": null,
        "_source": {
          "@timestamp": "2019-06-10T13:00:04.300Z",
          "storm": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "main": {
                "nimbus": {
                  "hosts": {
                    "punch-elitebook": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                },
                "health_code": 1,
                "health_name": "green",
                "supervisor": {
                  "hosts": {
                    "punch-elitebook": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                }
              }
            }
          },
          "elasticsearch": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "es_search": {
                "health_code": 1,
                "health_name": "green"
              }
            }
          },
          "zookeeper": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "common": {
                "health_code": 1,
                "hosts": {
                  "localhost": {
                    "health_code": 1,
                    "health_name": "green"
                  },
                  "punch-elitebook": {
                    "health_code": 1,
                    "health_name": "green"
                  }
                },
                "health_name": "green"
              }
            }
          },
          "spark": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "spark_main": {
                "health_code": 1,
                "health_name": "green",
                "worker": {
                  "hosts": {
                    "localhost": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                },
                "master": {
                  "hosts": {
                    "localhost": {
                      "health_code": 1,
                      "health_name": "green"
                    }
                  }
                }
              }
            }
          },
          "kafka": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "local": {
                "brokers": {
                  "0": {
                    "health_code": 1,
                    "health_name": "green"
                  }
                },
                "health_code": 1,
                "health_name": "green"
              }
            }
          },
          "shiva": {
            "health_code": 1,
            "health_name": "green",
            "clusters": {
              "common": {
                "health_code": 1,
                "health_name": "green"
              }
            }
          },
          "platform": {
            "health_code": 1,
            "health_name": "green"
          }
        },
        "sort": [
          1560171604300
        ]
      }
    ]
  }
}

If the platform monitoring is working, then the section you want to examine is under hits > the first hit > _source. For example, using the excellent Jq utility, you could get it with this command:

curl -sS -X GET '...' | jq '.hits.hits[0]._source'

At its top level, @timestamp is the last update time in ISO format. The other fields follow the same structure as the punchplatform.properties file, so you will find Kafka, Storm, Elasticsearch sections and so on.

The "health" keys can take these values:

  • (0) unknown - the status cannot be determined (monitoring API down). This is more critical than a red (3) status, because nothing is known about the platform state!
  • (1) green - everything is OK
  • (2) yellow - non-nominal mode: a configuration problem is detected or some nodes are down, but the service is still available
  • (3) red - critical failure, the service is down

This document represents the complete platform health. If you only need a subsection of it (say, to only monitor Elasticsearch), feel free to parse it; curl works well combined with jq. Depending on your monitoring tool, you can fetch either a string or a numeric value:

curl -sS -X GET '...' | jq -rc '.hits.hits[0]._source.elasticsearch.health_name'
yellow

curl -sS -X GET '...' | jq -rc '.hits.hits[0]._source.elasticsearch.health_code'
2
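
Putting this together, here is a minimal Nagios/Centreon-style check sketch that maps these health codes to standard plugin exit codes; the URL, the targeted subsystem and the messages are examples to adapt to your platform:

#!/bin/bash
# Minimal plugin-style sketch: fetch the latest Elasticsearch health_code and map it to
# standard exit codes (0=OK, 1=WARNING, 2=CRITICAL). Adapt the URL and the "platform.id" filter.
URL='http://localhost:9200/platform-health-*/_search?sort=@timestamp:desc&size=1&q=platform.id:mytenant-ltr-a%20AND%20platform.health_code:>0%20AND%20@timestamp:>now-15m'
code=$(curl -sS "$URL" | jq -r '.hits.hits[0]._source.elasticsearch.health_code')
case "$code" in
  1) echo "OK - elasticsearch is green"; exit 0 ;;
  2) echo "WARNING - elasticsearch is yellow"; exit 1 ;;
  3) echo "CRITICAL - elasticsearch is red"; exit 2 ;;
  *) echo "CRITICAL - no recent health document (platform monitoring may be down)"; exit 2 ;;
esac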

The following command-line sample displays all computed high-level monitoring indicators:

curl -sS -X GET 'http://localhost:9200/platform-health-*/_search?sort=@timestamp:desc&size=1&q=platform.id:mytenant-ltr-a%20AND%20platform.health_code:>0%20AND%20@timestamp:>now-15m'| jq -c '.hits.hits[]._source | keys[] as $key | select(.[$key]|type=="object") | {"key":$key,"health_code":.[$key].health_code}'

{"key":"kafka","health_code":1}
{"key":"platform","health_code":1}
{"key":"shiva","health_code":1}
{"key":"storm","health_code":1}
{"key":"zookeeper","health_code":1}

Note

To learn more about Jq, a lightweight and flexible command-line JSON processor, refer to the official documentation.

Platform health computation and forwarding

  • The platform health documents are computed by the platform monitoring tasks running in a channel of one of the platform tenants (the "platform" tenant on a back-office platform).
  • On an LTR, the platform health documents are forwarded to the back-office through a metrics-forwarding channel (the same as for channel metrics or platform metrics).

For information about running the platform health monitoring service, please refer to Platform monitoring task setup.

(Optional) Backlog Metrics Supervision

The "backlog" is the amount of messages that are stored in a Kafka topic, but have not yet be processed by the consuming layer(s) of PunchPlatform channels.

Because the PunchPlatform built-in log channel health monitoring system is based on monitoring latencies inside the log channels, a rising backlog is automatically detected for channels whose autotest latency control paths are configured to encompass the Kafka buffering (i.e. with at least one latency control path starting at a spout somewhere ABOVE the Kafka layer and ending somewhere AFTER the Kafka layer). This configuration SHOULD be done for each channel, in order to automatically detect an unusual backlog (which might mean insufficient configured processing capacity for this channel, or a flood of incoming messages on this specific channel).

Nevertheless, it is also possible to configure a separate (not PunchPlatform-provided) alerting tool to raise alerts in case of a high message backlog: by sending REST requests to the Elasticsearch REST API (using the metrics backend Virtual IP), the supervision system can retrieve mean backlog values for any channel that includes a Kafka spout.
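
As an illustration only, such a request could rely on a standard Elasticsearch terms/avg aggregation. In the sketch below, the host, index pattern and field names (mytenant-metrics-*, tags.channel, kafka.spout.backlog) are placeholders: replace them with the metrics backend Virtual IP and the metric names actually published by your platform.

# Illustrative sketch only: mean Kafka backlog per channel over the last 5 minutes.
# Index pattern and field names are placeholders to adapt to your metrics mapping.
curl -sS -X POST 'http://localhost:9200/mytenant-metrics-*/_search?size=0' \
  -H 'Content-Type: application/json' -d '
{
  "query": { "range": { "@timestamp": { "gte": "now-5m" } } },
  "aggs": {
    "per_channel": {
      "terms": { "field": "tags.channel" },
      "aggs": { "mean_backlog": { "avg": { "field": "kafka.spout.backlog" } } }
    }
  }
}'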