Platform Metrics

Description

At the platform level, the PunchPlatform publishes useful metrics to monitor the health of its own services. Platform monitoring is automatically started by the Shiva leader node, which ensures the monitoring resilience. Make sure Shiva is properly installed to enable platform monitoring metrics.

By default, the metrics are forwarded to Elasticsearch and written to these indices:

  • platform-health-[YYYY.MM.DD]
  • platform-monitoring-[YYYY.MM.DD]
  • platform-monitoring-current
  • platform-logs-[YYYY.MM.DD]
  • platform-operator-logs-[YYYY.MM]

Note

A complete and detailed list of the standard platform and tenant event indices can be found in the Tenants configuration indices table.

Metrics are fetched periodically at a fixed time interval: every 10 seconds on a standalone platform (usually 60 to 90 seconds in production, to reduce the metrics flow).
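The date suffixes in the index names above roll daily ([YYYY.MM.DD]) or, for operator logs, monthly ([YYYY.MM]). As a minimal sketch (the helper name is hypothetical, not part of the platform), the target index for an event can be derived from its timestamp:

```python
from datetime import datetime, timezone

def target_index(prefix: str, ts: datetime, pattern: str = "%Y.%m.%d") -> str:
    """Build the date-suffixed Elasticsearch index name for an event.

    `prefix` is e.g. "platform-monitoring"; the default pattern produces
    the daily [YYYY.MM.DD] suffix, use "%Y.%m" for the monthly
    platform-operator-logs index.
    """
    return f"{prefix}-{ts.strftime(pattern)}"

ts = datetime(2019, 6, 25, 6, 58, 35, tzinfo=timezone.utc)
print(target_index("platform-monitoring", ts))              # platform-monitoring-2019.06.25
print(target_index("platform-operator-logs", ts, "%Y.%m"))  # platform-operator-logs-2019.06
```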

Platform Monitoring index

The platform-monitoring-* index contains the metric events coming from all the PunchPlatform services defined in the properties file. These metrics report each service's health status along with additional useful information.

Two Elasticsearch indices are created:

  • platform-monitoring-[YYYY.MM.DD]: stores all events, ordered by timestamp
  • platform-monitoring-current: keeps only the latest event per service
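The -current index thus behaves like a "latest state" view of the daily index. A minimal sketch of that deduplication logic, assuming events carry the name and @timestamp fields shown below (the sample documents are illustrative):

```python
def latest_by_service(events):
    """Keep only the most recent event per service `name`,
    mimicking what platform-monitoring-current holds."""
    latest = {}
    for event in events:
        name = event["name"]
        # ISO-8601 timestamps of identical format compare correctly as strings
        if name not in latest or event["@timestamp"] > latest[name]["@timestamp"]:
            latest[name] = event
    return list(latest.values())

events = [
    {"name": "elasticsearch.cluster", "@timestamp": "2019-06-25T06:58:25.000Z", "health_name": "yellow"},
    {"name": "elasticsearch.cluster", "@timestamp": "2019-06-25T06:58:35.815Z", "health_name": "green"},
    {"name": "kafka.local", "@timestamp": "2019-06-25T06:58:30.000Z", "health_name": "green"},
]
current = latest_by_service(events)
# two services remain, each represented by its latest event
```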

The first keyword in the name field is used as a prefix for the technology dedicated fields (e.g. elasticsearch, kafka, zookeeper, ...).

{
  "@timestamp": "2019-06-25T06:58:35.815Z",
  "health_code": 1,
  "health_name": "green",
  "platform.id": "punchplatform-primary",
  "name": "elasticsearch.cluster",
  "type": "platform",
  "elasticsearch": {
    ...
  }
}
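Since the first keyword of name selects the technology-specific sub-document, a consumer can look up those dedicated fields generically. A minimal sketch (the elasticsearch sub-document content is hypothetical, since it is elided in the example above):

```python
def technology_fields(event: dict) -> dict:
    """Return the technology-specific sub-document of a metric event,
    using the first keyword of `name` as the field prefix."""
    prefix = event["name"].split(".", 1)[0]  # "elasticsearch.cluster" -> "elasticsearch"
    return event.get(prefix, {})

event = {
    "@timestamp": "2019-06-25T06:58:35.815Z",
    "health_code": 1,
    "health_name": "green",
    "platform.id": "punchplatform-primary",
    "name": "elasticsearch.cluster",
    "type": "platform",
    "elasticsearch": {"cluster_name": "es_search"},  # hypothetical content
}
print(technology_fields(event))  # {'cluster_name': 'es_search'}
```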

Here are the fields shared by all metric events:

  • @timestamp (date)

    Timestamp of the event generation.

  • health_code (integer)

    Defines the service health status as a digit.

    values: 0 (unknown), 1 (green), 2 (yellow), 3 (red)

  • health_name (string)

    Defines the service health status with a human-readable name.

    values: unknown, green, yellow, red

  • platform.id (string)

    The platform unique identifier defined in the platform properties.

  • name (string)

    The metric name identifier.

  • type (string)

    Where the metric comes from; always set to "platform" in this case.
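The health_code and health_name fields are two encodings of the same status, so the mapping can be captured once and reused, for instance when building alerts (a sketch, not a platform API):

```python
HEALTH_NAMES = {0: "unknown", 1: "green", 2: "yellow", 3: "red"}

def health_name(code: int) -> str:
    """Translate a numeric health_code into its human-readable name;
    unexpected codes fall back to "unknown"."""
    return HEALTH_NAMES.get(code, "unknown")

print(health_name(1))  # green
print(health_name(3))  # red
```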

Platform Health index

The platform-health-* index stores an aggregated overview of the platform health. Only one kind of document is inserted; an example can be found below. The document structure has been designed to be as close as possible to the punchplatform.properties.

Note

This document is especially useful to monitor the platform with an external automated tool (like nagios), refer to the monitoring guide to learn more about it.

A new metric is added every 10 seconds with the current platform state.

{
  "@timestamp": "2019-06-10T13:00:04.300Z",
  "storm": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "main": {
        "nimbus": {
          "hosts": {
            "punch-elitebook": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        },
        "health_code": 1,
        "health_name": "green",
        "supervisor": {
          "hosts": {
            "punch-elitebook": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        }
      }
    }
  },
  "elasticsearch": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "es_search": {
        "health_code": 1,
        "health_name": "green"
      }
    }
  },
  "zookeeper": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "common": {
        "health_code": 1,
        "hosts": {
          "localhost": {
            "health_code": 1,
            "health_name": "green"
          },
          "punch-elitebook": {
            "health_code": 1,
            "health_name": "green"
          }
        },
        "health_name": "green"
      }
    }
  },
  "spark": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "spark_main": {
        "health_code": 1,
        "health_name": "green",
        "worker": {
          "hosts": {
            "localhost": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        },
        "master": {
          "hosts": {
            "localhost": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        }
      }
    }
  },
  "kafka": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "local": {
        "brokers": {
          "0": {
            "health_code": 1,
            "health_name": "green"
          }
        },
        "health_code": 1,
        "health_name": "green"
      }
    }
  },
  "shiva": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "common": {
        "health_code": 1,
        "health_name": "green"
      }
    }
  },
  "platform": {
    "health_code": 1,
    "health_name": "green"
  }
}
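Because this document nests a health_code at every level, an external checker can walk it recursively and report each degraded component with its path. A sketch of such a traversal (not an official tool; the sample document is illustrative):

```python
def degraded_components(doc, path=""):
    """Recursively collect (path, health_name) pairs for every
    component whose health_code is present and not green (1)."""
    found = []
    if isinstance(doc, dict):
        code = doc.get("health_code")
        if code is not None and code != 1:
            found.append((path or "platform", doc.get("health_name", "unknown")))
        for key, value in doc.items():
            if isinstance(value, dict):
                found.extend(degraded_components(value, f"{path}.{key}".lstrip(".")))
    return found

doc = {
    "kafka": {
        "health_code": 2, "health_name": "yellow",
        "clusters": {"local": {"health_code": 2, "health_name": "yellow"}},
    },
    "shiva": {"health_code": 1, "health_name": "green"},
}
print(degraded_components(doc))  # [('kafka', 'yellow'), ('kafka.clusters.local', 'yellow')]
```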

Platform Logs index

The platform-logs-* index contains operator events and job lifecycle events on the platform.

Below is a log example. It carries context information (such as the channel, job and user) and a content key which details the performed action:

{
    "content": {
      "level": "INFO",
      "message": "job started",
      "event_type": "job_start_cmd"
    },
    "target": {
      "cluster": "main",
      "type": "storm"
    },
    "init": {
      "process": {
        "name": "channelctl",
        "id": "10325@punchplatform"
      },
      "host": {
        "name": "punchplatform"
      },
      "user": {
        "name": "punch"
      }
    },
    ...
    "platform.tenant": "mytenant",
    "platform.channel": "apache_httpd",
    "platform.job": "archiving_topology",
    "platform.id": "punchplatform-primary"
  }

The content.event_type key can be used to build monitoring dashboards tracking the job lifecycle. It can take the following values:

  • job_start_cmd : job started by channelctl. The number of workers is also available in content.num_workers

  • job_stop_cmd : job stopped by channelctl

  • job_start_cmd_failure : failed to start job by channelctl

  • job_stop_cmd_failure : failed to stop job by channelctl

  • job_started : job started on shiva worker

  • job_running : job still running on shiva worker

  • job_ended : job ended on shiva worker

  • job_restarting : job restarting on shiva worker

  • job_failed : job failed on shiva worker

  • child_process_started : child process started

  • child_process_ended : child process ended. Logged at INFO level on success and at ERROR level on failure
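A dashboard-style aggregation over these events boils down to a count per content.event_type, for instance to spot an unusual rate of job_failed events. A minimal sketch (the sample documents are illustrative):

```python
from collections import Counter

def count_event_types(logs):
    """Count platform log documents per content.event_type."""
    return Counter(log["content"]["event_type"] for log in logs)

logs = [
    {"content": {"event_type": "job_start_cmd"}},
    {"content": {"event_type": "job_started"}},
    {"content": {"event_type": "job_failed"}},
    {"content": {"event_type": "job_failed"}},
]
counts = count_event_types(logs)
print(counts["job_failed"])  # 2
```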