Platform Metrics

Description

At the platform level, the PunchPlatform publishes useful metrics to monitor the health of its own services. Platform monitoring is automatically started by the Shiva leader node, which ensures the monitoring resilience. Make sure Shiva is properly installed to enable platform monitoring metrics.

By default, the metrics are forwarded to Elasticsearch and written to these indices:

  • platform-health-[YYYY.MM.DD]
  • platform-monitoring-[YYYY.MM.DD]
  • platform-monitoring-current
  • platform-logs-[YYYY.MM.DD]
  • platform-operator-logs-[YYYY.MM]

Note

A complete and detailed list of the standard platform and tenant event indices can be found in the Tenants configuration indices table.

Metrics are fetched periodically at a fixed time interval: every 10 seconds on a standalone platform (usually 60 to 90 seconds in production, to reduce the metrics flow).
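The date suffixes in the index names above roll daily ([YYYY.MM.DD]) or, for operator logs, monthly ([YYYY.MM]). As a minimal sketch (the helper name is hypothetical, not part of the platform), the target index for an event can be derived from its timestamp:

```python
from datetime import datetime, timezone

def target_index(prefix: str, ts: datetime, pattern: str = "%Y.%m.%d") -> str:
    """Build the date-suffixed Elasticsearch index name for an event.

    `prefix` is e.g. "platform-monitoring"; the default pattern produces
    the daily [YYYY.MM.DD] suffix, use "%Y.%m" for the monthly
    platform-operator-logs index.
    """
    return f"{prefix}-{ts.strftime(pattern)}"

ts = datetime(2019, 6, 25, 6, 58, 35, tzinfo=timezone.utc)
print(target_index("platform-monitoring", ts))              # platform-monitoring-2019.06.25
print(target_index("platform-operator-logs", ts, "%Y.%m"))  # platform-operator-logs-2019.06
```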

Platform Monitoring index

The platform-monitoring-* index contains the metric events coming from all the PunchPlatform services defined in the properties file. These metrics report each service's health status along with additional useful information.

Two Elasticsearch indices are created:

  • platform-monitoring-[YYYY.MM.DD]: stores all events, ordered by timestamp
  • platform-monitoring-current: keeps only the latest event per service
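The -current index thus behaves like a "latest state" view of the daily index. A minimal sketch of that deduplication logic, assuming events carry the name and @timestamp fields shown below (the sample documents are illustrative):

```python
def latest_by_service(events):
    """Keep only the most recent event per service `name`,
    mimicking what platform-monitoring-current holds."""
    latest = {}
    for event in events:
        name = event["name"]
        # ISO-8601 timestamps of identical format compare correctly as strings
        if name not in latest or event["@timestamp"] > latest[name]["@timestamp"]:
            latest[name] = event
    return list(latest.values())

events = [
    {"name": "elasticsearch.cluster", "@timestamp": "2019-06-25T06:58:25.000Z", "health_name": "yellow"},
    {"name": "elasticsearch.cluster", "@timestamp": "2019-06-25T06:58:35.815Z", "health_name": "green"},
    {"name": "kafka.local", "@timestamp": "2019-06-25T06:58:30.000Z", "health_name": "green"},
]
current = latest_by_service(events)
# two services remain, each represented by its latest event
```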

The first keyword in the name field is used as a prefix for the technology dedicated fields (e.g. elasticsearch, kafka, zookeeper, ...).

{
  "@timestamp": "2019-06-25T06:58:35.815Z",
  "health_code": 1,
  "health_name": "green",
  "platform.id": "punchplatform-primary",
  "name": "elasticsearch.cluster",
  "type": "platform",
  "elasticsearch": {
    ...
  }
}
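Since the first keyword of name selects the technology-specific sub-document, a consumer can look up those dedicated fields generically. A minimal sketch (the elasticsearch sub-document content is hypothetical, since it is elided in the example above):

```python
def technology_fields(event: dict) -> dict:
    """Return the technology-specific sub-document of a metric event,
    using the first keyword of `name` as the field prefix."""
    prefix = event["name"].split(".", 1)[0]  # "elasticsearch.cluster" -> "elasticsearch"
    return event.get(prefix, {})

event = {
    "@timestamp": "2019-06-25T06:58:35.815Z",
    "health_code": 1,
    "health_name": "green",
    "platform.id": "punchplatform-primary",
    "name": "elasticsearch.cluster",
    "type": "platform",
    "elasticsearch": {"cluster_name": "es_search"},  # hypothetical content
}
print(technology_fields(event))  # {'cluster_name': 'es_search'}
```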

Here are the fields shared by all metric events:

  • @timestamp (date)

    Timestamp of the event generation.

  • health_code (integer)

    Defines the service health status as a digit.

    values: 0 (unknown), 1 (green), 2 (yellow), 3 (red)

  • health_name (string)

    Defines the service health status with a human-readable name.

    values: unknown, green, yellow, red

  • platform.id (string)

    The platform unique identifier defined in the platform properties.

  • name (string)

    The metric name identifier.

  • type (string)

    Where the metric comes from; always set to "platform" in this case.
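The health_code and health_name fields are two encodings of the same status, so the mapping can be captured once and reused, for instance when building alerts (a sketch, not a platform API):

```python
HEALTH_NAMES = {0: "unknown", 1: "green", 2: "yellow", 3: "red"}

def health_name(code: int) -> str:
    """Translate a numeric health_code into its human-readable name;
    unexpected codes fall back to "unknown"."""
    return HEALTH_NAMES.get(code, "unknown")

print(health_name(1))  # green
print(health_name(3))  # red
```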

Platform Health index

The platform-health-* index stores an aggregated overview of the platform health. Only one kind of document is inserted; an example can be found below. The document structure has been designed to be as close as possible to the punchplatform.properties.

Note

This document is especially useful to monitor the platform with an external automated tool (like nagios), refer to the monitoring guide to learn more about it.

A new metric is added every 10 seconds with the current platform state.

{
  "@timestamp": "2019-06-10T13:00:04.300Z",
  "storm": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "main": {
        "nimbus": {
          "hosts": {
            "punch-elitebook": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        },
        "health_code": 1,
        "health_name": "green",
        "supervisor": {
          "hosts": {
            "punch-elitebook": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        }
      }
    }
  },
  "elasticsearch": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "es_search": {
        "health_code": 1,
        "health_name": "green"
      }
    }
  },
  "zookeeper": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "common": {
        "health_code": 1,
        "hosts": {
          "localhost": {
            "health_code": 1,
            "health_name": "green"
          },
          "punch-elitebook": {
            "health_code": 1,
            "health_name": "green"
          }
        },
        "health_name": "green"
      }
    }
  },
  "spark": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "spark_main": {
        "health_code": 1,
        "health_name": "green",
        "worker": {
          "hosts": {
            "localhost": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        },
        "master": {
          "hosts": {
            "localhost": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        }
      }
    }
  },
  "kafka": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "local": {
        "brokers": {
          "0": {
            "health_code": 1,
            "health_name": "green"
          }
        },
        "health_code": 1,
        "health_name": "green"
      }
    }
  },
  "shiva": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "common": {
        "health_code": 1,
        "health_name": "green"
      }
    }
  },
  "platform": {
    "health_code": 1,
    "health_name": "green"
  }
}
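Because this document nests a health_code at every level, an external checker can walk it recursively and report each degraded component with its path. A sketch of such a traversal (not an official tool; the sample document is illustrative):

```python
def degraded_components(doc, path=""):
    """Recursively collect (path, health_name) pairs for every
    component whose health_code is present and not green (1)."""
    found = []
    if isinstance(doc, dict):
        code = doc.get("health_code")
        if code is not None and code != 1:
            found.append((path or "platform", doc.get("health_name", "unknown")))
        for key, value in doc.items():
            if isinstance(value, dict):
                found.extend(degraded_components(value, f"{path}.{key}".lstrip(".")))
    return found

doc = {
    "kafka": {
        "health_code": 2, "health_name": "yellow",
        "clusters": {"local": {"health_code": 2, "health_name": "yellow"}},
    },
    "shiva": {"health_code": 1, "health_name": "green"},
}
print(degraded_components(doc))  # [('kafka', 'yellow'), ('kafka.clusters.local', 'yellow')]
```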

Platform Logs index

The platform-logs-* index contains operator events and job lifecycle events on the platform.

Below is a log example. It carries context information (such as the channel, job and user) and a content key which details the performed action:

{
    "content": {
      "level": "INFO",
      "message": "job started",
      "event_type": "job_start_cmd"
    },
    "target": {
      "cluster": "main",
      "type": "storm"
    },
    "init": {
      "process": {
        "name": "channelctl",
        "id": "10325@punchplatform"
      },
      "host": {
        "name": "punchplatform"
      },
      "user": {
        "name": "punch"
      }
    },
    ...
    "platform.tenant": "mytenant",
    "platform.channel": "apache_httpd",
    "platform.job": "archiving_topology",
    "platform.id": "punchplatform-primary"
  }

The content.event_type key can be used to build monitoring dashboards tracking the job lifecycle. It can take the following values:

  • job_start_cmd : job started by channelctl. The number of workers is also available in content.num_workers

  • job_stop_cmd : job stopped by channelctl

  • job_start_cmd_failure : failed to start job by channelctl

  • job_stop_cmd_failure : failed to stop job by channelctl

  • job_started : job started on shiva worker

  • job_running : job still running on shiva worker

  • job_ended : job ended on shiva worker

  • job_restarting : job restarting on shiva worker

  • job_failed : job failed on shiva worker

  • child_process_started : child process started

  • child_process_ended : child process ended. Logged at INFO level on success and at ERROR level on failure
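A dashboard-style aggregation over these events boils down to a count per content.event_type, for instance to spot an unusual rate of job_failed events. A minimal sketch (the sample documents are illustrative):

```python
from collections import Counter

def count_event_types(logs):
    """Count platform log documents per content.event_type."""
    return Counter(log["content"]["event_type"] for log in logs)

logs = [
    {"content": {"event_type": "job_start_cmd"}},
    {"content": {"event_type": "job_started"}},
    {"content": {"event_type": "job_failed"}},
    {"content": {"event_type": "job_failed"}},
]
counts = count_event_types(logs)
print(counts["job_failed"])  # 2
```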