Platform Metrics
Description
At the platform level, the PunchPlatform publishes useful metrics about the health of its own services. Platform monitoring is started automatically by the Shiva leader node, which keeps the monitoring resilient. Make sure Shiva is properly installed to enable platform monitoring metrics.
By default, the metrics are forwarded to Elasticsearch and written to these indices:
platform-health-[YYYY.MM.DD]
platform-monitoring-[YYYY.MM.DD]
platform-monitoring-current
platform-logs-[YYYY.MM.DD]
For now, the metrics are fetched at a fixed interval of 10 seconds.
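The daily index names listed above follow a `prefix-YYYY.MM.DD` pattern. As a minimal sketch, the helper below builds such a name for a given day. The function name `daily_index_name` is hypothetical, and using the UTC date for "today" is an assumption of this example, not something the platform documentation states.

```python
from datetime import date, datetime, timezone

# Index families that get one index per day, as listed above.
DAILY_INDEX_PREFIXES = ("platform-health", "platform-monitoring", "platform-logs")


def daily_index_name(prefix: str, day: date) -> str:
    """Return the daily index name for a prefix, e.g. platform-health-2019.06.25."""
    if prefix not in DAILY_INDEX_PREFIXES:
        raise ValueError(f"unknown daily index prefix: {prefix!r}")
    # date objects accept strftime directives in format specs.
    return f"{prefix}-{day:%Y.%m.%d}"


if __name__ == "__main__":
    # Assumption: daily indices roll over on UTC dates.
    today = datetime.now(timezone.utc).date()
    for prefix in DAILY_INDEX_PREFIXES:
        print(daily_index_name(prefix, today))
```

Such a helper can be handy when a script needs to target a single day rather than a wildcard pattern like `platform-monitoring-*`.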
Platform Monitoring index
The platform-monitoring-* index contains the metric events coming from all the PunchPlatform services defined in the properties file.
These metrics report the health status of each service, along with additional useful information.
Two Elasticsearch indices are created:
platform-monitoring-[YYYY.MM.DD]: stores all events, ordered by timestamp
platform-monitoring-current: keeps only the latest event for each service
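The first keyword in the name field is used as the prefix for the technology-specific fields (e.g. elasticsearch, kafka, zookeeper, ...). For example, an event named elasticsearch.cluster carries its technology-specific fields under the elasticsearch key.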
{
  "@timestamp": "2019-06-25T06:58:35.815Z",
  "health_code": 1,
  "health_name": "green",
  "platform.id": "punchplatform-primary",
  "name": "elasticsearch.cluster",
  "type": "platform",
  "elasticsearch": {
    ...
  }
}
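To illustrate the naming convention, here is a minimal sketch that derives the technology prefix from the name field and looks up the matching sub-document. It assumes that the "first keyword" is the part of the name before the first dot. The function name `technology_fields` is hypothetical, and the content of the `elasticsearch` object is a placeholder because the real fields are elided in the example above.

```python
from typing import Any, Dict, Optional, Tuple


def technology_fields(event: Dict[str, Any]) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Split the name on its first dot and return (prefix, fields stored under that prefix)."""
    prefix = event["name"].split(".", 1)[0]
    return prefix, event.get(prefix)


sample_event = {
    "@timestamp": "2019-06-25T06:58:35.815Z",
    "health_code": 1,
    "health_name": "green",
    "platform.id": "punchplatform-primary",
    "name": "elasticsearch.cluster",
    "type": "platform",
    # Hypothetical placeholder: the real technology-specific fields are not shown here.
    "elasticsearch": {"example_field": "example_value"},
}

prefix, fields = technology_fields(sample_event)
print(prefix, fields)
```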
Here are the fields shared by all metric events:
- @timestamp (date): Timestamp at which the event was generated.
- health_code (integer): Service health status as a numeric code. Values: 0 (unknown), 1 (green), 2 (yellow), 3 (red).
- health_name (string): Service health status as a human-readable name. Values: unknown, green, yellow, red.
- platform.id (string): The platform's unique identifier, as defined in the platform properties.
- name (string): The metric name identifier.
- type (string): The origin of the metric, always set to "platform" in this case.
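The shared fields above lend themselves to a simple consistency check. The following sketch is not part of the PunchPlatform tooling, and the name `check_common_fields` is hypothetical. It verifies that the six shared fields are present, that health_code and health_name agree with the documented value table, and that type is "platform".

```python
from typing import Any, Dict, List

# Documented mapping between health_code and health_name.
HEALTH_NAMES = {0: "unknown", 1: "green", 2: "yellow", 3: "red"}
COMMON_FIELDS = ("@timestamp", "health_code", "health_name", "platform.id", "name", "type")


def check_common_fields(event: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the shared fields (empty when none)."""
    problems = [f"missing field: {field}" for field in COMMON_FIELDS if field not in event]
    if "health_code" in event:
        code = event["health_code"]
        if code not in HEALTH_NAMES:
            problems.append(f"invalid health_code: {code!r}")
        elif "health_name" in event and event["health_name"] != HEALTH_NAMES[code]:
            problems.append(
                f"health_name {event['health_name']!r} does not match health_code {code}"
            )
    if "type" in event and event["type"] != "platform":
        problems.append(f"unexpected type: {event['type']!r}")
    return problems


good_event = {
    "@timestamp": "2019-06-25T06:58:35.815Z",
    "health_code": 1,
    "health_name": "green",
    "platform.id": "punchplatform-primary",
    "name": "elasticsearch.cluster",
    "type": "platform",
}
print(check_common_fields(good_event))
```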
Platform Health index
The platform-health-* index stores an aggregated overview of the platform health. Only one kind of document is inserted; an example is shown below. The document structure is designed to mirror punchplatform.properties as closely as possible.
Note
This document is especially useful for monitoring the platform with an external automated tool (such as Nagios). Refer to the monitoring guide to learn more.
A new document is added every 10 seconds, reflecting the current platform state.
{
  "@timestamp": "2019-06-10T13:00:04.300Z",
  "storm": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "main": {
        "nimbus": {
          "hosts": {
            "punch-elitebook": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        },
        "health_code": 1,
        "health_name": "green",
        "supervisor": {
          "hosts": {
            "punch-elitebook": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        }
      }
    }
  },
  "elasticsearch": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "es_search": {
        "health_code": 1,
        "health_name": "green"
      }
    }
  },
  "zookeeper": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "common": {
        "health_code": 1,
        "hosts": {
          "localhost": {
            "health_code": 1,
            "health_name": "green"
          },
          "punch-elitebook": {
            "health_code": 1,
            "health_name": "green"
          }
        },
        "health_name": "green"
      }
    }
  },
  "spark": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "spark_main": {
        "health_code": 1,
        "health_name": "green",
        "worker": {
          "hosts": {
            "localhost": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        },
        "master": {
          "hosts": {
            "localhost": {
              "health_code": 1,
              "health_name": "green"
            }
          }
        }
      }
    }
  },
  "kafka": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "local": {
        "brokers": {
          "0": {
            "health_code": 1,
            "health_name": "green"
          }
        },
        "health_code": 1,
        "health_name": "green"
      }
    }
  },
  "shiva": {
    "health_code": 1,
    "health_name": "green",
    "clusters": {
      "common": {
        "health_code": 1,
        "health_name": "green"
      }
    }
  },
  "platform": {
    "health_code": 1,
    "health_name": "green"
  }
}
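Because every level of the health document carries its own health_code, an external check can walk the tree to find which components are not green. The sketch below is one possible approach, not the platform's own Nagios integration. The function names are hypothetical, and two choices are assumptions of this example: the `platform` entry is treated as the overall state, and health codes are mapped onto the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Any, Dict, Iterator, List, Tuple

# Assumed mapping from health_code to Nagios plugin exit codes:
# green -> OK (0), yellow -> WARNING (1), red -> CRITICAL (2), unknown -> UNKNOWN (3).
NAGIOS_EXIT = {1: 0, 2: 1, 3: 2, 0: 3}


def iter_health(node: Dict[str, Any], path: str = "") -> Iterator[Tuple[str, int]]:
    """Yield (path, health_code) for every nested level that carries a health_code."""
    if "health_code" in node:
        yield path, node["health_code"]
    for key, value in node.items():
        if isinstance(value, dict):
            yield from iter_health(value, f"{path}/{key}" if path else key)


def unhealthy(doc: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Return every (path, health_code) whose code is not 1 (green)."""
    return [(path, code) for path, code in iter_health(doc) if code != 1]


def nagios_exit_code(doc: Dict[str, Any]) -> int:
    """Translate the overall platform health into a Nagios exit code."""
    code = doc.get("platform", {}).get("health_code", 0)
    return NAGIOS_EXIT.get(code, 3)


# Trimmed, hypothetical document with a yellow Kafka broker.
sample_doc = {
    "@timestamp": "2019-06-10T13:00:04.300Z",
    "kafka": {
        "health_code": 2,
        "health_name": "yellow",
        "clusters": {
            "local": {
                "brokers": {"0": {"health_code": 2, "health_name": "yellow"}},
                "health_code": 2,
                "health_name": "yellow",
            }
        },
    },
    "shiva": {"health_code": 1, "health_name": "green"},
    "platform": {"health_code": 2, "health_name": "yellow"},
}

for path, code in unhealthy(sample_doc):
    print(path, code)
print("exit code:", nagios_exit_code(sample_doc))
```

The slash-separated paths make it easy to report exactly which cluster, host or broker caused a degraded state.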
Platform Logs index
The platform-logs-* index contains operator events and job lifecycle events on the platform.
Here is an example log. It carries context information such as the channel, job and user, plus a content key that holds details about the action performed.
{
  "content": {
    "level": "INFO",
    "message": "job started",
    "event_type": "job_start_cmd"
  },
  "target": {
    "cluster": "main",
    "type": "storm"
  },
  "init": {
    "process": {
      "name": "punchctl",
      "id": "10325@punchplatform"
    },
    "host": {
      "name": "punchplatform"
    },
    "user": {
      "name": "punch"
    }
  },
  ...
  "platform.tenant": "mytenant",
  "platform.channel": "apache_httpd",
  "platform.job": "archiving_topology",
  "platform.id": "punchplatform-primary"
}
The content.event_type key can be used to build monitoring dashboards that track the job lifecycle. It can take the following values:
- job_start_cmd: job started by punchctl. The number of workers is also available in content.num_workers
- job_stop_cmd: job stopped by punchctl
- job_start_cmd_failure: punchctl failed to start the job
- job_stop_cmd_failure: punchctl failed to stop the job
- job_started: job started on a shiva worker
- job_running: job still running on a shiva worker
- job_ended: job ended on a shiva worker
- job_restarting: job restarting on a shiva worker
- job_failed: job failed on a shiva worker
- child_process_started: child process started
- child_process_ended: child process ended, logged at info level on success and at error level on failure
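As a sketch of how these values could feed a dashboard or an alert, the snippet below flags failure events and counts them per job. The helper names are hypothetical. Which events count as failures is an interpretation of the list above, and the assumption that the error level appears as "ERROR" in content.level is inferred from the "INFO" value in the example log (the comparison ignores case).

```python
from collections import Counter
from typing import Any, Dict, Iterable


def is_failure(log: Dict[str, Any]) -> bool:
    """Return True for events that this sketch interprets as failures."""
    content = log.get("content", {})
    event_type = content.get("event_type", "")
    # job_start_cmd_failure, job_stop_cmd_failure and job_failed are explicit failures.
    if event_type.endswith("_failure") or event_type == "job_failed":
        return True
    # child_process_ended is a failure only when logged at error level.
    level = str(content.get("level", "")).upper()
    return event_type == "child_process_ended" and level == "ERROR"


def count_failures_by_job(logs: Iterable[Dict[str, Any]]) -> Counter:
    """Count failure events per platform.job value."""
    return Counter(
        log.get("platform.job", "<unknown>") for log in logs if is_failure(log)
    )


# Hypothetical sample logs, reduced to the fields used here.
sample_logs = [
    {"content": {"level": "INFO", "event_type": "job_start_cmd"},
     "platform.job": "archiving_topology"},
    {"content": {"level": "ERROR", "event_type": "job_failed"},
     "platform.job": "archiving_topology"},
    {"content": {"level": "INFO", "event_type": "child_process_ended"},
     "platform.job": "archiving_topology"},
    {"content": {"level": "ERROR", "event_type": "child_process_ended"},
     "platform.job": "other_job"},
]

print(count_failures_by_job(sample_logs))
```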