Platform Metrics
Description
At the platform level, the PunchPlatform itself publishes metrics to monitor the health of its own services. Platform monitoring is started automatically by the Shiva leader node, which ensures that the monitoring is resilient. Make sure Shiva is properly installed to enable platform monitoring metrics.
By default, the metrics are forwarded to Elasticsearch and written to these indices:
platform-health-[YYYY.MM.DD]
platform-monitoring-[YYYY.MM.DD]
platform-monitoring-current
platform-logs-[YYYY.MM.DD]
platform-operator-logs-[YYYY.MM]
Note
A complete and detailed list of the standard platform and tenant event indices can be found in the Tenants configuration indices table.
Metrics are fetched periodically at a fixed interval: 10 seconds on a standalone platform, and usually 60 to 90 seconds in production to reduce the metrics flow.
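As an illustration of the index naming patterns listed above, here is a minimal Python sketch that expands them for a given day. The helper name `index_names_for` is hypothetical, and the sketch assumes the date suffix is simply the calendar date passed in; which time zone the platform uses for that suffix is not stated here.

```python
from datetime import date
from typing import List

# Prefixes copied from the index list above.
DAILY_INDICES = ("platform-health", "platform-monitoring", "platform-logs")
MONTHLY_INDICES = ("platform-operator-logs",)


def index_names_for(day: date) -> List[str]:
    """Expand the documented index patterns for one calendar day (hypothetical helper)."""
    names = [f"{prefix}-{day:%Y.%m.%d}" for prefix in DAILY_INDICES]
    names += [f"{prefix}-{day:%Y.%m}" for prefix in MONTHLY_INDICES]
    # This index has no date suffix: it only holds the latest events.
    names.append("platform-monitoring-current")
    return names
```

Such a helper can be handy when an external tool needs to target a single day of data rather than a wildcard pattern.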
Platform Monitoring index
The platform-monitoring-* index contains the metric events coming from all the PunchPlatform services defined in the properties file.
These metrics report the health status of each service, along with additional useful information.
Two Elasticsearch indices are created:
platform-monitoring-[YYYY.MM.DD]: stores all events, ordered by timestamp
platform-monitoring-current: keeps only the latest event for each service
The first keyword in the name field is used as a prefix for the technology-specific fields (e.g. elasticsearch, kafka, zookeeper, ...).
{
"@timestamp": "2019-06-25T06:58:35.815Z",
"health_code": 1,
"health_name": "green",
"platform.id": "punchplatform-primary",
"name": "elasticsearch.cluster",
"type": "platform",
"elasticsearch": {
...
}
}
Here are the fields shared by all metric events:
- @timestamp (date): Timestamp of the event generation.
- health_code (integer): The service health status as a numeric code. Values: 0 (unknown), 1 (green), 2 (yellow), 3 (red).
- health_name (string): The service health status as a human-readable name. Values: unknown, green, yellow, red.
- platform.id (string): The unique platform identifier defined in the platform properties.
- name (string): The metric name identifier.
- type (string): The origin of the metric; always set to "platform" in this case.
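The shared fields above can be checked and summarized with a few lines of Python. This is only a sketch: the function name `summarize_event` is hypothetical, and it assumes that the "first keyword" of the name field is the part before the first dot, as in "elasticsearch.cluster".

```python
# Code-to-name mapping taken from the health_code / health_name values above.
HEALTH_NAMES = {0: "unknown", 1: "green", 2: "yellow", 3: "red"}


def summarize_event(event: dict) -> dict:
    """Validate the shared fields of a metric event and extract its technology block."""
    code = event["health_code"]
    expected_name = HEALTH_NAMES.get(code)
    if expected_name is None:
        raise ValueError(f"unexpected health_code: {code!r}")
    if event.get("health_name") != expected_name:
        raise ValueError("health_name does not match health_code")
    # Assumption: the technology prefix is the text before the first dot in "name".
    technology = event["name"].split(".", 1)[0]
    return {
        "technology": technology,
        "health": expected_name,
        "details": event.get(technology, {}),
    }


sample = {
    "@timestamp": "2019-06-25T06:58:35.815Z",
    "health_code": 1,
    "health_name": "green",
    "platform.id": "punchplatform-primary",
    "name": "elasticsearch.cluster",
    "type": "platform",
    "elasticsearch": {},  # technology-specific content is elided in the example above
}
summary = summarize_event(sample)
```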
Platform Health index
The platform-health-* index stores an aggregated overview of the platform health. Only one kind of document is inserted; an example can be found below. The document structure has been designed to be as close as possible to the punchplatform.properties file.
Note
This document is especially useful for monitoring the platform with an external automated tool (such as Nagios). Refer to the monitoring guide to learn more.
A new document is added every 10 seconds with the current platform state.
{
"@timestamp": "2019-06-10T13:00:04.300Z",
"storm": {
"health_code": 1,
"health_name": "green",
"clusters": {
"main": {
"nimbus": {
"hosts": {
"punch-elitebook": {
"health_code": 1,
"health_name": "green"
}
}
},
"health_code": 1,
"health_name": "green",
"supervisor": {
"hosts": {
"punch-elitebook": {
"health_code": 1,
"health_name": "green"
}
}
}
}
}
},
"elasticsearch": {
"health_code": 1,
"health_name": "green",
"clusters": {
"es_search": {
"health_code": 1,
"health_name": "green"
}
}
},
"zookeeper": {
"health_code": 1,
"health_name": "green",
"clusters": {
"common": {
"health_code": 1,
"hosts": {
"localhost": {
"health_code": 1,
"health_name": "green"
},
"punch-elitebook": {
"health_code": 1,
"health_name": "green"
}
},
"health_name": "green"
}
}
},
"spark": {
"health_code": 1,
"health_name": "green",
"clusters": {
"spark_main": {
"health_code": 1,
"health_name": "green",
"worker": {
"hosts": {
"localhost": {
"health_code": 1,
"health_name": "green"
}
}
},
"master": {
"hosts": {
"localhost": {
"health_code": 1,
"health_name": "green"
}
}
}
}
}
},
"kafka": {
"health_code": 1,
"health_name": "green",
"clusters": {
"local": {
"brokers": {
"0": {
"health_code": 1,
"health_name": "green"
}
},
"health_code": 1,
"health_name": "green"
}
}
},
"shiva": {
"health_code": 1,
"health_name": "green",
"clusters": {
"common": {
"health_code": 1,
"health_name": "green"
}
}
},
"platform": {
"health_code": 1,
"health_name": "green"
}
}
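An external tool reading this document typically needs to find which levels are not green. The following Python sketch walks the nested structure recursively and reports every level carrying a health_code other than 1. The function name `degraded_components` and the dotted path format are illustrative choices, not part of the platform.

```python
from typing import Iterator, Tuple


def degraded_components(node: dict, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted_path, health_name) for every level whose health_code is not 1 (green)."""
    # Levels without a health_code (e.g. "clusters", "hosts") are only traversed.
    if node.get("health_code", 1) != 1:
        yield path or "<root>", node.get("health_name", "unknown")
    for key, child in node.items():
        if isinstance(child, dict):
            child_path = f"{path}.{key}" if path else key
            yield from degraded_components(child, child_path)


# Reduced, made-up document following the structure shown above.
health_doc = {
    "@timestamp": "2019-06-10T13:00:04.300Z",
    "kafka": {
        "health_code": 2,
        "health_name": "yellow",
        "clusters": {
            "local": {
                "brokers": {"0": {"health_code": 3, "health_name": "red"}},
                "health_code": 2,
                "health_name": "yellow",
            }
        },
    },
    "platform": {"health_code": 1, "health_name": "green"},
}
problems = list(degraded_components(health_doc))
```

Because every level repeats health_code and health_name, the same traversal works whether the problem sits on a whole service, a cluster, or a single host.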
Platform Logs index
The platform-logs-* index contains operator events and job lifecycle events on the platform.
Below is an example log. It carries context information such as the channel, job and user, and a content key that contains details about the action performed.
{
"content": {
"level": "INFO",
"message": "job started",
"event_type": "job_start_cmd"
},
"target": {
"cluster": "main",
"type": "storm"
},
"init": {
"process": {
"name": "channelctl",
"id": "10325@punchplatform"
},
"host": {
"name": "punchplatform"
},
"user": {
"name": "punch"
}
},
...
"platform.tenant": "mytenant",
"platform.channel": "apache_httpd",
"platform.job": "archiving_topology",
"platform.id": "punchplatform-primary"
}
The content.event_type key can be used to build monitoring dashboards that track the job lifecycle. This key can have the following values:
- job_start_cmd: job started by channelctl. The number of workers is also available in content.num_workers
- job_stop_cmd: job stopped by channelctl
- job_start_cmd_failure: channelctl failed to start the job
- job_stop_cmd_failure: channelctl failed to stop the job
- job_started: job started on a shiva worker
- job_running: job still running on a shiva worker
- job_ended: job ended on a shiva worker
- job_restarting: job restarting on a shiva worker
- job_failed: job failed on a shiva worker
- child_process_started: child process started
- child_process_ended: child process ended, logged at info level on success and at error level on failure
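To show how these values might feed a dashboard or an alert, here is a small Python sketch that counts events by type and isolates the failures. The names `is_failure` and `lifecycle_summary` are hypothetical, and the sketch assumes that the error level appears as the string "ERROR" in content.level, by analogy with the "INFO" value in the example above.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Event types from the list above that always denote a failure.
FAILURE_EVENTS = {"job_start_cmd_failure", "job_stop_cmd_failure", "job_failed"}


def is_failure(log: dict) -> bool:
    """Return True if a platform log describes a failed lifecycle step."""
    content = log.get("content", {})
    event_type = content.get("event_type")
    if event_type in FAILURE_EVENTS:
        return True
    # Assumption: a failed child process is logged with level "ERROR".
    level = str(content.get("level", "")).upper()
    return event_type == "child_process_ended" and level == "ERROR"


def lifecycle_summary(logs: List[dict]) -> Tuple[Dict[str, int], List[dict]]:
    """Count logs per event_type and collect the ones that are failures."""
    counts = Counter(
        log["content"]["event_type"]
        for log in logs
        if "event_type" in log.get("content", {})
    )
    failures = [log for log in logs if is_failure(log)]
    return dict(counts), failures


logs = [
    {"content": {"level": "INFO", "event_type": "job_start_cmd"}},
    {"content": {"level": "INFO", "event_type": "child_process_ended"}},
    {"content": {"level": "ERROR", "event_type": "child_process_ended"}},
    {"content": {"level": "ERROR", "event_type": "job_failed"}},
]
counts, failures = lifecycle_summary(logs)
```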