Platform Monitoring¶
Abstract
This chapter explains how to configure the platform monitoring application.
The platform-monitoring punch application is a native application that is periodically executed to compute a synthetic platform-level health document.
For more information about the punch platform health API, please refer to the Monitoring Guide.
Overview¶
The platform-monitoring application is in charge of monitoring:
- Kafka
- Shiva
- Storm
- Elasticsearch
- Gateway
- Zookeeper
- Spark
- Minio
- Clickhouse
It produces platform-components monitoring documents indexed in the platform-monitoring-* Elasticsearch indices. It also computes a synthetic platform health document indexed in the platform-health-* Elasticsearch indices.
Depending on the platform-level reporters configured in the punchplatform-deployment.settings, the documents produced by platform-monitoring are either written directly to Elasticsearch (if on a back-office) or written to a Kafka topic, to be forwarded later to the back-office where a central Elasticsearch back-end is used to monitor a complete fleet of platforms.
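To quickly see where these documents land, you can list the corresponding indices. The Elasticsearch address below is the standalone default and is only an assumption, adapt it to your platform:
# list the monitoring and health indices produced by platform-monitoring
curl "localhost:9200/_cat/indices/platform-monitoring-*,platform-health-*?v"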
Configuration¶
The platform-monitoring application is run in Shiva. Define a monitoring channel in your platform tenant.
Here is the channel_structure.yaml example shipped with the standalone:
version: '6.0'
start_by_tenant: true
stop_by_tenant: true
applications:
  - name: platform_health
    runtime: shiva
    cluster: common
    command: platform-monitoring
    args:
      - platform_monitoring.yaml
      - --childopts
      - -Xms256m -Xmx256m
  - name: local_events_dispatcher
    runtime: shiva
    cluster: common
    command: punchlinectl
    args:
      - start
      - --punchline
      - local_events_dispatcher.yaml
  - name: channels_monitoring
    runtime: shiva
    cluster: common
    command: channels-monitoring
    args:
      - channels_monitoring.yaml
      - --childopts
      - -Xms256m -Xmx256m
resources:
  - type: kafka_topic
    name: platform-events
    cluster: common
    partitions: 1
    replication_factor: 1
The platform-monitoring application takes a single configuration file argument (the platform_monitoring.yaml referenced in the channel above):
monitoring_interval: 60
services:
- kafka
- shiva
- storm
- zookeeper
- elasticsearch
- clickhouse
- spark
- minio
- gateway
This sample configuration means that Kafka, Shiva, Storm, Zookeeper, Elasticsearch, Clickhouse, Spark, Minio and the Gateway will all be monitored every minute.
Platform tenant
For production deployments, the platform-monitoring application should be configured inside a dedicated 'platform' tenant. This makes it possible to set up the retention and other parameters associated with platform monitoring separately from other business data.
A typical production example can be found in Reference Architecture configuration examples.
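On a standalone, the channel and the application configuration files it references typically sit together in the tenant tree. The path below is an assumption, adapt it to your installation:
# the channel structure plus the configuration files referenced in its args
ls conf/tenants/platform/channels/monitoring/
# channel_structure.yaml  platform_monitoring.yaml  channels_monitoring.yaml  local_events_dispatcher.yaml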
Parameters¶
Mandatory¶
- monitoring_interval (integer): the interval in seconds between two health checks of the services.
- services (string array): the list of services to monitor.
  Values: "kafka", "shiva", "storm", "elasticsearch", "zookeeper", "spark", "minio", "gateway", "clickhouse"
Optional¶
- reporters (list of reporters): a list of reporters. If this field is empty or missing, platform-monitoring uses the reporters configured in the punchplatform_operator section of the punchplatform-deployment.settings as default value.
Security¶
Here is a complete example with security activated:
monitoring_interval: 60
services:
- kafka
- shiva
- storm
- zookeeper
- elasticsearch
- clickhouse
- spark
- minio
- gateway
security:
  elasticsearch_clients:
    es_search:
      credentials:
        username: USER
        password: PASSWORD
      ssl_enabled: true
      ssl_private_key: private_key.pem
      ssl_certificate: cert.pem
      ssl_trusted_certificate: ca.pem
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
  gateway_clients:
    common:
      ssl_enabled: true
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
      ssl_keystore_location: keystore.jks
      ssl_keystore_pass: PASSWORD
  zookeeper_clients:
    common:
      ssl_enabled: true
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
      ssl_keystore_location: keystore.jks
      ssl_keystore_pass: PASSWORD
  kafka_clients:
    common:
      ssl_enabled: true
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
      ssl_keystore_location: keystore.jks
      ssl_keystore_pass: PASSWORD
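The truststore and keystore referenced above are regular JKS stores. If you want to double-check what a store contains before wiring it into the configuration (the file name and password below are the placeholders used above), you can inspect it with keytool:
# list the certificates contained in the truststore
keytool -list -keystore truststore.jks -storepass PASSWORD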
Try It¶
On a standalone, simply start the monitoring channel as follows:
channelctl --tenant platform start --channel monitoring
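After a minute or so (the monitoring_interval of the sample configuration), you can check that health documents are being produced. The Elasticsearch address is the standalone default and is only an assumption here:
# fetch one of the synthetic platform health documents
curl "localhost:9200/platform-health-*/_search?pretty&size=1"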
Monitoring rules¶
This section details the monitoring rules for each component. If you encounter an issue with platform monitoring, refer to it to understand your results.
Global rules¶
These rules apply to all the components described below:
- Health levels: Unknown > Red > Yellow > Green. As a result, when monitoring a single component (e.g. storm.nimbus, spark.worker), the worst level always wins.
- The overall health of a component (e.g. Storm, Kafka) is the worst health level among its clusters. For example: if your Kafka component is defined with 3 clusters (one red and two green), your Kafka health will be red.
- The overall health of the platform is the worst health level among its components (e.g. Kafka, Storm). For example: if all components have a green status except one red one, your platform health will be red.
- Only documents whose metric name is of the form "component.cluster" are used to build the platform health document that feeds the platform monitoring dashboard. The other documents are useful to investigate or to build/enrich your own monitoring if needed.
Zookeeper¶
The supervision of Zookeeper is done through these different stages:
- Checking the health of the Zookeeper nodes
- Checking the health of each Zookeeper cluster, based on the health of its nodes
- Checking the health of the Zookeeper component, based on the health of its clusters
Zookeeper nodes monitoring¶
We use a "mntr" request on each node to determine if the node is up or not
Metric Name : zookeeper.mntr
Test | Status when test fail | Associated Alert |
---|---|---|
Connect and send a "mntr" message to zookeeper node | red | Cannot contact zookeeper node |
Answer from zookeeper node contains more than one line | red | Node answers to mntr request, but cluster is down |
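You can reproduce this check by hand with the standard Zookeeper four-letter command. The 2181 client port is the usual default and, on recent Zookeeper versions, mntr must be listed in the 4lw.commands.whitelist property; both points are assumptions about your setup:
# a healthy node answers with a multi-line report (zk_version, zk_server_state, ...)
echo mntr | nc $zookeeper_host 2181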
Zookeeper cluster monitoring¶
Each alert message from the Zookeeper node monitoring is reported in this document.
Metric Name: zookeeper.cluster
Rule | Cluster status |
---|---|
Exception during cluster monitoring | unknown |
At least one zookeeper node is green and others are red | yellow |
No green zookeeper nodes | red |
All zookeeper nodes green | green |
Kafka¶
The supervision of Kafka is done through these different stages:
- Checking the health of the broker nodes
- Checking the health of each Kafka cluster, based on the health of its brokers
- Checking the health of the Kafka component, based on the health of its clusters
Kafka broker monitoring¶
Metric Name: kafka.broker
Test | Status when test fails | Associated Alert |
---|---|---|
Connect to Kafka broker node | red | Can't contact the broker |
Kafka cluster monitoring¶
Each alert message from the Kafka broker monitoring is reported in this document.
Metric Name: kafka.cluster
Rule | Cluster status |
---|---|
Exception during cluster monitoring | unknown |
If at least one partition has no leader or an empty ISR | red |
If at least one partition has an ISR list different from the replicas list | yellow |
If one broker is green and others are red (and the two rules above do not apply) | yellow |
If all brokers are red | red |
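The leader and ISR information behind these rules can be checked by hand with the standard Kafka CLI. The bootstrap address below is an assumption, and the tool may be named kafka-topics (without .sh) depending on your packaging:
# shows, for every partition, its Leader, Replicas and Isr lists;
# a partition whose Isr differs from its Replicas turns the cluster yellow
kafka-topics.sh --bootstrap-server localhost:9092 --describe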
Gateway¶
The supervision of the Gateway is done through these different stages:
- Checking the health of the Gateway nodes
- Checking the health of each Gateway cluster, based on the health of its nodes
- Checking the health of the Gateway component, based on the health of its clusters
Gateway node monitoring¶
Metric Name: gateway.node
Health request: curl $gateway_host:$gateway_port/management/health
Test | Status when test fails | Associated Alert |
---|---|---|
Fail to execute request on health endpoint | red | Unable to get health or metrics from Gateway host |
If request returns "UNKNOWN" | unknown | Gateway health query returned an unknown status |
If request does not return an "UNKNOWN" or "UP" status | red | Gateway health query did not return a green status |
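You can run the same check by hand; a green node answers with an "UP" status:
# a healthy node typically answers something like {"status":"UP"}
curl "$gateway_host:$gateway_port/management/health"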
Gateway cluster monitoring¶
Each alert message from the Gateway node monitoring is reported in this document.
Metric Name: gateway.cluster
Rule | Cluster status |
---|---|
Exception during cluster monitoring | unknown |
All nodes from cluster have a red or unknown status | red |
At least one node is green and others are red or unknown | yellow |
Shiva¶
The supervision of Shiva is done through these different stages:
- Checking the health of each Shiva cluster
- Checking the health of the Shiva component, based on the health of its clusters
Shiva cluster monitoring¶
Metric Name: shiva.cluster
Rule | Cluster status |
---|---|
Exception during cluster monitoring | unknown |
Unable to get Shiva health because assignment topic cannot be read for this cluster | unknown |
Leader did not publish any message in the assignment topic | red |
Leader did not publish a message during the last 3 minutes in the assignment topic | red |
Leader has been elected in the last 2 minutes | yellow |
The unique Shiva worker is down | red |
One Shiva worker is down (at least one other is still green) | yellow |
At least two Shiva workers are down | red |
Clickhouse¶
The supervision of Clickhouse is done through these different stages:
- Checking the health of the Clickhouse nodes
- Checking the health of each Clickhouse shard, based on the health of its nodes
- Checking the health of each Clickhouse cluster, based on the health of its shards
- Checking the health of the Clickhouse component, based on the health of its clusters
Clickhouse node monitoring¶
Metric Name: clickhouse.node
Health request: curl $clickhouse_host:$clickhouse_port/?query="SELECT%20*%20FROM%20system.clusters%20FORMAT%20JSON"
Test | Status when test fails | Associated Alert |
---|---|---|
Fail to execute request on health endpoint | red | Unable to get health from Clickhouse HTTP for node |
If the health request response is not 200 | unknown | HTTP response from clickhouse node is not 200 |
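For a quick manual check of a single node, Clickhouse also exposes a simple ping endpoint on its HTTP port (8123 is the Clickhouse default HTTP port, an assumption here):
# a healthy node answers "Ok."
curl "$clickhouse_host:8123/ping"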
Clickhouse shard monitoring¶
Each alert message from the Clickhouse node monitoring is reported in this document.
Metric Name: clickhouse.shard
Rule | Shard status |
---|---|
All nodes unhealthy in shard | red |
At least one node unhealthy in shard but others are green | yellow |
Clickhouse cluster monitoring¶
Each alert message from the Clickhouse node & shard monitoring is reported in this document.
Metric Name: clickhouse.cluster
Rule | Cluster status |
---|---|
Exception during cluster monitoring | unknown |
At least one shard is yellow (others are green) | yellow |
At least one shard is red | red |
Minio¶
The supervision of Minio is done through these different stages:
- Checking the health of the Minio nodes
- Checking the health of each Minio cluster, based on the health of its nodes
- Checking the health of the Minio component, based on the health of its clusters
Minio node monitoring¶
Metric Name: minio.node
Health request: curl $minio_host:$minio_port/minio/health/cluster
Rule | Cluster status |
---|---|
Response from health request is not 200 | red |
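Since this rule only looks at the HTTP return code, a manual check can be reduced to printing that code (9000 is the usual Minio port, an assumption here):
# prints the HTTP status code of the cluster health endpoint; 200 means healthy
curl -s -o /dev/null -w '%{http_code}\n' "$minio_host:9000/minio/health/cluster"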
Minio cluster monitoring¶
Each alert message from the Minio node monitoring is reported in this document.
Metric Name: minio.cluster
Rule | Cluster status |
---|---|
Exception during cluster monitoring | unknown |
At least one node is yellow (others are green) | yellow |
All nodes from cluster have a red or unknown status | red |
Elasticsearch¶
The supervision of Elasticsearch is done through these different stages:
- Checking the health of each Elasticsearch cluster
- Checking the health of the Elasticsearch component, based on the health of its clusters
Elasticsearch cluster monitoring¶
Metric Name: elasticsearch.cluster
Health request: curl $es_host:$es_port/_cluster/health?format=json
Node request: curl $es_host:$es_port/_cat/nodes
Rule | Cluster status |
---|---|
Exception during cluster monitoring | unknown |
Health request on cluster returns a yellow status | yellow |
Health request on cluster returns a red status | red |
Health request on cluster returns an unknown status | unknown |
A node is missing from Elasticsearch's point of view (Node request compared to declared nodes) | yellow |
Storm¶
The supervision of Storm is done through these different stages:
- Checking the health of the Storm nimbuses
- Checking the health of the Storm supervisors
- Checking the health of each Storm cluster, based on the health of its nimbuses & supervisors
- Checking the health of the Storm component, based on the health of its clusters
Storm nimbus monitoring¶
Metric Name: storm.nimbus
Health request: curl $storm_host:$storm_port/api/v1/nimbus/summary
Test | Status when test fails | Associated Alert |
---|---|---|
Fail to execute request on health endpoint | unknown | Cannot contact nimbus API |
Response from health request is not "Offline" | red | Nimbus is offline |
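To inspect the nimbus status by hand, you can filter the summary answer; the jq filter below assumes the usual nimbuses array returned by the Storm UI REST API:
# lists the status reported for each nimbus ("Leader", "Not a Leader", "Offline", ...)
curl -s "$storm_host:$storm_port/api/v1/nimbus/summary" | jq '.nimbuses[].status'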
Storm supervisor monitoring¶
Metric Name: storm.supervisor
Health request: curl $storm_host:$storm_port/api/v1/supervisor/summary
Test | Status when test fails | Associated Alert |
---|---|---|
Fail to execute request on health endpoint | unknown | Cannot contact supervisor API |
Response from health request is not empty | red | Supervisor is offline |
Storm cluster monitoring¶
Each alert message from the Storm nimbus & supervisor monitoring is reported in this document.
Metric Name: storm.cluster
Health request: curl $storm_host:$storm_port/api/v1/cluster/summary
Rule | Cluster status |
---|---|
Exception during cluster monitoring | unknown |
Cluster API cannot be reached | red |
No slots available for cluster | yellow |
At least one nimbus is green (others are unhealthy) | yellow |
All nimbuses are unhealthy | red |
One supervisor is down (others are green) | yellow |
At least two supervisors (or all) are down | red |
Spark¶
The supervision of Spark is done through these different stages:
- Checking the health of the Spark workers
- Checking the health of the Spark masters
- Checking the health of each Spark cluster, based on the health of its workers & masters
- Checking the health of the Spark component, based on the health of its clusters
Spark masters monitoring¶
Metric Name: spark.master
Health request: curl $spark_host:$spark_port/api/master
Test | Status when test fails | Associated Alert |
---|---|---|
Fail to execute request on health endpoint or empty response | red | Master is unreachable |
Spark workers monitoring¶
Metric Name: spark.worker
Health request: curl $spark_host:$spark_port/api/worker
Test | Status when test fails | Associated Alert |
---|---|---|
Fail to execute request on health endpoint or empty response | red | Worker is unreachable |
Spark cluster monitoring¶
Each alert message from the Spark worker & master monitoring is reported in this document.
Rule | Cluster status |
---|---|
Exception during cluster monitoring | unknown |
At least one master is unhealthy (others are green) | yellow |
All masters are unhealthy | red |
One worker is down (others are green) | yellow |
At least two workers (or all) are down | red |