Platform Monitoring

Abstract

This chapter explains how to configure the platform monitoring application.

The platform-monitoring punch application is a native application that is executed periodically to compute a synthetic platform-level health document.

For more information about the punch platform health API, please refer to the Monitoring Guide.

Overview

The platform-monitoring application is in charge of monitoring:

  • Kafka
  • Shiva
  • Storm
  • Elasticsearch
  • Gateway
  • Zookeeper
  • Spark
  • Minio
  • Clickhouse

It produces platform component monitoring documents indexed in the platform-monitoring-* Elasticsearch indices. It also computes a synthetic platform health document indexed in the platform-health-* Elasticsearch indices.
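
On a platform where these documents are written directly to Elasticsearch, you can check that the indices are being populated with a plain Elasticsearch query (host and port are assumptions, adjust them to your deployment):

curl 'localhost:9200/_cat/indices/platform-*?v'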

Depending on the platform-level reporters configured in punchplatform-deployment.settings, the documents produced by platform-monitoring are either written directly to Elasticsearch (when running on a back-office platform) or written to a Kafka topic and forwarded later to the back-office, where a central Elasticsearch back-end is used to monitor a complete fleet of platforms.

Configuration

The platform-monitoring application runs in Shiva. Define a monitoring channel in your platform tenant. Here is the channel_structure.yaml example shipped with the standalone:

version: '6.0'
start_by_tenant: true
stop_by_tenant: true
applications:

- name: platform_health
  runtime: shiva
  cluster: common
  command: platform-monitoring
  args:
  - platform_monitoring.yaml
  - --childopts
  - -Xms256m -Xmx256m

- name: local_events_dispatcher
  runtime: shiva
  cluster: common
  command: punchlinectl
  args:
  - start
  - --punchline
  - local_events_dispatcher.yaml

- name: channels_monitoring
  runtime: shiva
  cluster: common
  command: channels-monitoring
  args:
  - channels_monitoring.yaml
  - --childopts
  - -Xms256m -Xmx256m

resources:
- type: kafka_topic
  name: platform-events
  cluster: common
  partitions: 1
  replication_factor: 1

The platform-monitoring application takes a single configuration file argument.

monitoring_interval: 60
services:
- kafka
- shiva
- storm
- zookeeper
- elasticsearch
- clickhouse
- spark
- minio
- gateway

This sample configuration means that Kafka, Shiva, Storm, Zookeeper, Elasticsearch, Clickhouse, Spark, Minio and the Gateway will all be monitored every minute.

Platform tenant

For production deployments, the platform-monitoring application should be configured inside a dedicated 'platform' tenant. This allows you to set up retention and other parameters for platform monitoring data separately from business data.

A typical production example can be found in Reference Architecture configuration examples.
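
As an illustration, the channel and its application configuration files typically live under the platform tenant of your configuration tree. The layout below is only indicative, the exact root directory depends on your deployment:

tenants/platform/channels/monitoring/
├── channel_structure.yaml
├── platform_monitoring.yaml
├── channels_monitoring.yaml
└── local_events_dispatcher.yaml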

Parameters

Mandatory

  • monitoring_interval (integer)

    The interval, in seconds, between two health checks of the monitored services.

  • services (string array)

    The list of services to monitor.
    values: "kafka", "shiva", "storm", "elasticsearch", "zookeeper", "spark", "minio", "gateway", "clickhouse"

Optional

  • security (map)

    Optional TLS and credential settings for the clients used to contact the monitored services (Elasticsearch, Gateway, Zookeeper, Kafka). See the example below.

Security

Here is a complete example with security activated:

monitoring_interval: 60
services:
- kafka
- shiva
- storm
- zookeeper
- elasticsearch
- clickhouse
- spark
- minio
- gateway
security:
  elasticsearch_clients:
    es_search:
      credentials:
        username: USER
        password: PASSWORD
      ssl_enabled: true
      ssl_private_key: private_key.pem
      ssl_certificate: cert.pem
      ssl_trusted_certificate: ca.pem
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
  gateway_clients:
    common:
      ssl_enabled: true
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
      ssl_keystore_location: keystore.jks
      ssl_keystore_pass: PASSWORD
  zookeeper_clients:
    common:
      ssl_enabled: true
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
      ssl_keystore_location: keystore.jks
      ssl_keystore_pass: PASSWORD
  kafka_clients:
    common:
      ssl_enabled: true
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
      ssl_keystore_location: keystore.jks
      ssl_keystore_pass: PASSWORD

Try It

On a standalone, simply start the monitoring channel as follows:

channelctl --tenant platform start --channel monitoring
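
Once the channel is running, you can check that health documents are produced. For example, query the health indices on the standalone Elasticsearch, and check the channel status with channelctl (ports, index pattern and the status subcommand syntax may differ on your platform):

curl 'localhost:9200/platform-health-*/_search?size=1&pretty'
channelctl --tenant platform status --channel monitoring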

Monitoring rules

In this section, the monitoring rules for each component are detailed. If you encounter an issue with platform monitoring, refer to this section to understand your results.

Global rules

These rules apply to all the components described below:

  • Health levels are ordered from worst to best: unknown > red > yellow > green. As a result, when monitoring a single component (e.g. storm.nimbus, spark.worker), the worst level always wins.
  • The overall health of a component (e.g. Storm, Kafka) is the worst health level among its clusters. For example, if your Kafka component is defined with 3 clusters (one red and two green), your Kafka health will be red.
  • The overall health of the platform is the worst health level among its components (e.g. Kafka, Storm). For example, if all components are green except one red component, your platform health will be red.
  • Only documents whose metric name is of the form "component.cluster" are used to build the platform health document backing the platform monitoring dashboard. The other documents are useful to investigate issues, or to build and enrich your own monitoring if needed.

Zookeeper

Zookeeper supervision is performed in the following stages:

  • Checking the health of the zookeeper nodes
  • Checking the health of the zookeeper cluster based on the health of its nodes
  • Checking the health of the zookeeper component based on the health of its clusters

Zookeeper nodes monitoring

We use a "mntr" request on each node to determine if the node is up or not

Metric Name : zookeeper.mntr

Test Status when test fail Associated Alert
Connect and send a "mntr" message to zookeeper node red Cannot contact zookeeper node
Answer from zookeeper node contains more than one line red Node answers to mntr request, but cluster is down
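
For reference, this check can be reproduced by hand with ZooKeeper's four-letter-word command, assuming the default client port 2181 and that mntr is whitelisted on the node:

echo mntr | nc $zookeeper_host 2181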

Zookeeper cluster monitoring

All alert messages from the zookeeper node monitoring are reported in this document.

Metric name: zookeeper.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
At least one zookeeper node is green and the others are red | yellow
No green zookeeper node | red
All zookeeper nodes are green | green

Kafka

Kafka supervision is performed in the following stages:

  • Checking the health of the broker nodes
  • Checking the health of the kafka cluster based on the health of its brokers
  • Checking the health of the kafka component based on the health of its clusters

Kafka broker monitoring

Metric name: kafka.broker

Test | Status when test fails | Associated alert
Connect to the Kafka broker node | red | Can't contact the broker

Kafka cluster monitoring

All alert messages from the Kafka broker monitoring are reported in this document.

Metric name: kafka.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
At least one partition has no leader or an empty ISR | red
At least one partition has an ISR list different from the replicas list | yellow
One broker is green and the others are red (and the two rules above do not apply) | yellow
All brokers are red | red
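
To investigate the leader and ISR rules manually, the standard Kafka command line tools can help. The tool name and bootstrap address below are assumptions to adapt to your installation:

kafka-topics.sh --bootstrap-server $kafka_broker:9092 --describe --under-replicated-partitions
kafka-topics.sh --bootstrap-server $kafka_broker:9092 --describe --unavailable-partitions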

Gateway

Gateway supervision is performed in the following stages:

  • Checking the health of the gateway nodes
  • Checking the health of the gateway cluster based on the health of its nodes
  • Checking the health of the gateway component based on the health of its clusters

Gateway node monitoring

Metric name: gateway.node

Health request: curl $gateway_host:$gateway_port/management/health

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint | red | Unable to get health or metrics from Gateway host
The request returns an "UNKNOWN" status | unknown | Gateway health query returns an unknown status
The request returns neither an "UNKNOWN" nor an "UP" status | red | Gateway health query did not return a green status

Gateway cluster monitoring

All alert messages from the gateway node monitoring are reported in this document.

Metric name: gateway.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
All nodes of the cluster have a red or unknown status | red
At least one node is green and the others are red or unknown | yellow

Shiva

Shiva supervision is performed in the following stages:

  • Checking the health of the shiva cluster
  • Checking the health of the shiva component based on the health of its clusters

Shiva cluster monitoring

Metric name: shiva.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
Unable to get Shiva health because the assignment topic cannot be read for this cluster | unknown
The leader did not publish any message in the assignment topic | red
The leader did not publish a message during the last 3 minutes in the assignment topic | red
A leader has been elected in the last 2 minutes | yellow
The unique Shiva worker is down | red
A Shiva worker is down (at least one is still green) | yellow
At least two Shiva workers are down | red
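
If the assignment topic rules fire, a first thing to check is whether that topic can be read at all. Below is a generic sketch using the standard Kafka console consumer; the topic name and bootstrap address are placeholders, look up the actual assignment topic of your Shiva cluster in your deployment settings:

kafka-console-consumer.sh --bootstrap-server $kafka_broker:9092 --topic <shiva_assignment_topic> --from-beginning --max-messages 5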

Clickhouse

Clickhouse supervision is performed in the following stages:

  • Checking the health of the clickhouse nodes
  • Checking the health of the clickhouse shards based on the health of their nodes
  • Checking the health of the clickhouse cluster based on the health of its shards
  • Checking the health of the clickhouse component based on the health of its clusters

Clickhouse node monitoring

Metric name: clickhouse.node

Health request: curl $clickhouse_host:$clickhouse_port/?query="SELECT%20*%20FROM%20system.clusters%20FORMAT%20JSON"

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint | red | Unable to get health from Clickhouse HTTP for node
The health request response is not 200 | unknown | HTTP response from clickhouse node is not 200
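
The URL-encoded query above is simply SELECT * FROM system.clusters FORMAT JSON. To reproduce the check by hand, you can also post the query in the request body through ClickHouse's HTTP interface (the default HTTP port 8123 is an assumption):

echo "SELECT * FROM system.clusters FORMAT JSON" | curl --data-binary @- http://$clickhouse_host:8123/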

Clickhouse shard monitoring

All alert messages from the clickhouse node monitoring are reported in this document.

Metric name: clickhouse.shard

Rule | Shard status
All nodes of the shard are unhealthy | red
At least one node of the shard is unhealthy but the others are green | yellow

Clickhouse cluster monitoring

All alert messages from the clickhouse node and shard monitoring are reported in this document.

Metric name: clickhouse.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
At least one shard is yellow (the others are green) | yellow
At least one shard is red | red

Minio

Minio supervision is performed in the following stages:

  • Checking the health of the minio nodes
  • Checking the health of the minio cluster based on the health of its nodes
  • Checking the health of the minio component based on the health of its clusters

Minio node monitoring

Metric name: minio.node

Health request: curl $minio_host:$minio_port/minio/health/cluster

Rule | Node status
Response from the health request is not 200 | red

Minio cluster monitoring

All alert messages from the minio node monitoring are reported in this document.

Metric name: minio.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
At least one node is yellow (the others are green) | yellow
All nodes of the cluster have a red or unknown status | red

Elasticsearch

Elasticsearch supervision is performed in the following stages:

  • Checking the health of the elasticsearch cluster
  • Checking the health of the elasticsearch component based on the health of its clusters

Elasticsearch cluster monitoring

Metric name: elasticsearch.cluster

Health request: curl $es_host:$es_port/_cluster/health?format=json
Node request: curl $es_host:$es_port/_cat/nodes

Rule | Cluster status
Exception during cluster monitoring | unknown
The health request on the cluster returns a yellow status | yellow
The health request on the cluster returns a red status | red
The health request on the cluster returns an unknown status | unknown
A node is missing from the Elasticsearch point of view (node request compared to the declared nodes) | yellow

Storm

Storm supervision is performed in the following stages:

  • Checking the health of the storm nimbuses
  • Checking the health of the storm supervisors
  • Checking the health of the storm cluster based on the health of its nimbuses and supervisors
  • Checking the health of the storm component based on the health of its clusters

Storm nimbus monitoring

Metric name: storm.nimbus

Health request: curl $storm_host:$storm_port/api/v1/nimbus/summary

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint | unknown | Cannot contact nimbus API
Response from the health request is not "Offline" | red | Nimbus is offline

Storm supervisor monitoring

Metric name: storm.supervisor

Health request: curl $storm_host:$storm_port/api/v1/supervisor/summary

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint | unknown | Cannot contact supervisor API
Response from the health request is empty | red | Supervisor is offline

Storm cluster monitoring

All alert messages from the storm nimbus and supervisor monitoring are reported in this document.

Metric name: storm.cluster

Health request: curl $storm_host:$storm_port/api/v1/cluster/summary

Rule | Cluster status
Exception during cluster monitoring | unknown
The cluster API cannot be reached | red
No slots available in the cluster | yellow
At least one nimbus is green (the others are unhealthy) | yellow
All nimbuses are unhealthy | red
One supervisor is down (the others are green) | yellow
At least two supervisors (or all of them) are down | red

Spark

Spark supervision is performed in the following stages:

  • Checking the health of the spark workers
  • Checking the health of the spark masters
  • Checking the health of the spark cluster based on the health of its workers and masters
  • Checking the health of the spark component based on the health of its clusters

Spark masters monitoring

Metric name: spark.master

Health request: curl $spark_host:$spark_port/api/master

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint, or empty response | red | Master is unreachable

Spark workers monitoring

Metric name: spark.worker

Health request: curl $spark_host:$spark_port/api/worker

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint, or empty response | red | Worker is unreachable

Spark cluster monitoring

All alert messages from the spark master and worker monitoring are reported in this document.

Rule | Cluster status
Exception during cluster monitoring | unknown
At least one master is unhealthy (the others are green) | yellow
All masters are unhealthy | red
One worker is down (the others are green) | yellow
At least two workers (or all of them) are down | red