Platform Monitoring

Abstract

This chapter explains how to configure the platform monitoring application.

The platform-monitoring punch application is a native application that is executed periodically to compute a synthetic platform-level health document.

For more information about the punch platform health API, please refer to the Monitoring Guide.

Overview

The platform-monitoring application is in charge of monitoring:

  • Kafka
  • Shiva
  • Storm
  • Elasticsearch
  • Gateway
  • Zookeeper
  • Spark
  • Minio
  • Clickhouse

It produces platform component monitoring documents indexed in the platform-monitoring-* Elasticsearch indices. It also computes a synthetic platform health document indexed in the platform-health-* Elasticsearch indices.
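
On a platform where these documents are written directly to Elasticsearch, you can check that the indices are being populated with a plain Elasticsearch query (host and port are assumptions, adjust them to your deployment):

curl 'localhost:9200/_cat/indices/platform-*?v'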

Depending on the platform-level reporters configured in punchplatform-deployment.settings, the documents produced by platform-monitoring are either written directly to Elasticsearch (when running on a back-office platform) or written to a Kafka topic and forwarded later to the back-office, where a central Elasticsearch back-end is used to monitor a complete fleet of platforms.

Configuration

The platform-monitoring application runs in Shiva. Define a monitoring channel in your platform tenant. Here is the channel_structure.yaml example shipped with the standalone:

version: '6.0'
start_by_tenant: true
stop_by_tenant: true
applications:

- name: platform_health
  runtime: shiva
  cluster: common
  command: platform-monitoring
  args:
  - platform_monitoring.yaml
  - --childopts
  - -Xms256m -Xmx256m

- name: local_events_dispatcher
  runtime: shiva
  cluster: common
  command: punchlinectl
  args:
  - start
  - --punchline
  - local_events_dispatcher.yaml

- name: channels_monitoring
  runtime: shiva
  cluster: common
  command: channels-monitoring
  args:
  - channels_monitoring.yaml
  - --childopts
  - -Xms256m -Xmx256m

resources:
- type: kafka_topic
  name: platform-events
  cluster: common
  partitions: 1
  replication_factor: 1

The platform-monitoring application takes a single configuration file argument.

monitoring_interval: 60
services:
- kafka
- shiva
- storm
- zookeeper
- elasticsearch
- clickhouse
- spark
- minio
- gateway

This sample configuration means that Kafka, Shiva, Storm, Zookeeper, Elasticsearch, Clickhouse, Spark, Minio and the Gateway will all be monitored every minute.

Platform tenant

For production deployments, the platform-monitoring application should be configured inside a dedicated 'platform' tenant. This allows you to set up retention and other parameters for platform monitoring data separately from business data.

A typical production example can be found in Reference Architecture configuration examples.
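
As an illustration, the channel and its application configuration files typically live under the platform tenant of your configuration tree. The layout below is only indicative, the exact root directory depends on your deployment:

tenants/platform/channels/monitoring/
├── channel_structure.yaml
├── platform_monitoring.yaml
├── channels_monitoring.yaml
└── local_events_dispatcher.yaml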

Parameters

Mandatory

  • monitoring_interval (integer)

    The interval, in seconds, between two health checks of the monitored services.

  • services (string array)

    The list of services to monitor.
    values: "kafka", "shiva", "storm", "elasticsearch", "zookeeper", "spark", "minio", "gateway", "clickhouse"

Optional

  • security (map)

    Optional TLS and credential settings for the clients used to contact the monitored services (Elasticsearch, Gateway, Zookeeper, Kafka). See the example below.

Security

Here is a complete example with security activated:

monitoring_interval: 60
services:
- kafka
- shiva
- storm
- zookeeper
- elasticsearch
- clickhouse
- spark
- minio
- gateway
security:
  elasticsearch_clients:
    es_search:
      credentials:
        username: USER
        password: PASSWORD
      ssl_enabled: true
      ssl_private_key: private_key.pem
      ssl_certificate: cert.pem
      ssl_trusted_certificate: ca.pem
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
  gateway_clients:
    common:
      ssl_enabled: true
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
      ssl_keystore_location: keystore.jks
      ssl_keystore_pass: PASSWORD
  zookeeper_clients:
    common:
      ssl_enabled: true
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
      ssl_keystore_location: keystore.jks
      ssl_keystore_pass: PASSWORD
  kafka_clients:
    common:
      ssl_enabled: true
      ssl_truststore_location: truststore.jks
      ssl_truststore_pass: PASSWORD
      ssl_keystore_location: keystore.jks
      ssl_keystore_pass: PASSWORD

Try It

On a standalone, simply start the monitoring channel as follows:

channelctl --tenant platform start --channel monitoring
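
Once the channel is running, you can check that health documents are produced. For example, query the health indices on the standalone Elasticsearch, and check the channel status with channelctl (ports, index pattern and the status subcommand syntax may differ on your platform):

curl 'localhost:9200/platform-health-*/_search?size=1&pretty'
channelctl --tenant platform status --channel monitoring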

Monitoring rules

In this section, the monitoring rules for each component are detailed. If you encounter an issue with platform monitoring, refer to this section to understand your results.

Global rules

These rules apply to all the components described below:

  • Health levels are ordered from worst to best: unknown > red > yellow > green. As a result, when monitoring a single component (e.g. storm.nimbus, spark.worker), the worst level always wins.
  • The overall health of a component (e.g. Storm, Kafka) is the worst health level among its clusters. For example, if your Kafka component is defined with 3 clusters (one red and two green), your Kafka health will be red.
  • The overall health of the platform is the worst health level among its components (e.g. Kafka, Storm). For example, if all components are green except one red component, your platform health will be red.
  • Only documents whose metric name is of the form "component.cluster" are used to build the platform health document backing the platform monitoring dashboard. The other documents are useful to investigate issues, or to build and enrich your own monitoring if needed.

Zookeeper

Zookeeper supervision is performed in the following stages:

  • Checking the health of the zookeeper nodes
  • Checking the health of the zookeeper cluster based on the health of its nodes
  • Checking the health of the zookeeper component based on the health of its clusters

Zookeeper nodes monitoring

We use a "mntr" request on each node to determine if the node is up or not

Metric Name : zookeeper.mntr

Test Status when test fail Associated Alert
Connect and send a "mntr" message to zookeeper node red Cannot contact zookeeper node
Answer from zookeeper node contains more than one line red Node answers to mntr request, but cluster is down
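
For reference, this check can be reproduced by hand with ZooKeeper's four-letter-word command, assuming the default client port 2181 and that mntr is whitelisted on the node:

echo mntr | nc $zookeeper_host 2181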

Zookeeper cluster monitoring

All alert messages from the zookeeper node monitoring are reported in this document.

Metric name: zookeeper.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
At least one zookeeper node is green and the others are red | yellow
No green zookeeper node | red
All zookeeper nodes are green | green

Kafka

Kafka supervision is performed in the following stages:

  • Checking the health of the broker nodes
  • Checking the health of the kafka cluster based on the health of its brokers
  • Checking the health of the kafka component based on the health of its clusters

Kafka broker monitoring

Metric name: kafka.broker

Test | Status when test fails | Associated alert
Connect to the Kafka broker node | red | Can't contact the broker

Kafka cluster monitoring

All alert messages from the Kafka broker monitoring are reported in this document.

Metric name: kafka.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
At least one partition has no leader or an empty ISR | red
At least one partition has an ISR list different from the replicas list | yellow
One broker is green and the others are red (and the two rules above do not apply) | yellow
All brokers are red | red
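
To investigate the leader and ISR rules manually, the standard Kafka command line tools can help. The tool name and bootstrap address below are assumptions to adapt to your installation:

kafka-topics.sh --bootstrap-server $kafka_broker:9092 --describe --under-replicated-partitions
kafka-topics.sh --bootstrap-server $kafka_broker:9092 --describe --unavailable-partitions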

Gateway

Gateway supervision is performed in the following stages:

  • Checking the health of the gateway nodes
  • Checking the health of the gateway cluster based on the health of its nodes
  • Checking the health of the gateway component based on the health of its clusters

Gateway node monitoring

Metric name: gateway.node

Health request: curl $gateway_host:$gateway_port/management/health

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint | red | Unable to get health or metrics from Gateway host
The request returns an "UNKNOWN" status | unknown | Gateway health query returns an unknown status
The request returns neither an "UNKNOWN" nor an "UP" status | red | Gateway health query did not return a green status

Gateway cluster monitoring

All alert messages from the gateway node monitoring are reported in this document.

Metric name: gateway.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
All nodes of the cluster have a red or unknown status | red
At least one node is green and the others are red or unknown | yellow

Shiva

Shiva supervision is performed in the following stages:

  • Checking the health of the shiva cluster
  • Checking the health of the shiva component based on the health of its clusters

Shiva cluster monitoring

Metric name: shiva.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
Unable to get Shiva health because the assignment topic cannot be read for this cluster | unknown
The leader did not publish any message in the assignment topic | red
The leader did not publish a message during the last 3 minutes in the assignment topic | red
A leader has been elected in the last 2 minutes | yellow
The unique Shiva worker is down | red
A Shiva worker is down (at least one is still green) | yellow
At least two Shiva workers are down | red
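
If the assignment topic rules fire, a first thing to check is whether that topic can be read at all. Below is a generic sketch using the standard Kafka console consumer; the topic name and bootstrap address are placeholders, look up the actual assignment topic of your Shiva cluster in your deployment settings:

kafka-console-consumer.sh --bootstrap-server $kafka_broker:9092 --topic <shiva_assignment_topic> --from-beginning --max-messages 5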

Clickhouse

Clickhouse supervision is performed in the following stages:

  • Checking the health of the clickhouse nodes
  • Checking the health of the clickhouse shards based on the health of their nodes
  • Checking the health of the clickhouse cluster based on the health of its shards
  • Checking the health of the clickhouse component based on the health of its clusters

Clickhouse node monitoring

Metric name: clickhouse.node

Health request: curl $clickhouse_host:$clickhouse_port/?query="SELECT%20*%20FROM%20system.clusters%20FORMAT%20JSON"

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint | red | Unable to get health from Clickhouse HTTP for node
The health request response is not 200 | unknown | HTTP response from clickhouse node is not 200
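
The URL-encoded query above is simply SELECT * FROM system.clusters FORMAT JSON. To reproduce the check by hand, you can also post the query in the request body through ClickHouse's HTTP interface (the default HTTP port 8123 is an assumption):

echo "SELECT * FROM system.clusters FORMAT JSON" | curl --data-binary @- http://$clickhouse_host:8123/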

Clickhouse shard monitoring

All alert messages from the clickhouse node monitoring are reported in this document.

Metric name: clickhouse.shard

Rule | Shard status
All nodes of the shard are unhealthy | red
At least one node of the shard is unhealthy but the others are green | yellow

Clickhouse cluster monitoring

All alert messages from the clickhouse node and shard monitoring are reported in this document.

Metric name: clickhouse.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
At least one shard is yellow (the others are green) | yellow
At least one shard is red | red

Minio

Minio supervision is performed in the following stages:

  • Checking the health of the minio nodes
  • Checking the health of the minio cluster based on the health of its nodes
  • Checking the health of the minio component based on the health of its clusters

Minio node monitoring

Metric name: minio.node

Health request: curl $minio_host:$minio_port/minio/health/cluster

Rule | Node status
Response from the health request is not 200 | red

Minio cluster monitoring

All alert messages from the minio node monitoring are reported in this document.

Metric name: minio.cluster

Rule | Cluster status
Exception during cluster monitoring | unknown
At least one node is yellow (the others are green) | yellow
All nodes of the cluster have a red or unknown status | red

Elasticsearch

Elasticsearch supervision is performed in the following stages:

  • Checking the health of the elasticsearch cluster
  • Checking the health of the elasticsearch component based on the health of its clusters

Elasticsearch cluster monitoring

Metric name: elasticsearch.cluster

Health request: curl $es_host:$es_port/_cluster/health?format=json
Node request: curl $es_host:$es_port/_cat/nodes

Rule | Cluster status
Exception during cluster monitoring | unknown
The health request on the cluster returns a yellow status | yellow
The health request on the cluster returns a red status | red
The health request on the cluster returns an unknown status | unknown
A node is missing from the Elasticsearch point of view (node request compared to the declared nodes) | yellow

Storm

Storm supervision is performed in the following stages:

  • Checking the health of the storm nimbuses
  • Checking the health of the storm supervisors
  • Checking the health of the storm cluster based on the health of its nimbuses and supervisors
  • Checking the health of the storm component based on the health of its clusters

Storm nimbus monitoring

Metric name: storm.nimbus

Health request: curl $storm_host:$storm_port/api/v1/nimbus/summary

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint | unknown | Cannot contact nimbus API
Response from the health request is not "Offline" | red | Nimbus is offline

Storm supervisor monitoring

Metric name: storm.supervisor

Health request: curl $storm_host:$storm_port/api/v1/supervisor/summary

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint | unknown | Cannot contact supervisor API
Response from the health request is empty | red | Supervisor is offline

Storm cluster monitoring

All alert messages from the storm nimbus and supervisor monitoring are reported in this document.

Metric name: storm.cluster

Health request: curl $storm_host:$storm_port/api/v1/cluster/summary

Rule | Cluster status
Exception during cluster monitoring | unknown
The cluster API cannot be reached | red
No slots available in the cluster | yellow
At least one nimbus is green (the others are unhealthy) | yellow
All nimbuses are unhealthy | red
One supervisor is down (the others are green) | yellow
At least two supervisors (or all of them) are down | red

Spark

Spark supervision is performed in the following stages:

  • Checking the health of the spark workers
  • Checking the health of the spark masters
  • Checking the health of the spark cluster based on the health of its workers and masters
  • Checking the health of the spark component based on the health of its clusters

Spark masters monitoring

Metric name: spark.master

Health request: curl $spark_host:$spark_port/api/master

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint, or empty response | red | Master is unreachable

Spark workers monitoring

Metric name: spark.worker

Health request: curl $spark_host:$spark_port/api/worker

Test | Status when test fails | Associated alert
Fail to execute the request on the health endpoint, or empty response | red | Worker is unreachable

Spark cluster monitoring

All alert messages from the spark master and worker monitoring are reported in this document.

Rule | Cluster status
Exception during cluster monitoring | unknown
At least one master is unhealthy (the others are green) | yellow
All masters are unhealthy | red
One worker is down (the others are green) | yellow
At least two workers (or all of them) are down | red