MON Training - Track1: Monitoring the Punchplatform software framework components

All things to monitor, and what not to mix...

First, let's have a look at everything that should be monitored to ensure a good availability level of the business service.

Key points

So that the appropriate incident management skills are summoned when an alert pops up in the monitoring system, the monitoring checks/rules must respect some principles:

  • distinct rules for 'platform software components' and 'channels/applications' levels:

    a misconfiguration of a punchplatform channel or application must not trigger 'platform' rules

  • alert inhibition/downgrading during start phases

    When an operator deploys a new configuration for a channel or application, its startup should not raise alerts until enough information is available to confirm a working/non-working status (see the sketch after this list).

  • alert criticality should reflect the impact on service availability

    When a node is down or a task fails periodically but the overall service is still delivered, the alert should indicate it. A 'red/unknown' status should trigger an immediate response, while a 'yellow' one can leave more time for incident management.

  • alerts should be triggered BEFORE service is impacted, when possible
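
As an illustration of the start-phase inhibition principle, here is a minimal sketch of a wrapper that an external monitoring system could run around its existing checks. The wrapper, the marker file, the grace period and the Nagois-style exit codes are illustrative assumptions, not Punchplatform features:

    #!/bin/bash
    # Hypothetical wrapper: downgrade failing channel checks during a post-deployment grace period.
    # A deployment script (or the operator) is assumed to 'touch' the marker file
    # each time a channel configuration is (re)deployed.
    MARKER=/var/tmp/monitoring/last_channel_deployment
    GRACE_SECONDS=600   # assumed 10-minute startup grace period

    # Run the real check, passed as arguments (expected to use Nagios-style exit codes 0/1/2/3).
    "$@"
    status=$?

    if [ -f "$MARKER" ]; then
        age=$(( $(date +%s) - $(stat -c %Y "$MARKER") ))
        if [ "$age" -lt "$GRACE_SECONDS" ] && [ "$status" -ne 0 ]; then
            echo "WARNING: check failed but a deployment happened ${age}s ago; alert downgraded during start phase"
            exit 1   # downgrade CRITICAL/UNKNOWN to WARNING while the channel is starting
        fi
    fi
    exit "$status"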

Self-Monitoring by Punchplatform

It is often difficult to configure an external monitoring system that complies with the above principles and contains many rules to check the internal workings of a complex subsystem, especially because:

  • Multiple components interact. Sometimes this interaction is at fault, not the components' health; therefore, to determine that a component is working, you have to do internal health checks (API calls).

  • Cluster health rules differ from "worst of all nodes' health"

  • Channels and applications are custom-built (how many queues, criticality of this or that channel, criticality of transmission failures...)

  • Things change all the time

    New channels in the custom configuration, new channel settings (number of executors), feature or software updates. The impact of all these "minor" changes on the high-level monitoring system configuration has to be kept limited.

To address all this, and to avoid too many or overly complex rules, the Punchplatform provides standard built-in collection and health evaluation services that deliver an aggregated view of health and alerts. See Punch monitoring for high-level automated supervision.

The Platform health API

Have a look at the Platform Health monitoring documentation for the level of information available in the synthetic platform health document, which you can retrieve through a REST API call to the monitoring Elasticsearch instance of your platform.

Exercise: view your platform health

  • Ensure your 'platform' tenant has an active platform monitoring application.
  • Check that an Elasticsearch index exists containing the results of your platform health service computation.
  • Do a curl to your training/test platform to view its current health status.
  • Stop some redundant service (e.g. stop one of your Kafka nodes), wait 1 minute and check the health status again.
Answers
  • Check that your platform_health application is active

    [operator@tpadmop01 ~]$ channelctl -t platform status --channel monitoring
    
          channel:monitoring ...................................................................................................... ACTIVE
          application:shiva:monitoring/processing/local_events_dispatcher  (tenants/platform/channels/monitoring/local_events_dispatcher) .... ACTIVE
          application:shiva:monitoring/processing/channels_monitoring  (tenants/platform/channels/monitoring/channels_monitoring) . ACTIVE
          application:shiva:monitoring/processing/platform_health  (tenants/platform/channels/monitoring/platform_health) ......... ACTIVE
    

  • Check that results are stored in your monitoring Elasticsearch:

    [operator@tpadmop01 ~]$ curl tpesm01:9200/_cat/indices/platform-health*?v 
    
        health status index                      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
        green  open   platform-health-2020.11.17 1kNz8ccuRDChoMY0SfUPog   1   1       1440            0    457.4kb        228.6kb
        green  open   platform-health-2020.11.16 phSpLgdpRji4bpoN1cy-eg   1   1       1440            0    565.7kb        282.8kb
        green  open   platform-health-2020.10.29 WCEOIF0GThqlqlAEJE8Wvw   1   1       5760            0      2.3mb          1.1mb
        green  open   platform-health-2020.11.19 4HN7GoomRU2AbcSuzJgkzA   1   1       4205            0      1.4mb        763.8kb
        [...]
    

  • Retrieve last platform status using curl:

    [operator@tpadmop01 ~]$ curl -sS 'http://tpesm01:9200/platform-health-*/_search?sort=@timestamp:desc&size=1&q=platform.id:punchplatform-training-central%20AND%20@timestamp:>now-15m' | jq .
    

What the API answer looks like
        {
          "took": 13,
          "timed_out": false,
          "_shards": {
            "total": 34,
            "successful": 34,
            "skipped": 0,
            "failed": 0
          },
          "hits": {
            "total": {
              "value": 28,
              "relation": "eq"
            },
            "max_score": null,
            "hits": [
              {
                "_index": "platform-health-2020.11.30",
                "_type": "_doc",
                "_id": "OyVmGnYBq30WgZ0wAH-x",
                "_score": null,
                "_source": {
                  "@timestamp": "2020-11-30T18:24:21.150Z",
                  "zookeeper": {
                    "health_code": 1,
                    "health_name": "green",
                    "clusters": {
                      "zkf": {
                        "health_code": 1,
                        "health_name": "green"
                      },
                      "zkm": {
                        "health_code": 1,
                        "health_name": "green"
                      }
                    }
                  },
                  "elasticsearch": {
                    "health_code": 2,
                    "health_name": "yellow",
                    "clusters": {
                      "es_data": {
                        "alert_messages": [
                          "These nodes : [tpesddat02_es_data] are not seen by the Elasticsearch cluster"
                        ],
                        "health_code": 2,
                        "health_name": "yellow"
                      },
                      "es_monitoring": {
                        "health_code": 1,
                        "health_name": "green"
                      }
                    }
                  },
                  "kafka": {
                    "health_code": 1,
                    "health_name": "green",
                    "clusters": {
                      "back": {
                        "health_code": 1,
                        "health_name": "green"
                      },
                      "front": {
                        "health_code": 1,
                        "health_name": "green"
                      }
                    }
                  },
                  "shiva": {
                    "health_code": 1,
                    "health_name": "green",
                    "clusters": {
                      "processing": {
                        "health_code": 1,
                        "health_name": "green"
                      },
                      "front_shiva": {
                        "health_code": 1,
                        "health_name": "green"
                      }
                    }
                  },
                  "type": "platform-health",
                  "platform": {
                    "health_code": 2,
                    "health_name": "yellow",
                    "id": "punchplatform-training-central"
                  },
                  "gateway": {
                    "health_code": 1,
                    "health_name": "green",
                    "clusters": {
                      "mycluster": {
                        "health_code": 1,
                        "health_name": "green"
                      }
                    }
                  }
                },
                "sort": [
                  1606760661150
                ]
              }
            ]
          }
        }
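
To reduce this document to a quick per-subsystem summary, the same query can be piped through a jq filter, for instance (the field names are the ones visible in the document above):

    curl -sS 'http://tpesm01:9200/platform-health-*/_search?sort=@timestamp:desc&size=1' \
      | jq '.hits.hits[0]._source
            | {platform: .platform.health_name,
               zookeeper: .zookeeper.health_name,
               elasticsearch: .elasticsearch.health_name,
               kafka: .kafka.health_name,
               shiva: .shiva.health_name,
               gateway: .gateway.health_name}'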

What does it mean?

For details on the rules and the meaning of the 'colors', please refer to Platform monitoring rules.

Key points

  • Yellow usually means your service is still delivered, but probably with some quality loss, which can be:

    • High Availability / data replication level reduction.
    • Periodic errors/restarts, with short service unavailability while processes restart
    • Performance reduction due to missing nodes
  • So yellow of course does not mean 'do nothing until it turns red' (see the severity-mapping sketch after this list).

    ==> the standard monitoring process upon 'yellow' alerts is to scan all available indicators and dashboards to check whether the service is actually delivered, and whether platform stability is at risk (e.g. resource usage evolving towards total unavailability).

  • A platform green does not mean your service is being delivered! There are still layers on top of the platform framework (applications, queues, networking, custom processing...). See track 2...
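
As an illustration of the 'yellow is not green' point above, here is a minimal sketch of how an external monitoring tool (anything that understands Nagios-style exit codes is assumed here) could turn the aggregated platform color into an alert severity. The Elasticsearch URL is the one used on the training platform; adapt it to yours:

    #!/bin/bash
    # Hypothetical check: map the aggregated platform health color to a Nagios-style status.
    ES_URL='http://tpesm01:9200/platform-health-*/_search?sort=@timestamp:desc&size=1'

    color=$(curl -sS "$ES_URL" | jq -r '.hits.hits[0]._source.platform.health_name')

    case "$color" in
        green)
            echo "OK: platform health is green"; exit 0 ;;
        yellow)
            # Yellow: service probably still delivered but degraded -> investigate, do not ignore.
            echo "WARNING: platform health is yellow, check dashboards and indicators"; exit 1 ;;
        red)
            echo "CRITICAL: platform health is red"; exit 2 ;;
        *)
            # No recent health document (or unexpected value): treat like red, immediate response.
            echo "UNKNOWN: no recent platform health status retrieved"; exit 3 ;;
    esac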

And for human checks and investigation

Have a look at the standard dashboards for platform health and framework components status monitoring.