MON Training - Track1: Monitoring the Punchplatform software framework components¶
All things to monitor, and what not to mix...¶
First, let's have a look at everything that should be monitored to ensure a good availability level of the business service.
Key points
To allow the appropriate incident management skills to be summoned when an alert pops up in the monitoring system, the monitoring checks/rules must respect a few principles:

- distinct rules for the 'platform software components' and 'channels/applications' levels:
  a misconfiguration of a Punchplatform channel or application must not trigger 'platform' rules
- alert inhibition/downgrading during start phases:
  when an operator deploys a new configuration for a channel or application, the start of said channel/application should not raise alerts until enough elements are available to confirm a working/non-working status (see the sketch after this list)
- the criticality of alerts should reflect the availability impact of the status:
  when a node is down or a task encounters periodic failures, but the overall service is still provided, the alert should indicate it. A 'Red/Unknown' status should trigger an immediate response, while a 'yellow' one can leave more time for incident management
- alerts should be triggered BEFORE the service is impacted, when possible
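As an illustration of the 'inhibition during start phases' principle, here is a minimal sketch (referenced above) of how an external channel-level check could be wrapped so that it stays silent during a grace period following a deployment. The marker file, the wrapped check and the grace duration are assumptions made for this training context, not Punchplatform commands or conventions.

```bash
#!/bin/bash
# Hypothetical wrapper illustrating alert inhibition during start phases:
# the deployment procedure is assumed to 'touch' a marker file when a channel
# is (re)started, and the check reports OK for a grace period instead of
# raising alerts on transient start-up errors.

GRACE_SECONDS=300
MARKER=/var/run/punch/last_channel_start        # assumed to be touched at deployment time
REAL_CHECK=/usr/local/bin/check_channel_health  # the normal channel-level check (hypothetical)

if [ -f "${MARKER}" ]; then
  AGE=$(( $(date +%s) - $(stat -c %Y "${MARKER}") ))
  if [ "${AGE}" -lt "${GRACE_SECONDS}" ]; then
    echo "OK - channel starting, alerting inhibited for another $(( GRACE_SECONDS - AGE ))s"
    exit 0
  fi
fi

# Outside the grace period, delegate to the real check.
exec "${REAL_CHECK}" "$@"
```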
Self-Monitoring by Punchplatform¶
It is often difficult to configure an external monitoring system, compliant with the above principles, with many rules checking the internal workings of a complex subsystem, especially because:
- Multiple components interact. Sometimes this interaction is at fault, not the components' health; therefore, to determine that a component is working, you have to perform internal health checks (API calls).
- Cluster health rules differ from a "worst of all node healths" rule.
- Channels and applications are custom-built (how many queues, criticality of this or that channel, criticality of transmission failures...).
- Things change all the time:
  new channels in the custom configuration, new channel settings (number of executors), feature or software updates. The impact of all these "minor" changes on the high-level monitoring system configuration must be limited.
To answer all this, and to avoid too many or too complex rules, the Punchplatform provides standard built-in collection and health evaluation services that deliver an aggregated view of health and alerts. See Punch monitoring for high-level automated supervision.
The Platform health API¶
Have a look at the Platform Health monitoring documentation for a view of the level of information available in the synthetic platform health document, which you can retrieve through a REST API call to the monitoring Elasticsearch instance of your platform.
Exercise: view your platform health
- Ensure your 'platform' tenant has an active platform monitoring application.
- Check that an Elasticsearch index exists containing the results of your platform health service computation.
- Do a curl to your training/test platform to view its current health status.
- Stop some redundant service (e.g. stop one of your Kafka nodes), wait 1 minute and check the health status.
Answers
- Check your platform_health application is active:

```
[operator@tpadmop01 ~]$ channelctl -t platform status --channel monitoring
channel:monitoring ...................................................................................................... ACTIVE
application:shiva:monitoring/processing/local_events_dispatcher (tenants/platform/channels/monitoring/local_events_dispatcher) .... ACTIVE
application:shiva:monitoring/processing/channels_monitoring (tenants/platform/channels/monitoring/channels_monitoring) . ACTIVE
application:shiva:monitoring/processing/platform_health (tenants/platform/channels/monitoring/platform_health) ......... ACTIVE
```
- Check that results are stored in your monitoring Elasticsearch:

```
[operator@tpadmop01 ~]$ curl tpesm01:9200/_cat/indices/platform-health*?v
health status index                      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   platform-health-2020.11.17 1kNz8ccuRDChoMY0SfUPog   1   1       1440            0    457.4kb        228.6kb
green  open   platform-health-2020.11.16 phSpLgdpRji4bpoN1cy-eg   1   1       1440            0    565.7kb        282.8kb
green  open   platform-health-2020.10.29 WCEOIF0GThqlqlAEJE8Wvw   1   1       5760            0      2.3mb          1.1mb
green  open   platform-health-2020.11.19 4HN7GoomRU2AbcSuzJgkzA   1   1       4205            0      1.4mb        763.8kb
[...]
```
- Retrieve the last platform status using curl:

```
[operator@tpadmop01 ~]$ curl -sS 'http://tpesm01:9200/platform-health-*/_search?sort=@timestamp:desc&size=1&q=platform.id:punchplatform-training-central%20AND%20@timestamp:>now-15m' | jq .
```
What the API answer looks like
(the full JSON health document returned by this query is not reproduced here; see the Platform Health monitoring documentation for its structure)
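If you only need the overall color out of that answer, you can narrow the output with jq. The field path used below is an assumption about the health document structure; check an actual document on your platform (or the Platform Health monitoring documentation) and adapt the path accordingly.

```bash
# Hypothetical field path: adapt '.platform.health' to the field that carries
# the overall color in your release's health document.
curl -sS 'http://tpesm01:9200/platform-health-*/_search?sort=@timestamp:desc&size=1&q=platform.id:punchplatform-training-central%20AND%20@timestamp:>now-15m' \
  | jq -r '.hits.hits[0]._source.platform.health'
```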
What does it mean?¶
For details on the rules and the meaning of the 'colors', please refer to Platform monitoring rules.
Key points
- Yellow usually means your service is still delivered somehow, but probably with some quality loss, which can be:
  - a reduced High Availability / data replication level
  - periodic errors/restarts, with periodic service unavailability while processes restart
  - reduced performance due to missing nodes
- So Yellow of course does not mean 'do nothing until it turns red'.
  The standard monitoring process upon 'yellow' alerts is to scan all available indicators and dashboards to check whether the service is actually delivered and whether platform stability is at risk (e.g. resource usage evolving towards total unavailability). A sketch of a color-to-severity mapping following this logic is given after this list.
- A platform 'Green' does not mean your service is being delivered! You still have layers on top of the platform framework (applications, queues, networking, custom processing...). See track 2...
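As mentioned above, here is a minimal sketch of how an external monitoring system could map the aggregated platform color to alert severities consistent with these key points (Nagios-like exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN). The jq path to the overall color is an assumption; adapt it to the actual health document of your platform.

```bash
#!/bin/bash
# Sketch of a color-to-severity mapping for an external monitoring system.
# Assumption: the overall color is exposed under '.platform.health' in the
# latest platform-health-* document; adapt the path to your release.

COLOR=$(curl -sS 'http://tpesm01:9200/platform-health-*/_search?sort=@timestamp:desc&size=1&q=@timestamp:>now-15m' \
        | jq -r '.hits.hits[0]._source.platform.health // "unknown"')

case "${COLOR}" in
  green)  echo "OK - platform framework healthy (channels/applications checked separately, see track 2)"; exit 0 ;;
  yellow) echo "WARNING - service probably still delivered but degraded; scan indicators and dashboards"; exit 1 ;;
  red)    echo "CRITICAL - platform framework failure, immediate response required";                      exit 2 ;;
  *)      echo "UNKNOWN - no recent health document; check the monitoring channel itself";                exit 3 ;;
esac
```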
And for human checks and investigations¶
Have a look at the standard dashboards for platform health and framework component status monitoring.