MON Training - Track 2: Monitoring the channels and applications¶
Channels health vs working business-level application service¶
Have a look at the types of dashboards that a human operator may need in order to monitor application health, and why both standard Punch dashboards and custom dashboards (designed by the solution integrator or by the business service delivery manager) are needed.
Key points of high-level/business-level service delivery health
This is because the health of an application can be examined bottom-up (a minimal probe sketch follows this list):

- Are the individual technical components underlying the end-to-end application started and running?
    - the individual application-specific processes (punchlines) for each part of the service (input, forwarding/receiving, parsing, indexing application processes...)
    - the platform framework components used by these processes (Kafka queue daemons, the Elasticsearch cluster...)
    - the availability of the external resources used by these processes (network connections, target systems for forwarding, sources of events, firewalling to external resources...)
- Even if all of them seem to be running and stable from an OS-level technical point of view (i.e. not restarting every now and then), are they actually providing their end-to-end function?
    - Is the data flowing from the entry points to the end points (user HMI, database, archiving filesystem...)?
    - Is the throughput satisfactory? Do we have processing lag/backlog?
    - Is the end-to-end data processing taking too long (i.e. processing latency from entry point to end point)?
    - Are we using too many resources / nearing the platform's maximum capacity?
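As a minimal illustration of the lowest-level check (are the framework components up?), the sketch below asks Elasticsearch for its cluster health through its standard REST endpoint; only the URL is an assumption to adapt to your platform. Equivalent probes can be written for the other framework components and external resources.

```python
# Minimal sketch: the Elasticsearch URL is an assumption for your platform.
import requests

ES_URL = "http://localhost:9200"   # Elasticsearch entry point (assumption)

# Standard Elasticsearch cluster health endpoint: status is green, yellow or red.
resp = requests.get(f"{ES_URL}/_cluster/health", timeout=10)
resp.raise_for_status()
health = resp.json()

print(f"cluster '{health['cluster_name']}' status: {health['status']}")
if health["status"] == "red":
    # Red means some primary shards are not allocated: data is not fully available.
    print("CRITICAL: Elasticsearch cluster is RED")
```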
Part of the underlying platform can be monitored and troubleshot through an overall infrastructure monitoring system (out of Punchplatform scope) or through the Punchplatform health monitoring tools (cf. MON training Track 1 or the Platform Monitoring introduction).
The next chapters cover the higher levels of monitoring (generic and application-specific monitoring).
Generic channels application health monitoring¶
The low-level "technical" health of the application processes can be generically monitored through some generic rules, based on Punchplatform metrics. Have a look at the Channels application generic health API.
These rules are implemented by deploying the Channels Monitoring Service
Key points
- Generic channels monitoring should be deployed, and its synthetic result should be monitored by the high-level supervision subsystem (Nagios, Zabbix, Centreon, Prometheus...), as in the sketch after this list.
- Generic channels monitoring is to be configured independently for each tenant.
- Channels monitoring applies only to started applications, and is only based on the uptime and return codes of the application workers/processes. If the collection and centralization of platform events/metrics into Elasticsearch is not properly configured and working, this channels monitoring will not work, or will not be accurate.
  So the supervision system should also have rules to detect that a given Punchplatform monitoring has not produced records recently (i.e. a lack of recent monitoring records should trigger an alert).
- You can view the channels monitoring results in the standard Punch dashboards.
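As a minimal sketch of how a supervision subsystem can consume this synthetic result, the example below queries Elasticsearch over its REST API for the most recent channels-monitoring record and alerts both when the reported health is degraded and when no recent record exists (i.e. the monitoring itself is silent). The index pattern and the `health` field name are assumptions to adapt to the monitoring indices actually produced by your platform.

```python
# Sketch only: index pattern and field names are assumptions, adapt them
# to the monitoring indices actually produced by your platform.
import datetime
import requests

ES_URL = "http://localhost:9200"                 # Elasticsearch entry point (assumption)
INDEX = "mytenant-channels-monitoring-*"         # hypothetical channels-monitoring index pattern
MAX_AGE = datetime.timedelta(minutes=5)          # alert if no record newer than this

# Fetch the most recent synthetic health record for the tenant.
query = {
    "size": 1,
    "sort": [{"@timestamp": {"order": "desc"}}],
}
resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=10)
resp.raise_for_status()
hits = resp.json()["hits"]["hits"]

if not hits:
    print("CRITICAL: no channels-monitoring record found at all")
else:
    doc = hits[0]["_source"]
    ts = datetime.datetime.fromisoformat(doc["@timestamp"].replace("Z", "+00:00"))
    age = datetime.datetime.now(datetime.timezone.utc) - ts
    if age > MAX_AGE:
        # The monitoring service itself is stuck, or the metrics pipeline is broken.
        print(f"CRITICAL: last channels-monitoring record is {age} old")
    elif doc.get("health") != "green":           # 'health' field name is an assumption
        print(f"WARNING: channels health is {doc.get('health')}")
    else:
        print("OK: channels health is green and up to date")
```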
Custom application metrics automatic monitoring¶
Have a look at why it is not sufficient to monitor generic metrics, and how the high-level service can be monitored.
And especially keep in mind that the Punch monitoring services themselves should also be monitored.
Key points
- A channels application process may be running with a GREEN health level while the process is in fact stuck, too slow/lagging, or encountering processing errors.
- The solution/platform integrator therefore has to implement custom supervision/alerting rules. Most often:
    - monitor the Kafka consumers backlog to avoid data loss (detecting stuck channels applications),
    - monitor storm.tuple.fail.m1_rate to detect replays or stuck channels applications due to error conditions or misconfiguration.
- Custom supervision alerting can be based on the Elasticsearch REST API to query the centralized Punchplatform metrics, as in the sketch below.
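As an illustration of such a rule, the sketch below queries the centralized metrics through the Elasticsearch REST API and checks the recent storm.tuple.fail.m1_rate. The metrics index pattern and the flat field layout are assumptions: check the actual mapping of your platform's metrics indices before reusing it. The same pattern applies to a Kafka backlog alert, by aggregating the relevant backlog field instead.

```python
# Sketch only: the metrics index pattern and field layout are assumptions,
# check your platform's actual metrics mapping before reusing this rule.
import requests

ES_URL = "http://localhost:9200"            # Elasticsearch entry point (assumption)
METRICS_INDEX = "mytenant-metrics-*"        # hypothetical centralized metrics index pattern
FAIL_RATE_THRESHOLD = 0.0                   # any sustained tuple failure is worth an alert

# Maximum storm.tuple.fail.m1_rate observed over the last 10 minutes.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-10m"}}},
    "aggs": {
        "max_fail_rate": {"max": {"field": "storm.tuple.fail.m1_rate"}}
    },
}
resp = requests.post(f"{ES_URL}/{METRICS_INDEX}/_search", json=query, timeout=10)
resp.raise_for_status()
max_fail_rate = resp.json()["aggregations"]["max_fail_rate"]["value"]

if max_fail_rate is None:
    # No metric at all in the time window: the metrics pipeline itself may be down.
    print("CRITICAL: no storm.tuple.fail.m1_rate metric received in the last 10 minutes")
elif max_fail_rate > FAIL_RATE_THRESHOLD:
    print(f"WARNING: tuples are failing (max m1_rate={max_fail_rate})")
else:
    print("OK: no tuple failure observed")
```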
Custom application/service dashboards¶
When a synthetic alert is raised in your supervision system, an MCO operator will need to understand and pinpoint the actual trouble in the whole chain, because most often multiple alerts will be present at the same time (e.g. an infrastructure load alert, plus restarting applications at various stages, plus end users complaining that no data is displayed in Kibana).
In such a multiple-alerts situation, the root cause is often found in the lowest layer (e.g. check the infrastructure before the software framework, and the software framework before the application processes).
But once the underlying layers are confirmed to be in good condition, there still remains the task of finding the first stage of the processing chain that is not behaving nominally and that may explain the downstream alerts.
The easiest way for a human to do that is to have a dashboard designed to give a high-level view of the whole process/solution and to provide key indicators (much like a supervision dashboard in an industrial facility such as a production chain or a power plant).
Have a look at the key metrics that usually need supervising.
What is a 'good' high-level custom application/service-monitoring dashboard?
A typical "classical" dashboard graphs over time :
- the processing backlog of all distinct retention stages (at the collector sites, before processing, before indexing, before archiving...)
- the cumulated processing rate (EPS) of all distinct stages (input at collector site, forwarding at collector site, input at central site, parsing, archiving, indexing, forwarding to/from dual site...).
- the application-specific documents indexing rate (taken from the user-data Elasticsearch indices)
- the generic Punch error/attention-needing metrics (tuple failures, low uptimes)
- the application-specific error metrics (e.g. the number of badly parsed logs that have been handled and indexed in the user-data Elasticsearch indices)
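As an example of one such panel, the sketch below computes the application-specific indexing rate over the last 24 hours directly from a user-data Elasticsearch index with a plain date_histogram aggregation (essentially what a Kibana visualization does for you). The index pattern and timestamp field are assumptions to adapt to your own data.

```python
# Sketch only: the user-data index pattern and timestamp field are assumptions.
import requests

ES_URL = "http://localhost:9200"          # Elasticsearch entry point (assumption)
DATA_INDEX = "mytenant-events-*"          # hypothetical user-data index pattern

# Count indexed documents per 5-minute bucket over the last 24 hours,
# i.e. the raw material of an "indexing rate over time" dashboard panel.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-24h"}}},
    "aggs": {
        "per_bucket": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "5m"}
        }
    },
}
resp = requests.post(f"{ES_URL}/{DATA_INDEX}/_search", json=query, timeout=10)
resp.raise_for_status()

for bucket in resp.json()["aggregations"]["per_bucket"]["buckets"]:
    eps = bucket["doc_count"] / 300.0     # 5 minutes = 300 seconds
    print(f"{bucket['key_as_string']}: {eps:.1f} docs/s")
```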