# ADM Training - Track3: troubleshooting basics

## Generic troubleshooting process
Many kinds of issues can be encountered, so only a few generic tips can be given. But there are some, and they are time/effort-saving:

- **Always go for a bottom-up approach.**
  Do not try to find out why a high-level component or application is not processing correctly without first checking whether alerts exist on the underlying infrastructure or on the punch framework components.

- **Check the health of the overall context.**
  Always check for framework error metrics (ack/fail metrics, punchline restarts) existing on the platform, even though they may relate to some other component of your channels.

- **Use the metrics and dashboards.**
  When something is not working, a data source in a previous component of the overall chain may be the problem. Check flows and backlogs to understand the overall picture (see the query sketch after this list).

- **BUILD DASHBOARDS IN ADVANCE.**
  All custom business processes need custom business-process monitoring dashboards based on metrics (especially events flow rate at various stages, and processing backlog/lag at each stage).
  When an alert occurs, you will be happy to have a nice dashboard to quickly pinpoint the problem location (infrastructure / software framework / pipelines / applications).

- **Use Google search as an advanced Punchplatform documentation search engine.**
  e.g. try 'punchplatform kafka unavailable troubleshooting'
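
As an illustration of the kind of metrics query such a dashboard panel can be built on, here is a minimal sketch counting fail metrics over time. The Elasticsearch endpoint, the `*-metrics-*` index pattern and the `name` field are assumptions: adapt them to your platform's metrics mapping.

```sh
# Minimal sketch: count storm.tuple.fail metrics per minute over the last hour.
# ASSUMPTIONS: Elasticsearch reachable on localhost:9200, metrics stored in
# '*-metrics-*' indices with 'name' and '@timestamp' fields.
# 'fixed_interval' requires Elasticsearch 7+.
curl -s -H 'Content-Type: application/json' \
  'http://localhost:9200/*-metrics-*/_search?size=0&pretty' -d '{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "name": "storm.tuple.fail" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "fails_per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" }
    }
  }
}'
```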
## Platform/Component non-nominal health troubleshooting

When a software framework non-nominal health status appears (for example through an external system querying the platform health service REST API), the general process is:
- Check the platform infrastructure health indicators in the external monitoring system (filesystem storage usage alerts, CPU % alerts, low RAM alerts, stopped/unreachable servers, and especially systemctl alerts about enabled services that are not running). See the command sketch after this list.

- Check the **platform health** dashboard (a custom one is better) for the individual health reports of the platform components. If a cluster is non-nominal, have a look at the individual health reports in the `platform-monitoring-*` indices to know whether a specific instance of a component is unhealthy.

- Use the commands from ADM Health/Status commands to manually investigate the clusters health and the non-nominal nodes/servers/data.

- **Look at the logs.**
  All services of the framework have logs (usually in `/var/log/punchplatform/`). These may contain warning or error events that can help understand why things are not starting up. If a service does not start normally, the associated error may be logged by systemd itself, therefore requiring `journalctl` to display the logs of the associated systemd unit.
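
A minimal shell sketch of these first checks, assuming shell access to the suspect server and an Elasticsearch monitoring backend reachable on localhost:9200 (the unit name and the endpoint are examples to adapt):

```sh
# Basic infrastructure checks on a suspect server.
df -h          # filesystem storage usage
free -h        # available RAM
uptime         # load average

# Enabled services that are not running, and the logs of a failing unit
# (the unit name below is only an example):
systemctl --failed
journalctl -u kafka.service -n 100 --no-pager

# Service logs of the punch framework daemons:
ls /var/log/punchplatform/

# Latest individual health reports of the platform components
# (the Elasticsearch endpoint is an assumption):
curl -s 'http://localhost:9200/platform-monitoring-*/_search?size=5&sort=@timestamp:desc&pretty'
```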
## Channel/application non-nominal health / not-working status troubleshooting
- Look at the process-monitoring custom dashboards.
  Maybe the problem is more general (other channels affected) or caused by a failure of an upstream component.

- First, if the non-nominal health is computed by the Channels health monitoring service, the alert may be a false positive: check whether other applications/channels are reported as failed by the channels monitoring dashboard.

- Check the `storm.tuple.fail` metrics, the `uptime` metrics and the Kafka backlog metrics (see the sketch after this list). These metrics are key to understanding where things are going wrong:
    - is the application started?
    - is the application restarting frequently (low uptimes or no metrics)?

- View the application logs (through the Shiva tasks dashboard, or following the Storm troubleshooting HOWTOs).
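
A hedged sketch of those checks from the command line, assuming an Elasticsearch metrics backend on localhost:9200 and a Kafka broker on localhost:9092 (the index pattern, the endpoints and the consumer group name are assumptions):

```sh
# Are fail/uptime metrics still being published? No recent metrics usually
# means the application is down or restarting in a loop.
# ASSUMPTION: metrics are indexed in '*-metrics-*' indices.
curl -s 'http://localhost:9200/*-metrics-*/_search?q=storm.tuple.fail&size=3&sort=@timestamp:desc&pretty'
curl -s 'http://localhost:9200/*-metrics-*/_search?q=uptime&size=3&sort=@timestamp:desc&pretty'

# Kafka backlog: the LAG column shows how far behind the punchline's consumer
# group is (the group name below is hypothetical, list the groups first).
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <punchline_consumer_group>
```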
## Slow channel troubleshooting
- Check for infrastructure warning alerts (CPU/RAM/storage).
    - You can also check Elasticsearch CPU/RAM usage through the `_cat/nodes` and `_cat/allocation` APIs (see the sketch after this list).
    - You can also check the top servers / group averages of the CPU/RAM/STORAGE metrics of each population of servers through the Kibana metricbeat dashboards.

- Check the flow rates in the upstream parts of the chain.
    - Is there an unusual amount of incoming data or backlog data to process?

- Check for metrics indicating processing failures/retries.
    - Are there frequent task restarts (low uptimes)?
    - Are there tuple failures (`storm.tuple.fail` metric)?
    - Is the same data indexed multiple times into Elasticsearch (see the `deleted` count in `_cat/indices`)?

- View the latencies/processing-times metrics through a Kibana dashboard.
    - This can help understand which component of a storm-like punchline is actually taking too much time.
    - This can help design a solution (possibly allocating more executors/CPU to this node) or understand which framework component (Elasticsearch, Kafka consumer or writer...) may have to be investigated or tuned.

- Check for error documents:
    - in specific Kafka topics (custom channels design);
    - in Elasticsearch (error indices or error documents), which can indicate non-nominal handling/processing of documents in the applications or during indexing.

- View the application and framework component logs.
    - The punchline may report useful warnings or errors (in the central Kibana for Shiva, or on the Storm supervisor node for Storm punchlines).
    - The associated framework components may report warnings or errors in their daemon logs.

- Test the behaviour of the punchline in foreground mode.

- Test the behaviour of the punchline or punchlets using `punchplatform-log-injector.sh` or by replaying data from an existing Kafka queue (see the sketches at the end of this section).
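
For the Elasticsearch side of these checks, the `_cat` APIs mentioned above give a quick view of resource usage and of suspicious re-indexing; a minimal sketch, assuming Elasticsearch is reachable on localhost:9200:

```sh
# CPU, heap and load per Elasticsearch node:
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m'

# Disk usage and shard count per node:
curl -s 'http://localhost:9200/_cat/allocation?v'

# A docs.deleted count growing much faster than docs.count often means the
# same documents are indexed several times (replays or tuple failures):
curl -s 'http://localhost:9200/_cat/indices?v&h=index,docs.count,docs.deleted,store.size'
```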
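For the foreground tests, here is a hedged sketch using the injector mentioned above; the injection file name is only an example, and the flag should be checked against `punchplatform-log-injector.sh --help` on your release:

```sh
# Replay a controlled stream of test events against the input of the channel.
# The injection file name is hypothetical; the '-c' flag is recalled from the
# punch tooling and should be verified with --help on your platform.
punchplatform-log-injector.sh -c ./my_injection_file.json

# To run the punchline itself in foreground, use the foreground start command
# of your punch release (for example 'punchlinectl' on recent releases; check
# the exact command and flags in your version's documentation).
```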