
ADM Training - Track3: troubleshooting basics

Generic troubleshooting process

Many kinds of issues can be encountered, so few generic tips can be provided. But there are some, and they save time and effort:

  • always go for a bottom-up approach

    Do not try to find out why a high-level component or application is not processing correctly without first checking whether alerts exist on the underlying infrastructure or on the Punch framework components.

  • check the health of the overall context

    Always check for framework error metrics (ack/fail metrics, punchline restarts) on the platform, even if they may relate to some other component of your channels.

  • use the metrics and dashboards

    When something is not working, the problem may lie in a data source or an upstream component of the overall chain. Check flows and backlogs to understand the overall picture.

  • BUILD DASHBOARDS IN ADVANCE

    All custom business processes need custom business-process monitoring dashboards based on metrics, especially event flow rates at the various stages and processing backlog/lag at each stage (a query sketch follows this list).

    When an alert occurs, you will be happy to have a nice dashboard to quickly pinpoint the problem location (infrastructure / software framework / pipelines / applications).

  • use Google search as an advanced Punchplatform documentation search engine.

    e.g. try 'punchplatform kafka unavailable troubleshooting'
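As an illustration of the kind of query such a dashboard can be built on, here is a minimal sketch of an Elasticsearch aggregation counting metric documents per minute. It assumes metrics are indexed in Elasticsearch 7 or later; the index pattern (mytenant-metrics-*), the @timestamp field and the localhost:9200 address are hypothetical placeholders to adapt to your platform.

    # Count metric documents per minute over the last hour (hypothetical index pattern)
    curl -s -X POST "http://localhost:9200/mytenant-metrics-*/_search?size=0&pretty" \
      -H 'Content-Type: application/json' -d '
    {
      "query": { "range": { "@timestamp": { "gte": "now-1h" } } },
      "aggs": {
        "per_minute": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" }
        }
      }
    }'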

Platform/Component non-nominal health troubleshooting

When a non-nominal health status of the software framework appears (for example through an external system querying the platform health service REST API), the general process is:

  1. Check the platform infrastructure health indicators in the external monitoring system (filesystem storage usage alerts, CPU % alerts, low RAM alerts, stopped/unreachable servers, and especially systemctl alerts about enabled services that are not running); a command sketch follows this list.

  2. Check the platform health dashboard (a custom one is better) for the individual health reports of the platform components. If a cluster is non-nominal, have a look at the individual health reports in the platform-monitoring-* indices to know whether a specific component instance is unhealthy (see the query sketch after this list).

  3. Use the commands from ADM Health/Status commands to manually investigate cluster health and non-nominal nodes/servers/data.

  4. Look at the logs

    All services of the framework have logs (usually in /var/log/punchplatform/). These may contain warning or error events that can help understand why things are not starting up.

    If a service does not start normally, the associated error may be logged by systemd itself, which requires using journalctl to display the logs of the associated systemd unit (see the command sketch after this list).
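For steps 1 and 4, a minimal command-line sketch of how a server can be inspected; the unit name my-punch-service is a hypothetical placeholder, and log paths may differ on your platform.

    # Step 1: list enabled services that failed to start or run
    sudo systemctl --failed
    sudo systemctl list-units --type=service --state=failed

    # Step 4: display the recent logs of a given systemd unit
    sudo journalctl -u my-punch-service --since "1 hour ago"

    # Step 4: framework daemon log files
    ls -l /var/log/punchplatform/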
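For step 2, if the command line is preferred over Kibana, a hedged sketch of a direct query on the platform-monitoring-* indices; the Elasticsearch address and the @timestamp sort field are assumptions to adapt.

    # Retrieve the most recent platform health documents
    curl -s "http://localhost:9200/platform-monitoring-*/_search?size=20&sort=@timestamp:desc&pretty"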

Channel/application non-nominal health / not working status troubleshooting

  1. Look at the process-monitoring custom dashboards

    The problem may be more general (other channels affected) or caused by the failure of an upstream component.

  2. If the non-nominal health is computed by the Channels health monitoring service, the alert may be a false positive, so check whether other applications/channels are reported as failed by the channels monitoring dashboard.

  3. Check the storm.tuple.fail metrics, uptime metrics and Kafka backlog metrics. These metrics are key to understanding where things are going wrong (a backlog check sketch follows this list):

    • is the application started?
    • is the application restarting frequently (low uptimes or no metrics)?
  4. View the application logs (through the Shiva tasks dashboard, or by following the Storm troubleshooting HOWTOs)
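As a complement to the Kafka backlog metrics mentioned in step 3, consumer lag can also be checked directly on a broker with the standard Kafka tools; the broker address and the consumer group name below are hypothetical and must be adapted to your platform.

    # List all consumer groups known to the broker
    kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

    # Show per-partition offsets and lag (backlog) for a given consumer group
    kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
      --describe --group mytenant-mychannel-group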

Slow channel troubleshooting

  1. Check for infrastructure warning alerts (CPU/RAM/storage)

    • You can also check Elasticsearch CPU/RAM usage through the _cat/nodes and _cat/allocation APIs (see the cat API sketch at the end of this list)
    • You can also check the top servers, or the average CPU/RAM/storage metrics of each population of servers, through the Kibana Metricbeat dashboards
  2. Check flow rates in upstream parts of the chain

    • Is there an unusual amount of incoming data or backlog data to process?
  3. Check for metrics indicating processing failure/retries

    • Are there frequent task restarts (low uptimes)?

    • Are there tuple failures (storm.tuple.fail metric)?

    • Is the same data indexed multiple times into Elasticsearch (see the deleted count in _cat/indices)?

  4. View the latency/processing-time metrics through the Kibana dashboards

    • This can help you understand which component of a storm-like punchline is actually taking too much time.

    • This can help you design a solution (possibly allocating more executors/CPU to this node), or understand which framework component (Elasticsearch, Kafka consumer or writer...) may have to be investigated or tuned.

  5. Check for error documents

    • In specific Kafka topics (custom channels design)
    • In Elasticsearch (error indices or error documents), which can indicate non-nominal handling/processing of documents in the applications or during indexing
  6. View the application and framework component logs

    • The punchline may report useful warnings or errors (in the central Kibana for Shiva, or on the Storm supervisor node for Storm punchlines)

    • The associated framework components may report warnings or errors in their daemon logs.

  7. Test the behaviour of the punchline in foreground mode

  8. Test the behaviour of the punchline or punchlets using punchplatform-log-injector.sh, or by replaying data from an existing Kafka topic (a replay sketch follows this list).
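For steps 1 and 3, a quick command-line sketch of the Elasticsearch cat APIs mentioned above; the Elasticsearch address is an assumption.

    # CPU, heap and load of each Elasticsearch node
    curl -s "http://localhost:9200/_cat/nodes?v&h=name,cpu,heap.percent,ram.percent,load_1m"

    # Disk usage and shard count per node
    curl -s "http://localhost:9200/_cat/allocation?v"

    # Per-index document counts; repeated indexing of the same documents
    # shows up as a growing docs.deleted count
    curl -s "http://localhost:9200/_cat/indices?v&h=index,docs.count,docs.deleted,store.size"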
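For step 8, a minimal sketch of extracting a sample from an existing Kafka topic with the standard console consumer, so that it can be replayed against the punchline under test; the broker address and topic name are hypothetical.

    # Dump the first 10 records of a topic to a local file for replay/testing
    kafka-console-consumer.sh --bootstrap-server localhost:9092 \
      --topic mytenant-mychannel-input --from-beginning --max-messages 10 > sample.txt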