
Troubleshooting a Channel that is not processing documents

Why do that

  • There are no platform alerts (neither infrastructure or system-level alerts from supervision, nor applicative platform alerts from the punchplatform REST API)

BUT

  • (part of) a channel seems to do nothing (no metrics, or a rate of 0, possibly with a rising backlog)

What to do

1 - Determine the actual failure point

If the channel is made of multiple successive topologies (e.g. LTR_IN => LTR_OUT => LMR_IN => PROCESSING => OUTPUT), use the metrics from Kibana to determine which is the FIRST topology in the chain that is not processing correctly. Topologies further down the chain may look idle simply because they no longer receive anything from it.
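The search for the first failure point can be sketched as follows. This is only an illustration: the function name `first_failing_topology` and the `observed` health values are hypothetical, and the per-topology health verdict is assumed to come from the checks described in this step.

```python
from typing import Callable, Optional, Sequence


def first_failing_topology(
    chain: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first topology of the chain that is not healthy, or None."""
    for name in chain:
        if not is_healthy(name):
            return name
    return None


# Chain taken from the example above, in processing order.
chain = ["LTR_IN", "LTR_OUT", "LMR_IN", "PROCESSING", "OUTPUT"]

# Hypothetical verdicts, as noted down while reading the Kibana metrics.
observed = {
    "LTR_IN": True,
    "LTR_OUT": True,
    "LMR_IN": False,
    "PROCESSING": False,
    "OUTPUT": False,
}

print(first_failing_topology(chain, lambda name: observed[name]))  # prints LMR_IN
```

Here PROCESSING and OUTPUT also look unhealthy, but only LMR_IN needs to be investigated first.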

To determine whether a topology is working correctly, check either:

  • in the metrics Kibana/Elasticsearch, we have
    • recent metrics in -metrics-* (less than a few minutes old)
    • The 'storm.worker.uptime.count' metric (seconds) from the topology indicates the topology has been running for more than 5 minutes
    • The 'storm.tuple.fail.m1_rate' metrics from the topology spouts indicate there are no recent failures (value < 1 failure / s)
    • The 'storm.tuple.ack.m1_rate' metrics from the topology spouts indicate that some documents have recently been processed successfully (value > 0)

OR

  • in the Storm GUI (see punchplatform.properties to identify UI host and port for the storm cluster that hosts the topology) :
    • the topology is listed in ACTIVE status
    • its number of workers is >= 1
    • after clicking on the topology link, then on '10m0s' in its 'Topology stats' section, the page indicates:
      • an uptime > 5 minutes in each line of the "Worker Resources" section
      • only 0 (zeroes) in the 'Failed' column of
        • the '10m' line in the 'Topology stats' section
        • the Spouts section
        • the Bolts section
      • nothing displayed in the 'Last error' column of the "Spouts" and 'Bolts' sections
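The metrics-based criteria above can be written down as a small check. This is a minimal sketch, not a platform tool: the `TopologyMetrics` type and `health_problems` function are hypothetical, the values are assumed to have been read manually from Kibana, and the 180-second freshness limit is an assumed reading of "a few minutes".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TopologyMetrics:
    last_metric_age_s: float   # age of the most recent metric document
    worker_uptime_s: float     # 'storm.worker.uptime.count'
    spout_fail_m1_rate: float  # 'storm.tuple.fail.m1_rate' (failures / s)
    spout_ack_m1_rate: float   # 'storm.tuple.ack.m1_rate' (acks / s)


MAX_METRIC_AGE_S = 180  # assumption: "a few minutes" taken as 3 minutes
MIN_UPTIME_S = 300      # 5 minutes, as stated above
MAX_FAIL_RATE = 1.0     # 1 failure / s, as stated above


def health_problems(m: TopologyMetrics) -> List[str]:
    """Return the list of criteria that are not met (empty means healthy)."""
    problems = []
    if m.last_metric_age_s > MAX_METRIC_AGE_S:
        problems.append("metrics are stale")
    if m.worker_uptime_s <= MIN_UPTIME_S:
        problems.append("worker uptime is 5 minutes or less (recent restart)")
    if m.spout_fail_m1_rate >= MAX_FAIL_RATE:
        problems.append("spouts report recent failures")
    if m.spout_ack_m1_rate <= 0:
        problems.append("spouts report no recent acks")
    return problems


# Hypothetical readings for a topology that keeps restarting.
sample = TopologyMetrics(
    last_metric_age_s=30,
    worker_uptime_s=40,
    spout_fail_m1_rate=0.0,
    spout_ack_m1_rate=0.0,
)
for problem in health_problems(sample):
    print(problem)
```

A short uptime combined with no acks, as in this sample, usually points to the cases covered in steps 3 and 4 below.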

2 - If the first failure point topology is not listed in the storm UI

The topology has not been submitted to the Storm cluster. Please use punchplatform-channel.sh --start <tenant>/<channel>/<cluster>/<topologyName> to start it. If the topology is said to be already running although you do not see it in the UI, then use --stop on this topology, and try again.
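As an illustration, the command lines for this step can be assembled as shown below. The tenant, channel, cluster and topology names are hypothetical, and passing the same path to `--stop` as to `--start` is an assumption; the sketch only builds and prints the commands, it does not run them.

```python
from typing import List

SCRIPT = "punchplatform-channel.sh"


def topology_path(tenant: str, channel: str, cluster: str, topology: str) -> str:
    """Build the <tenant>/<channel>/<cluster>/<topologyName> argument."""
    return "/".join([tenant, channel, cluster, topology])


def start_command(path: str) -> List[str]:
    return [SCRIPT, "--start", path]


def stop_command(path: str) -> List[str]:
    # Assumption: --stop accepts the same path format as --start.
    return [SCRIPT, "--stop", path]


def stop_then_start(path: str) -> List[List[str]]:
    """Commands to use when the topology is reported as already running."""
    return [stop_command(path), start_command(path)]


# Hypothetical names, for illustration only.
path = topology_path("mytenant", "apache", "main", "processing")
for command in stop_then_start(path):
    print(" ".join(command))
```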

3 - If the topology is listed in the storm UI, but no detail is available (acks, fails...)

Either:

  • no worker has been assigned by the storm cluster (maybe no more slots available) ==> check this by looking at the 'free slots' count in the main page of Storm UI for this cluster

  • The topology is failing too fast after each start, so that no stats or reason for failure is displayed
    ==> look into the worker logs to determine why the topology is not starting ( cf. Locating the Storm worker log file ). If the logs show only the first startup messages of the topology and no error, try raising the memory allocated to the topology (512 MB for example)

4 - If there is a "last error" displayed in the topology page of Storm UI, in the Spouts/Bolts section

  • if an "Out of Memory" error is displayed, check the current configuration file for the topology, and tune its settings to either reduce the memory needed (by reducing 'topology.max_spout_pending' for example, or the 'executors' settings), or raise the Xmx and Xms values in 'topology.worker.childopts'

  • if some other error is displayed, retrieve the whole error stack trace and context from the worker log ( cf. Locating the Storm worker log file ). Depending on the error, solve the underlying problem (access to a network or storage resource, lack of storage space, unavailable kafka topic...) and restart the topology, or escalate to the product support team if this seems to be a software bug.

5 - If there are non-0 "Failed" counters in the last 10 minutes

  • Identify the Bolt or Spout that has the non-0 counters
  • If a bolt has 'Failed', then it means this specific bolt has encountered an error condition. Retrieve the storm worker log of the topology to obtain more detail/context about this error ( cf. Locating the Storm worker log file ).

    E.g. If this is a kafka bolt, check the kafka cluster health, the topics availability. Please note that if a kafka bolt is set to require immediate resilience when writing to kafka ( "producer.acks": "all" setting), then it will require availability of both copies of each target topic partition, not only one. This can be changed by putting "1" instead of "all", to allow the writing of documents into Kafka even if one kafka node is down.

  • If a spout has 'Failed', but NO BOLT has 'Failed', then it means some documents were not fully processed before the Storm timeout elapsed ('topology.message.timeout.secs' setting of the topology). This may be caused by a problem at the target/output of the topology (e.g. Elasticsearch cluster, network link / remote Storm topology, or Kafka overload/partial unavailability). Investigate the health of this target component. You can also retrieve the Storm worker log to see whether it gives an explicit reason for the lack of output.
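The reasoning of this step can be summarised in a small helper. It is a sketch only: the function name and the component names are hypothetical, and the counters are assumed to have been copied from the 'Failed' columns of the Spouts and Bolts sections for the last 10 minutes.

```python
from typing import Dict


def diagnose_failed_counters(
    spout_failed: Dict[str, int], bolt_failed: Dict[str, int]
) -> str:
    """Map the 'Failed' counters of a topology to the next action."""
    failing_bolts = sorted(name for name, n in bolt_failed.items() if n > 0)
    failing_spouts = sorted(name for name, n in spout_failed.items() if n > 0)

    if failing_bolts:
        # A bolt with failures has hit an error condition itself.
        return "bolt error in " + ", ".join(failing_bolts) + ": read the worker log"
    if failing_spouts:
        # Spout failures without bolt failures mean tuples timed out.
        return "timeout: check the health of the topology output/target"
    return "no failures: check the 'Acked' counters instead"


# Hypothetical counters: the spout fails but no bolt does.
print(
    diagnose_failed_counters(
        spout_failed={"kafka_spout": 1200},
        bolt_failed={"punch_bolt": 0, "elasticsearch_bolt": 0},
    )
)
```

Bolts are tested first because a failing bolt also makes the spout report failures, so the spout counter alone does not tell the two cases apart.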

6 - If there are only 0 in 'Failed', but (almost) nothing in 'Acked' in the last 10 minutes

  • Check whether there is an issue with the source of documents for this topology (e.g. a network issue).
  • If possible, inject test documents using punchplatform-log-injector.sh
  • If the topology spout is a Kafka spout,
    • check kafka cluster health, and availability of all partitions of the input topic
    • ensure that the 'kafka.spout.backlog.fetch' metric is >0 and increasing for this kafka spout in the metrics back end (otherwise, it means no message is actually coming into this topic)
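The last check can be expressed as a simple test on successive values of the metric. This is a minimal sketch: `backlog_is_growing` is a hypothetical helper, and the samples are assumed to be values of 'kafka.spout.backlog.fetch' read from the metrics back end, ordered from oldest to newest.

```python
from typing import Sequence


def backlog_is_growing(samples: Sequence[float]) -> bool:
    """True if the latest value is > 0 and higher than the oldest one."""
    if len(samples) < 2:
        # A single value cannot show a trend.
        return False
    return samples[-1] > 0 and samples[-1] > samples[0]


# Hypothetical readings taken a few minutes apart.
print(backlog_is_growing([1500, 1800, 2400]))  # prints True
print(backlog_is_growing([0, 0, 0]))           # prints False
```

A result of False suggests that no message is actually coming into the input topic, so the investigation should move upstream of this topology.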