Troubleshooting Kafka alerts

Why do that

The PunchPlatform admin GUI has reported a RED or YELLOW status for a Kafka cluster, or Nagios has reported such an error status (forwarded from the PunchPlatform admin service).

What to do

  • Use the PunchPlatform admin UI to determine which cluster is in a non-nominal condition, and which topics are impacted.
  • Check the server state and resources of the VMs of the affected cluster (front: tnlsockaf01p-03p), using Nagios and the Grafana « servers » dashboard.
  • If nothing wrong is found using these tools, run the « kafka-topic --describe » command on an impacted topic. This will reveal the ID(s) of the non-working broker(s).
  • If NO broker in the cluster answers:
    • Check the ZooKeeper status and the registered broker IDs.
    • Check the « supervisorctl status » output on the brokers.
    • Check the Kafka broker logs for errors.
    • Check communication between brokers (using telnet from one broker VM to the broker port of another).
  • If only some broker ID(s) are missing:
    • Determine which broker has this ID from the ZooKeeper records.
    • Check the broker logs.
    • Check system resources (/data, memory).
    • Try a broker restart (sudo supervisorctl restart kafka-front).
    • Using telnet, check communication with ZooKeeper and between brokers.
    • Try a full server restart.
  • If a broker is diagnosed as unavailable and cannot be fixed quickly, manually assign new data replicas to the remaining Kafka brokers of the cluster to regain resilience (the procedure is the same as explained in HOWTO alter existing kafka topics).
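
Two of the checks above (finding the non-working broker IDs in the describe output, and testing connectivity as telnet would) can be scripted. The following is a minimal sketch, not a PunchPlatform tool. It assumes the describe output follows the per-partition layout printed by the stock Apache Kafka topic tool (Partition / Leader / Replicas / Isr fields); the function names missing_brokers and port_open, the sample text, and the host name and port in the usage part are hypothetical and must be adapted to the actual platform.

```python
import re
import socket

# Per-partition line of the describe output. Assumed layout (stock Apache Kafka):
#   Topic: demo  Partition: 1  Leader: 3  Replicas: 2,3  Isr: 3
# Older releases omit the space after some colons, so whitespace is optional.
PARTITION_RE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(\S+)\s+Replicas:\s*([\d,]*)\s+Isr:\s*([\d,]*)"
)


def _parse_ids(csv_ids):
    """Turn '1,2,3' into {1, 2, 3}; an empty string gives an empty set."""
    return {int(item) for item in csv_ids.split(",") if item}


def missing_brokers(describe_output):
    """Map each broker ID that is a replica but not in sync to its partitions."""
    missing = {}
    for line in describe_output.splitlines():
        match = PARTITION_RE.search(line)
        if match is None:
            continue  # topic summary line or unrelated output
        partition = int(match.group(1))
        replicas = _parse_ids(match.group(3))
        in_sync = _parse_ids(match.group(4))
        for broker_id in sorted(replicas - in_sync):
            missing.setdefault(broker_id, []).append(partition)
    return missing


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (telnet-style check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical describe output for a topic whose broker 2 is down.
    sample = (
        "Topic: demo\tPartitionCount: 3\tReplicationFactor: 2\tConfigs: \n"
        "\tTopic: demo\tPartition: 0\tLeader: 1\tReplicas: 1,2\tIsr: 1,2\n"
        "\tTopic: demo\tPartition: 1\tLeader: 3\tReplicas: 2,3\tIsr: 3\n"
        "\tTopic: demo\tPartition: 2\tLeader: 3\tReplicas: 3,2\tIsr: 3\n"
    )
    print(missing_brokers(sample))
    # Host and port are placeholders; use the real broker or ZooKeeper address.
    print(port_open("broker-host.example", 9092))
```

A broker ID reported by missing_brokers is only a starting point: it still has to be matched to a server through the ZooKeeper records, as described in the steps above.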