Troubleshooting Ceph insertion slow-down

Why do that

Your Kibana dashboards may report that Ceph insertion has slowed down compared to the usual insertion rate. You can observe this on the Ceph Insertion Rate dashboard, or on the Ceph Kafka Backlog dashboard as a growing backlog.

Once the Kafka backlog fills up, you may lose logs, so this problem must be fixed before that happens.

Official Ceph documentation

The latest online documentation for the open-source Ceph product is available at http://docs.ceph.com/docs/master/#

An offline copy (at the time of release of this PunchPlatform software version) of this documentation is available here

What to do

  1. Check the Ceph cluster status. On a PunchPlatform administration station, run the following command:

    ceph -c /etc/ceph/MyClusterName.conf

    If an error occurs, check your cluster name (usually main) and check that both the configuration file (MyClusterName.conf) and the administration key (/etc/ceph/MyClusterName.client.admin.keyring) exist. Also check the permissions on these files.

    If the command succeeds, you now have a Ceph shell. Run the following command:

    status

  2. If you obtain HEALTH_OK, your Ceph cluster is fine, and we recommend looking for the cause elsewhere (the configuration of the archiving topologies, for example). If you obtain anything else (HEALTH_WARN with some worrying messages, for example), check that all OSDs (data nodes) and all monitors (monitoring nodes) are UP and IN your cluster.

  3. If one or more nodes are DOWN, or if some nodes are missing from the output (for example, you see 27 OSDs instead of the 28 declared in your cluster), the cluster is in a degraded state. List all missing nodes with the following command:

    osd dump

  4. Connect (by SSH) to each of the listed nodes and restart the OSDs and MONs with the following command pattern:

    sudo systemctl restart ceph-osd-MyClusterName@MyOsdInstance
    sudo systemctl restart ceph-mon-MyClusterName@MyMonInstance

Replace MyClusterName with your cluster name (usually main), MyOsdInstance with your OSD node identifier (1 for node 1, and so on) and MyMonInstance with your MON node identifier. Node identifiers are specified in the Ceph section of the punchplatform-deployment.settings configuration file. Use whichever of the two commands matches the daemon you want to restart, an OSD or a MON.
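
Steps 3 and 4 above can be partly scripted. The sketch below is illustrative only and is not part of PunchPlatform. The helper names list_unhealthy_osds and restart_command are hypothetical. The filter assumes that the plain-text output of osd dump has one line per OSD of the form "osd.<id> <up|down> <in|out> ...", which you should verify against your Ceph version. It only reports OSDs that appear in the output, so an OSD that is missing altogether (27 listed instead of 28) still has to be spotted by comparing against your deployment settings.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not PunchPlatform or Ceph commands.

# Print the name of every OSD that is not both "up" and "in".
# Reads plain-text `osd dump` output on stdin. Assumption: OSD lines look like
# "osd.<id> <up|down> <in|out> ..."; check this on your Ceph version.
list_unhealthy_osds() {
    awk '$1 ~ /^osd\.[0-9]+$/ && ($2 != "up" || $3 != "in") { print $1 }'
}

# Print (but do not run) the restart command following the pattern of step 4.
# $1: daemon type (osd or mon), $2: cluster name, $3: node identifier.
restart_command() {
    printf 'sudo systemctl restart ceph-%s-%s@%s\n' "$1" "$2" "$3"
}

# Possible usage, assuming a cluster named "main":
#   ceph -c /etc/ceph/main.conf osd dump | list_unhealthy_osds
#   restart_command osd main 1
```

Printing the restart command instead of executing it lets you review it before running it over SSH on the affected node.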