Troubleshooting Ceph insertion slow-down¶
Why do this¶
Your Kibana dashboards may report that Ceph insertion has slowed down compared to the usual insertion rates. You can observe this on a Ceph Insertion Rate dashboard, or on a Ceph Kafka Backlog dashboard showing an increasing backlog.
Once the Kafka backlog is full, you may lose logs, so this problem must be fixed.
Official Ceph documentation¶
The latest online documentation for the open-source Ceph product is available at http://docs.ceph.com/docs/master/#
What to do¶
-
Check the Ceph cluster status. On a PunchPlatform administration station, run the following command:
ceph -c /etc/ceph/MyClusterName.conf
If an error occurs, check your cluster name (usually main) and check that both the configuration file (MyClusterName.conf) and the administration key (/etc/ceph/MyClusterName.client.admin.keyring) exist. Also check the permissions on these files.
If the command succeeds, you now have a Ceph shell. Run the following command:
status
-
If the status is HEALTH_OK, your Ceph cluster is fine. We recommend investigating elsewhere (for example, the configuration of the archiving topologies). If you get anything else (for example HEALTH_WARN with some worrying messages), check whether all OSDs (data nodes) and all monitors (monitoring nodes) are UP and IN your cluster.
-
If one or more nodes are DOWN, or if some of your nodes are missing from the output (for example, you see 27 OSDs instead of the 28 OSDs declared in your cluster), the cluster is in a degraded state. List all the missing nodes with the following command:
osd dump
-
Connect (over SSH) to each of the nodes you listed and restart the OSDs and MONs, using the following command pattern:
sudo systemctl restart ceph-osd-MyClusterName@MyOsdInstance
sudo systemctl restart ceph-mon-MyClusterName@MyMonInstance
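
The diagnostic steps above can also be run non-interactively, which may be convenient for a quick check from a script. The following is only a minimal sketch, not an official PunchPlatform tool. The names `CONF`, `classify_health` and `list_down_osds` are hypothetical, and the filter assumes that `ceph osd dump` prints one line per OSD of the form `osd.<id> <up|down> <in|out> ...`, which you should verify against your Ceph version.

```shell
#!/bin/sh
# Sketch only: CONF, classify_health and list_down_osds are illustrative names.
# Adjust CONF to your cluster name, as in the steps above.
CONF="${CONF:-/etc/ceph/MyClusterName.conf}"

# Map the first token of the `ceph health` output to a short label.
classify_health() {
    case "$1" in
        HEALTH_OK*)   echo ok ;;
        HEALTH_WARN*) echo warn ;;
        HEALTH_ERR*)  echo error ;;
        *)            echo unknown ;;
    esac
}

# Read `ceph osd dump` output on stdin and print every OSD that is not
# both "up" and "in".
# Assumption: per-OSD lines look like "osd.<id> <up|down> <in|out> ...".
list_down_osds() {
    awk '$1 ~ /^osd\.[0-9]+$/ && ($2 != "up" || $3 != "in") { print $1 }'
}

# Only query the cluster when the ceph CLI and the configuration file exist.
if command -v ceph >/dev/null 2>&1 && [ -r "$CONF" ]; then
    state=$(classify_health "$(ceph -c "$CONF" health)")
    echo "cluster health: $state"
    if [ "$state" != "ok" ]; then
        # List the OSDs that need attention before restarting anything.
        ceph -c "$CONF" osd dump | list_down_osds
    fi
fi
```

The script deliberately stops at listing the affected OSDs. The restart itself is left as a manual step on each node, since the systemd unit names depend on your cluster and instance names.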