
HOWTO change ceph osd nodes in production

Why do that

If you want more insight after reading this page, please refer to the official CEPH documentation.

Use cases of this page

You need to check or change the usage of storage nodes within a working CEPH cluster because :

  • you have a short maintenance to perform on a storage node and you do not want to disturb the cluster or slow down the production
  • an OSD machine has been definitively lost and you want to replace or reinstall it
  • you want to add or remove a machine to change the overall storage capacity of your CEPH storage cluster

Prerequisites

You need to have an application administration linux account on the PunchPlatform administration server.

The standard test to ensure you have the appropriate environment and rights is to run [ceph --cluster main osd ls]. This should provide a list of numbers which are the ceph identifiers of all the known storage nodes of your ceph cluster.
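For example (assuming the cluster is named 'main', as in the rest of this page), the output is simply one OSD identifier per line:

    ceph --cluster main osd ls
      0
      1
      2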

Things to know

  • For resilience and performance purposes, each object (usually a file) stored in Ceph is split into chunks that are stored on separate OSDs/Object Storage Nodes (so that disk writes are faster, because they are parallelized on multiple servers, and so that the loss of a server loses only part of the data). Raid-like additional resilience chunks are computed and stored on yet other servers, so that lost chunks can be rebuilt, up to the loss of a number of servers equal to the number of resilience chunks.

    For example, a file will be split into 10 chunks, and 4 additional resilience chunks will be computed, so that the data stays available as long as no more than 4 storage servers are lost. In this example, for a 1 MByte file, a total of 1.4 MBytes of data will actually be dispatched on the cluster disks.

    Computing resilience chunks, or re-computing any missing chunk, takes I/O and CPU, because the cluster needs to read 10 of the remaining chunks and combine them in this computation.

    This overall chunking/resilience-computation mechanism implemented by CEPH is called erasure coding.

  • To have a good balancing of work in the cluster, the various file chunks are not always stored in the same OSDs (otherwise, in our example, only 14 OSDs would receive data and the others would receive nothing). BUT to avoid having a different chunk distribution for each file, the CEPH cluster computes a small list of chunk distribution variants, called PLACEMENT GROUPS (PGs), and each stored object then follows the chunk storage pattern of its PG. Therefore, when there is a failure or a recovery/rebalancing of data inside the cluster, many cluster information/monitoring/decisions apply to whole PGs, and not to individual files.

  • If an OSD is down for some time, then ceph will TEMPORARILY assign the role of the missing OSD to another one, so that the object writes that occur create objects as resilient as if there was no missing OSD. At this point, an OSD starts to "act as" the missing OSD for some of the Placement Groups.
  • If an OSD is down for too long, then ceph will start FULLY REBUILDING resilience on the remaining nodes, which will cause a lot of I/O and CPU usage (because ceph will recompute the raid-like algorithm to rebuild the missing data chunks that are now off-line). This may slow down your production too much!
  • If, after an OSD has been shut down for too long, you just switch the failed OSD back on, you may cause a lot of I/O, because ceph will move back to their proper place all the data that has been written on the other nodes but should normally have been stored on the failed one.
  • If we do not take care when inserting a new OSD in an existing cluster, then CEPH will immediately and massively move parts of the existing data chunks to the new OSD, in order to balance storage usage among the nodes. This effect can be reduced by appropriate usage of the node weight, a floating-point number which controls the relative usage of the nodes. A node with a weight of 0 receives no data (and will be freed of its existing data by moving it to other nodes). The normal weight of a node is 1.0 (see the example commands after this list).
  • An OSD is identified by its identifier (an integer, starting at 0 for the first node). If an OSD machine is definitively lost, then the OSD number cannot be reused for a replacement node before removing it from the memory of the cluster; so just running the PunchPlatform deployer on a brand new VM with the same hostname will NOT work if you have not first told the cluster that the old OSD no longer exists.
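If you want to check these notions on your own cluster, the following read-only commands can help. This is only a sketch: it assumes the cluster is named 'main' as elsewhere on this page, and the erasure-code profile name ('default') may differ in your deployment.

    # OSD identifiers, hosts, weights and up/down status
    ceph --cluster main osd tree
    # available erasure-code profiles and their data/resilience chunk counts (k/m)
    ceph --cluster main osd erasure-code-profile ls
    ceph --cluster main osd erasure-code-profile get default
    # placement groups and the OSDs they map to
    ceph --cluster main pg dump pgs_brief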

Preparing a maintenance/temporary shutdown of an OSD

Check out the official Ceph documentation on this topic.

You can prevent rebalancing during maintenance of an osd by issuing :

ceph osd set noout

You can reactivate automatic rebalancing after maintenance :

ceph osd unset noout
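To check whether the flag is currently applied (a quick verification, assuming the 'main' cluster name used elsewhere on this page), you can look at the flags line of the OSD map dump; the cluster health also reports a warning as long as noout is set:

    ceph --cluster main osd dump | grep flags
    ceph --cluster main health detail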

Warning

If you have prevented automatic rebalancing, and during the maintenance you encounter a failure of another OSD, and you want to rebuild resilience to take the failed OSD into account without waiting for the end of the maintenance phase, then you can manually remove the failed osd from the cluster by issuing a 'ceph osd out osd.<identifier of the failed osd>'. This will cause rebalancing/reconstruction of resilience.

Adding a new OSD to a running cluster

  • declare the additional OSD in punchplatform-deployment.settings with an identifier higher than the previously included OSDs AND with the specific parameter initial_weight : 0.0, as in the following example :
"LMCSOCCPH04I" : {
    "id" : 3,
    "device" : "/dev/sdb",
    "production_address" : "LMCSOCCPH04I.prod.punchbox.thales",
    "initial_weight" : 0.0
}
  • check that your cluster is in nominal status (issue a [ceph --cluster main health] from the application administration linux account on the PunchPlatform administration server)
  • run punchplatform-deployer.sh --deploy -t ceph from the deployment environment (additional options may be needed, as for the initial deployment, regarding administration accounts; see the cluster deployment section)

The created OSD should join the cluster (issue a [ceph --cluster main osd ls] from the application administration linux account on the PunchPlatform administration server).

The created OSD has an initial weight of 0 (which means no data is stored on this node).

  • To start rebalancing some data within the cluster, issue [ceph --cluster main osd crush reweight osd.<identifier of the new osd> 0.01]. WARNING : this will cause the cluster health to go into a non-nominal state as long as the data has not been rebalanced as required

You can confirm that new data is flowing to the new OSD by issuing :

ceph --cluster main pg dump osds
  • Once the cluster has gone back to nominal HEALTH, and the actual impact of the partial rebalance on production has been assessed, you can gradually continue rebalancing by issuing further reweight commands, up to a 1.0 factor (which means the node must contain the same amount of data as the other nominal nodes), as sketched below
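A possible reweighting sequence is sketched below. It is only illustrative: it assumes the new OSD is osd.3 (matching the example settings above), and the intermediate weights and the pace at which you raise them must be chosen according to the impact you observe on your production.

    ceph --cluster main osd crush reweight osd.3 0.25
    ceph --cluster main health
    ceph --cluster main pg dump osds
    ceph --cluster main osd crush reweight osd.3 0.5
    ceph --cluster main health
    ceph --cluster main osd crush reweight osd.3 1.0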

Definitively removing a (failed) OSD from a running cluster

Warning

This procedure should only be used for OSDs which cannot be restarted and can therefore never again be set as 'in' the cluster (for example, after the loss of the content of the OSD storage partition on this node)

  • identify that your OSD is the only missing one in the cluster (issue a ceph --cluster main osd dump from the application administration linux account on the PunchPlatform administration server and check each osd status); note the identifier (the number after osd.) of your failed osd. You can also look up the OSD hostname and match it in punchplatform-deployment.settings

    ceph health detail
      HEALTH_WARN 1/10 in osds are down
      osd.7 is down since epoch 23, last address 192.168.106.220:6800/11080
  • remove node from cluster usage by issuing :

    ceph --cluster main osd out osd.<identifier of the failed osd>
  • check the cluster health detail again to ensure there is no error in your cluster

  • definitively remove node from cluster awareness :

    ceph --cluster main osd crush remove osd.<identifier of the failed osd>
    ceph --cluster main auth del osd.<identifier of the failed osd>
    ceph --cluster main osd rm <identifier of the failed osd>
  • either remove the osd from the punchplatform-deployment.settings and re-deploy the 'ceph' tag (if you do not need this osd anymore), OR re-deploy a new VM with the same network settings and clean storage partitions, and then follow the same procedure as for adding a new OSD to a running cluster.
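Once these steps are done, you can check that the cluster has completely forgotten the removed OSD (again assuming the 'main' cluster name used elsewhere on this page): its identifier should no longer appear in the CRUSH tree nor in the OSD map, and the health should go back to nominal once the resilience reconstruction is over.

    ceph --cluster main osd tree
    ceph --cluster main osd dump
    ceph --cluster main health detail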