Replacing a faulty disk in a CEPH cluster

I am running an LXD cluster with 5 nodes using CEPH as storage. One of the disks became faulty and the corresponding OSD (Object Storage Daemon) went down. The physical disk had to be replaced. Here is the procedure I used to finish the job:

To check the status of the OSD tree:

#ceph osd tree

ID   CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
 -1         20.37404  root default                                
 -7          4.00249      host node1                           
  4    hdd   2.18320          osd.4        down   1.00000  1.00000
  5    hdd   1.81929          osd.5          up   1.00000  1.00000
 -3          3.63858      host node2                            
  0    hdd   1.81929          osd.0          up   1.00000  1.00000
  1    hdd   1.81929          osd.1          up   1.00000  1.00000
 -5          1.81940      host node3                            
  2    hdd   0.90970          osd.2          up   1.00000  1.00000
  3    hdd   0.90970          osd.3          up   1.00000  1.00000
 -9          3.63620      host node4                               
  6    hdd   1.81810          osd.6          up   1.00000  1.00000
  7    hdd   1.81810          osd.7          up   1.00000  1.00000
-11          7.27737      host node5                             
  8    hdd   3.63869          osd.8          up   1.00000  1.00000
  9    hdd   3.63869          osd.9          up   1.00000  1.00000
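
On a larger cluster the down OSD is harder to spot by eye. The following is a minimal sketch of a filter for the tree output, not part of the original procedure. The function name list_down_osds is made up for illustration, and it assumes every OSD row has a device class (such as hdd), so that the name is in the fourth column and the status in the fifth.

```shell
# Print the names of OSDs whose STATUS column reads "down".
# Reads `ceph osd tree` output on stdin, so it can sit at the end of a pipe.
# Assumption: each OSD row has a CLASS value; without one the columns shift.
list_down_osds() {
    awk '$4 ~ /^osd\.[0-9]+$/ && $5 == "down" { print $4 }'
}
```

Used as `ceph osd tree | list_down_osds`, it would print osd.4 for the tree shown above.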

The tree shows that osd.4 is down. The refilling process (backfilling, in CEPH's own terms) should already have started. It can be checked with:

#ceph status

The faulty OSD needed to be marked out, stopped, and removed from the CRUSH map:

#ceph osd out osd.4
#ceph osd stop osd.4
#ceph osd crush remove osd.4

Wait until the refilling has completed.
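
The wait can be scripted instead of re-running the status command by hand. This is a rough sketch rather than what I actually ran: the function name wait_for_health_ok and its two parameters are invented here, and it assumes that `ceph health` printing HEALTH_OK is a good enough signal. That is stricter than "refilling finished", so an unrelated warning would keep it waiting until the attempts run out.

```shell
# Poll `ceph health` until it reports HEALTH_OK or the attempts run out.
# $1 = seconds between polls (default 60), $2 = maximum polls (default 360).
# Returns 0 once the cluster is healthy, 1 if it never got there.
# Note: interval, attempts and i are plain (global) shell variables.
wait_for_health_ok() {
    interval=${1:-60}
    attempts=${2:-360}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        case "$(ceph health 2>/dev/null)" in
            HEALTH_OK*) return 0 ;;
        esac
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

For example, `wait_for_health_ok 30 720 && echo "refill done"` polls every 30 seconds for up to six hours.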

Delete the authentication key for osd.4:

#ceph auth del osd.4

Remove osd.4 from the cluster:

#ceph osd rm osd.4

Now, replace the physical disk.

On the node where the disk was replaced, create the new OSD with ceph-volume:

#ceph-volume lvm create --data /dev/sdb
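
Before running the command above, it may be worth confirming that the device really is the new, empty disk, since a replacement disk does not always come back under the same name. The check below is only a sketch under that assumption. The helper name disk_is_bare is hypothetical, and it treats a disk as bare when lsblk lists nothing but the disk itself (no partitions or LVM volumes under it).

```shell
# Succeed only if lsblk reports exactly one entry for the device,
# i.e. the disk itself with no partitions or LVM volumes under it.
# A missing device yields zero entries and therefore fails the check.
disk_is_bare() {
    entries=$(lsblk -n -o TYPE "$1" 2>/dev/null | awk 'END { print NR }')
    [ "$entries" -eq 1 ]
}
```

It could guard the create step like this: `disk_is_bare /dev/sdb && ceph-volume lvm create --data /dev/sdb`.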

Restart the ceph-osd service:

#systemctl restart ceph-osd@4

The CEPH cluster will start refilling the new OSD. Wait until it is done; in my case, it took around 3 hours. Then check the status of the cluster:

#ceph osd tree
