I am running an LXD cluster with 5 nodes that uses Ceph for storage. One of the disks became faulty and the corresponding OSD (Object Storage Daemon) went down, so the physical disk had to be replaced. Here is the procedure I used to finish the job:
To check the status of the OSD tree:
#ceph osd tree
ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       20.37404 root default
 -7        4.00249     host node1
  4   hdd  2.18320         osd.4     down  1.00000 1.00000
  5   hdd  1.81929         osd.5       up  1.00000 1.00000
 -3        3.63858     host node2
  0   hdd  1.81929         osd.0       up  1.00000 1.00000
  1   hdd  1.81929         osd.1       up  1.00000 1.00000
 -5        1.81940     host node3
  2   hdd  0.90970         osd.2       up  1.00000 1.00000
  3   hdd  0.90970         osd.3       up  1.00000 1.00000
 -9        3.63620     host node4
  6   hdd  1.81810         osd.6       up  1.00000 1.00000
  7   hdd  1.81810         osd.7       up  1.00000 1.00000
-11        7.27737     host node5
  8   hdd  3.63869         osd.8       up  1.00000 1.00000
  9   hdd  3.63869         osd.9       up  1.00000 1.00000
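On a larger cluster it is easy to miss a down OSD in the tree output. The following is a minimal sketch of a filter that prints only the OSDs reported as down. The function name list_down_osds is my own (hypothetical), and it assumes the STATUS column comes directly after the osd.N name, as in the output above.

```shell
# List OSDs that "ceph osd tree" reports as down.
# Reads the tree output on stdin, so it also works on saved output.
# Assumption: the status word follows the osd.N name directly.
list_down_osds() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down")
                print $i
    }'
}

# Usage on a live cluster:
#   ceph osd tree | list_down_osds
```

Scanning every field instead of a fixed column number keeps the filter working whether or not the CLASS column is present.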
Note that osd.4 is down. Ceph should already have started recovery (backfilling) onto the remaining OSDs. Progress can be checked in the cluster status, for example:
#ceph -s
The faulty OSD needs to be marked out, stopped, and removed from the CRUSH map:
#ceph osd out osd.4
#ceph osd stop osd.4
#ceph osd crush remove osd.4
Wait until backfilling completes.
Delete the authentication key for osd.4 and remove the OSD from the cluster:
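Rather than checking by hand, the wait can be scripted. This is a minimal sketch that polls "ceph health" until it reports HEALTH_OK. The function name wait_for_health_ok and its default interval and retry count are my own illustrative choices, and it assumes nothing else is keeping the cluster in a warning state.

```shell
# Poll "ceph health" until it reports HEALTH_OK.
# $1: seconds between checks (illustrative default: 60)
# $2: maximum number of checks (illustrative default: 360)
wait_for_health_ok() {
    interval=${1:-60}
    max_tries=${2:-360}
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        status=$(ceph health 2>/dev/null)
        case "$status" in
            HEALTH_OK*) echo "cluster healthy"; return 0 ;;
        esac
        echo "waiting: $status"
        tries=$((tries + 1))
        sleep "$interval"
    done
    echo "gave up after $max_tries checks" >&2
    return 1
}

# Usage: check every 2 minutes, for up to 4 hours
#   wait_for_health_ok 120 120
```

If the cluster has unrelated warnings, HEALTH_OK may never appear; in that case read the placement group states in the cluster status instead.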
#ceph auth del osd.4
#ceph osd rm osd.4
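To avoid typing the wrong ID into one of these commands, the removal steps can be generated for a given OSD number and reviewed first. This is a small sketch that only prints the commands used in this post and executes nothing. The function name osd_removal_plan is hypothetical.

```shell
# Print the removal commands from this post for one OSD id.
# Nothing is executed; the output is meant to be reviewed and run by hand.
osd_removal_plan() {
    id=$1
    # Accept only a plain number such as 4, not "osd.4" or an empty value.
    case "$id" in
        ''|*[!0-9]*)
            echo "usage: osd_removal_plan <numeric-osd-id>" >&2
            return 1
            ;;
    esac
    cat <<EOF
ceph osd out osd.$id
ceph osd stop osd.$id
ceph osd crush remove osd.$id
# wait for backfilling to finish before continuing
ceph auth del osd.$id
ceph osd rm osd.$id
EOF
}

# Usage:
#   osd_removal_plan 4
```

Printing rather than executing keeps the wait between the CRUSH removal and the final deletion a manual decision.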
Now, replace the physical disk.
On the node where the disk was replaced, create the new OSD with ceph-volume:
#ceph-volume lvm create --data /dev/sdb
Restart the ceph-osd service:
#systemctl restart ceph-osd@4
The Ceph cluster will start backfilling data onto the new OSD. Wait until it is done; in my case, it took around 3 hours. Then check the status of the cluster:
#ceph osd tree