I am running an LXD cluster with 5 nodes, using Ceph as storage. One of the disks became faulty and the corresponding OSD (Object Storage Daemon) went down, so the physical disk had to be replaced. Here is the procedure I used to finish the job:
To check the status of the OSD tree:
#ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME         STATUS  REWEIGHT  PRI-AFF
 -1         20.37404  root default
 -7          4.00249      host node1
  4  hdd     2.18320          osd.4     down    1.00000   1.00000
  5  hdd     1.81929          osd.5     up      1.00000   1.00000
 -3          3.63858      host node2
  0  hdd     1.81929          osd.0     up      1.00000   1.00000
  1  hdd     1.81929          osd.1     up      1.00000   1.00000
 -5          1.81940      host node3
  2  hdd     0.90970          osd.2     up      1.00000   1.00000
  3  hdd     0.90970          osd.3     up      1.00000   1.00000
 -9          3.63620      host node4
  6  hdd     1.81810          osd.6     up      1.00000   1.00000
  7  hdd     1.81810          osd.7     up      1.00000   1.00000
-11          7.27737      host node5
  8  hdd     3.63869          osd.8     up      1.00000   1.00000
  9  hdd     3.63869          osd.9     up      1.00000   1.00000
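On a larger cluster it is easy to miss a down OSD in a long tree. The following is a minimal sketch of a filter that prints only the OSDs marked down. The function name list_down_osds is my own invention, and it assumes the plain-text column layout shown above, where each OSD name is immediately followed by its status.

```shell
#!/bin/sh
# Print the name of every OSD whose STATUS column is "down".
# Reads plain `ceph osd tree` output on stdin.
# list_down_osds is a hypothetical helper name, not part of Ceph.
list_down_osds() {
    awk '{
        for (i = 1; i < NF; i++) {
            # An OSD row contains a field like "osd.4" followed by its status.
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") print $i
        }
    }'
}

# Usage on a cluster node (needs a working ceph CLI):
#   ceph osd tree | list_down_osds
```

Run against the tree above, this should print only osd.4.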
The output shows that osd.4 is down. Ceph should already have started recovering (backfilling) the affected data onto the remaining OSDs. This can be checked with:
#ceph status
The faulty OSD needs to be marked out, stopped, and removed from the CRUSH map:
#ceph osd out osd.4
#ceph osd stop osd.4
#ceph osd crush remove osd.4
Wait until the recovery has completed.
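Rather than re-running the status command by hand, the wait can be scripted. This is only a sketch: the function name wait_for_health_ok and its two parameters are my own, and it assumes that HEALTH_OK from `ceph health` is a good enough signal that recovery is finished. If the cluster carries unrelated warnings it will never reach HEALTH_OK, and the loop will give up after the maximum number of polls.

```shell
#!/bin/sh
# Poll `ceph health` until it reports HEALTH_OK.
# $1: seconds between polls (default 30)
# $2: maximum number of polls (default 720, i.e. about 6 hours at 30 s)
# wait_for_health_ok is a hypothetical helper name, not part of Ceph.
wait_for_health_ok() {
    interval=${1:-30}
    tries=${2:-720}
    while [ "$tries" -gt 0 ]; do
        status=$(ceph health 2>/dev/null)
        case $status in
            HEALTH_OK*) echo "cluster is healthy"; return 0 ;;
        esac
        echo "still waiting: $status"
        tries=$((tries - 1))
        sleep "$interval"
    done
    echo "gave up waiting for HEALTH_OK" >&2
    return 1
}

# Usage on a cluster node:
#   wait_for_health_ok 60 && echo "safe to continue"
```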
Delete the authentication key of osd.4:
#ceph auth del osd.4
Remove osd.4 from the cluster:
#ceph osd rm osd.4
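To reuse the same removal steps for another OSD without retyping the ID in every command, a small helper can print them as a checklist. This is a sketch only: osd_removal_plan is a name I made up, it simply repeats the commands from this post with the ID substituted, and it deliberately executes nothing so the output can be reviewed and run by hand.

```shell
#!/bin/sh
# Print the removal commands from this post for a given OSD id, in order.
# Nothing is executed; the output is meant to be reviewed and run by hand.
# $1: numeric OSD id, e.g. 4
# osd_removal_plan is a hypothetical helper name, not part of Ceph.
osd_removal_plan() {
    id=$1
    case "$id" in
        ''|*[!0-9]*) echo "usage: osd_removal_plan <numeric-id>" >&2; return 2 ;;
    esac
    cat <<EOF
ceph osd out osd.$id
ceph osd stop osd.$id
ceph osd crush remove osd.$id
# wait here until recovery has finished (check with: ceph status)
ceph auth del osd.$id
ceph osd rm osd.$id
EOF
}

# Usage:
#   osd_removal_plan 4
```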
Now, replace the physical disk.
On the node where the disk was replaced, create the new OSD with ceph-volume (in my case the new disk is /dev/sdb):
#ceph-volume lvm create --data /dev/sdb
Restart the ceph-osd service:
#systemctl restart ceph-osd@4
The Ceph cluster will now start backfilling data onto the new OSD. Wait until it is done; in my case, it took around 3 hours. Then check the status of the cluster again:
#ceph osd tree
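To confirm the result without reading the whole tree, the status of a single OSD can be pulled out of the output. Again this is a sketch under assumptions: osd_state is a made-up helper name, and it relies on the status column directly following the OSD name in the plain-text output.

```shell
#!/bin/sh
# Print the STATUS column ("up" or "down") for one OSD.
# Reads plain `ceph osd tree` output on stdin; $1 is the OSD name, e.g. osd.4.
# Returns non-zero if the OSD is not listed at all.
# osd_state is a hypothetical helper name, not part of Ceph.
osd_state() {
    awk -v name="$1" '
        {
            for (i = 1; i < NF; i++) {
                # The field right after the OSD name is its status.
                if ($i == name) { print $(i + 1); found = 1; exit }
            }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on a cluster node:
#   ceph osd tree | osd_state osd.4
```

Once the new disk is in service, this should report up for osd.4.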