I am running an LXD cluster with 5 nodes using Ceph as storage. One of the disks became faulty and the corresponding OSD (Object Storage Daemon) went down. The physical disk had to be replaced. Here is the procedure I used to finish the job:
To check the status of the OSD tree:
ceph osd tree
 ID  CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         20.37404  root default
 -7          4.00249      host node1
  4    hdd   2.18320          osd.4     down   1.00000  1.00000
  5    hdd   1.81929          osd.5       up   1.00000  1.00000
 -3          3.63858      host node2
  0    hdd   1.81929          osd.0       up   1.00000  1.00000
  1    hdd   1.81929          osd.1       up   1.00000  1.00000
 -5          1.81940      host node3
  2    hdd   0.90970          osd.2       up   1.00000  1.00000
  3    hdd   0.90970          osd.3       up   1.00000  1.00000
 -9          3.63620      host node4
  6    hdd   1.81810          osd.6       up   1.00000  1.00000
  7    hdd   1.81810          osd.7       up   1.00000  1.00000
-11          7.27737      host node5
  8    hdd   3.63869          osd.8       up   1.00000  1.00000
  9    hdd   3.63869          osd.9       up   1.00000  1.00000
The output shows that osd.4 is down. Ceph should already have started recovering the affected data onto the remaining OSDs. Progress can be checked with:
ceph status
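On a larger cluster the down entry is easy to miss in the tree output. As a minimal sketch, the listing can be filtered with awk. The helper name down_osds is hypothetical, and the filter assumes the column layout shown above (ID, CLASS, WEIGHT, NAME, STATUS), so it would need adjusting if the output has no CLASS column.

```shell
# Print the names of all OSDs whose STATUS column reads "down".
# Assumption: OSD rows look like "4 hdd 2.18320 osd.4 down 1.00000 1.00000",
# i.e. the name is field 4 and the status is field 5.
down_osds() {
    awk '$4 ~ /^osd\.[0-9]+$/ && $5 == "down" { print $4 }'
}

# Usage on a cluster node:
#   ceph osd tree | down_osds
```

Host and root rows have fewer fields, so they never match the filter.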
The faulty OSD needs to be marked out, stopped and removed from the CRUSH map:
ceph osd out osd.4
ceph osd stop osd.4
ceph osd crush remove osd.4
Wait until the recovery has completed.
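Rather than re-running the status command by hand, the wait can be scripted. The sketch below polls the cluster health until it reports HEALTH_OK. The function name wait_until_healthy and the 60-second default interval are my own choices, and treating HEALTH_OK as "recovery finished" is an assumption: a cluster with unrelated warnings would never leave this loop.

```shell
# Poll "ceph health" until the cluster reports HEALTH_OK.
# Assumption: HEALTH_OK is a good enough signal that recovery is done.
wait_until_healthy() {
    interval="${1:-60}"    # seconds between polls (hypothetical default)
    until ceph health | grep -q '^HEALTH_OK'; do
        sleep "$interval"
    done
}

# Usage on a cluster node:
#   wait_until_healthy 30
```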
Delete the authentication key of osd.4:
ceph auth del osd.4
Remove osd.4 from the cluster:
ceph osd rm osd.4
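As an alternative to the separate crush remove, auth del and osd rm steps, Ceph Luminous and later provide a single purge command that covers all three. I did not use it here, so the following is only a sketch under that version assumption; the wrapper name purge_osd is hypothetical. The OSD must already be down, as osd.4 was in my case.

```shell
# Mark an OSD out and purge it (CRUSH entry, auth key and OSD id) in one go.
# Assumption: Ceph Luminous or later, and the OSD daemon is already down.
purge_osd() {
    id="$1"    # numeric OSD id, e.g. 4
    ceph osd out "osd.$id" &&
    ceph osd purge "$id" --yes-i-really-mean-it
}

# Usage on a cluster node:
#   purge_osd 4
```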
Now, replace the physical disk.
On the node where the disk was replaced, create the new OSD with ceph-volume (here the new disk is /dev/sdb):
#ceph-volume lvm create --data /dev/sdb
Restart the ceph-osd service:
#systemctl restart ceph-osd@4
The Ceph cluster will start backfilling data onto the new OSD. Wait until it is done; in my case, it took around 3 hours to complete. Then check the status of the Ceph cluster:
ceph osd tree
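To confirm the result without reading the whole tree, the state of a single OSD can be extracted from the same output. This is a small sketch: osd_state is a hypothetical helper, and it again assumes the name sits in field 4 and the status in field 5 of each OSD row.

```shell
# Print the STATUS column ("up" or "down") of the named OSD.
# Assumption: OSD rows look like "4 hdd 2.18320 osd.4 up 1.00000 1.00000".
osd_state() {
    awk -v name="$1" '$4 == name { print $5 }'
}

# Usage on a cluster node:
#   ceph osd tree | osd_state osd.4
```

After a successful replacement the replaced OSD should be reported as up.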