What is the right way to fix ext4 errors/corruptions on volumes on the Ceph storage?

tregubovav · December 7, 2024, 8:37pm

After a problem with my cluster, I found that 4 volumes (3 rootfs volumes and one data volume) in 3 container instances have multiple errors like the ones below (dmesg output from the incus cluster node):

[ 9055.177230] EXT4-fs error (device rbd7): ext4_lookup:1857: inode #786433: comm prometheus: deleted inode referenced: 786440
[ 9150.545493] EXT4-fs error (device rbd7): ext4_validate_block_bitmap:423: comm ext4lazyinit: bg 127: bad block bitmap checksum
[ 9152.366853] EXT4-fs error (device rbd7): ext4_lookup:1857: inode #786433: comm prometheus: deleted inode referenced: 786440
[ 9152.380677] EXT4-fs error (device rbd7): __ext4_find_entry:1693: inode #786568: comm prometheus: checksumming directory block 0
[ 9152.503874] EXT4-fs error (device rbd7): ext4_validate_inode_bitmap:105: comm prometheus: Corrupt inode bitmap - block_group = 96, inode_bitmap = 3145744
[ 9152.979961] EXT4-fs error (device rbd7): ext4_lookup:1857: inode #786433: comm prometheus: deleted inode referenced: 786440
[ 9152.991895] EXT4-fs error (device rbd7): __ext4_find_entry:1693: inode #786568: comm prometheus: checksumming directory block 0
[ 9157.503979] EXT4-fs error (device rbd7): ext4_validate_block_bitmap:423: comm kworker/u8:5: bg 111: bad block bitmap checksum
[ 9159.739104] EXT4-fs error (device rbd7): ext4_lookup:1857: inode #786433: comm prometheus: deleted inode referenced: 786440

[  266.401467] EXT4-fs error (device rbd2): ext4_validate_block_bitmap:423: comm incusd: bg 1: bad block bitmap checksum
[  266.412380] EXT4-fs error (device rbd2) in ext4_mb_clear_bb:6542: Filesystem failed CRC
[  266.423542] EXT4-fs error (device rbd2): ext4_validate_block_bitmap:423: comm incusd: bg 15: bad block bitmap checksum
[  268.710419] EXT4-fs error (device rbd2): ext4_validate_block_bitmap:423: comm ext4lazyinit: bg 49: bad block bitmap checksum
[  271.081652] EXT4-fs error (device rbd3): ext4_validate_block_bitmap:423: comm ext4lazyinit: bg 1: bad block bitmap checksum
[  273.589060] EXT4-fs error (device rbd7): ext4_validate_block_bitmap:423: comm ext4lazyinit: bg 127: bad block bitmap checksum
[  277.987183] EXT4-fs error (device rbd7): ext4_lookup:1857: inode #786433: comm prometheus: deleted inode referenced: 786440
[  278.004281] EXT4-fs error (device rbd7): __ext4_find_entry:1693: inode #786568: comm prometheus: checksumming directory block 0
[  278.131252] EXT4-fs error (device rbd7): ext4_validate_inode_bitmap:105: comm prometheus: Corrupt inode bitmap - block_group = 96, inode_bitmap = 3145744
[  278.531097] EXT4-fs error (device rbd3): ext4_validate_inode_bitmap:105: comm samba: Corrupt inode bitmap - block_group = 0, inode_bitmap = 137
[  278.550398] EXT4-fs error (device rbd3) in ext4_free_inode:362: Filesystem failed CRC
[  278.721024] EXT4-fs error (device rbd7): ext4_lookup:1857: inode #786433: comm prometheus: deleted inode referenced: 786440
[  278.747239] EXT4-fs error (device rbd7): __ext4_find_entry:1693: inode #786568: comm prometheus: checksumming directory block 0
[  281.316261] EXT4-fs error (device rbd3): ext4_lookup:1857: inode #16: comm samba: deleted inode referenced: 25
[  281.326675] EXT4-fs error (device rbd3): ext4_lookup:1857: inode #16: comm samba: deleted inode referenced: 25
[  282.974984] EXT4-fs error (device rbd7): ext4_validate_block_bitmap:423: comm kworker/u8:22: bg 111: bad block bitmap checksum
[  339.957217] EXT4-fs error (device rbd7): ext4_lookup:1857: inode #786433: comm prometheus: deleted inode referenced: 786440

All impacted rootfs and data volumes attached to the instances are located on Ceph storage (actually MicroCeph).

What is the right way to fix these filesystem errors?
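To see at a glance which rbd devices are reporting errors, the kernel log can be grouped by device. This is only a minimal sketch: the function name ext4_errors_by_device is made up for illustration, and it assumes the log lines use the "EXT4-fs error (device rbdX)" format shown above.

```shell
#!/usr/bin/env bash
# Count EXT4-fs error lines per block device.
# Reads kernel log text on stdin and prints "<device> <count>" per line.
# The function name is hypothetical; only grep, sed, sort, uniq and awk are used.
ext4_errors_by_device() {
    grep -o 'EXT4-fs error (device [^)]*)' \
        | sed -e 's/.*(device //' -e 's/)$//' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Example on a cluster node:
#   sudo dmesg | ext4_errors_by_device
```

For the output quoted above, this would list rbd2, rbd3 and rbd7, which are the devices whose images need to be checked.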

stgraber · December 7, 2024, 8:47pm

Stop the instance, manually map the instance's volume using rbd map, and then run fsck.ext4 on the resulting /dev/rbdX device.

tregubovav · December 8, 2024, 7:00am

Steps for fixing filesystem errors on a Ceph block device:

1. Identify the rbd image you need to repair. Use sudo rbd ls --pool <pool name for incus storage>.
2. Stop the instance.
3. Map the image on the incus host using sudo rbd map <image name> --pool <pool name for incus storage>. The command prints the block device name, /dev/rbdX.
4. Run e2fsck /dev/rbdX or fsck -t ext4 /dev/rbdX and follow the prompts.
5. Unmap the image using sudo rbd unmap /dev/rbdX.
6. Start the instance.
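The steps above could be wrapped in a small shell function along these lines. Treat it as a sketch under stated assumptions rather than a tested tool: the function name repair_rbd_ext4 is invented here, the instance name, pool and image are passed in by the caller, and the -f flag (force a full check) is an addition that the steps above do not mention.

```shell
#!/usr/bin/env bash
# Sketch: check and repair the ext4 filesystem on one RBD image.
# Arguments (all supplied by the caller):
#   $1  incus instance name
#   $2  Ceph pool used by the incus storage pool
#   $3  RBD image name, as listed by: sudo rbd ls --pool <pool>
repair_rbd_ext4() {
    local instance="$1" pool="$2" image="$3" dev rc

    # The filesystem must not be in use while e2fsck runs.
    # Note: this aborts if the stop command fails for any reason.
    incus stop "$instance" || return 1

    # rbd map prints the block device it created (/dev/rbdX).
    dev="$(sudo rbd map "$image" --pool "$pool")" || return 1

    # Interactive check; -f forces it even if the filesystem is marked clean.
    sudo e2fsck -f "$dev"
    rc=$?

    # Unmap regardless of the e2fsck result.
    sudo rbd unmap "$dev"

    # e2fsck exit status below 4 means clean or errors corrected;
    # 4 and above means uncorrected errors or an operational failure.
    if [ "$rc" -ge 4 ]; then
        echo "e2fsck exited with status $rc; leaving $instance stopped" >&2
        return "$rc"
    fi

    incus start "$instance"
}

# Example (placeholders, not real names):
#   repair_rbd_ext4 <instance name> <pool name for incus storage> <image name>
```

Leaving the instance stopped when e2fsck reports uncorrected errors is a deliberate choice in this sketch, so that a still-damaged filesystem is not put back into service automatically.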