I don’t think there’s been any power loss: the node has been running for 56 days and is still up. There are also 894 GB of free disk space in the root volume right now.
Should I perform a system reboot as a last resort? Would that help? @stgraber
I don’t want the reboot to corrupt the configuration or the LXD containers.
The good news is that you have 2 snapshots that contain data after the corrupt segment (15360 > 12487 and 16384 > 12487), so you will likely not incur data loss.
Let me double check why it’s actually trying to load that segment and come back to you.
We store checksums when writing the 000000xxx-000000xxx files, then recalculate and compare them when loading those files. When the checksums don’t match, we report the error that you saw. The possible causes are an issue with your disk or a bug in the implementation. Because this operation is carried out frequently, I suspect a problem with your disk is more likely.
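The general write-with-checksum / verify-on-load pattern described above can be sketched like this. This is an illustrative Python sketch, not dqlite’s actual C implementation; the function names, the CRC32 choice, and the 4-byte header layout are all assumptions for demonstration only:

```python
import zlib

def write_segment(path: str, data: bytes) -> None:
    # Compute a CRC32 over the payload and store it alongside the data
    # (here as a 4-byte big-endian header; layout is illustrative).
    checksum = zlib.crc32(data)
    with open(path, "wb") as f:
        f.write(checksum.to_bytes(4, "big"))
        f.write(data)

def load_segment(path: str) -> bytes:
    # Recompute the checksum on load and compare it with the stored one.
    with open(path, "rb") as f:
        stored = int.from_bytes(f.read(4), "big")
        data = f.read()
    if zlib.crc32(data) != stored:
        # A mismatch means the bytes on disk changed after they were
        # written (bit rot, partial write) or there is a bug somewhere.
        raise IOError(f"corrupt segment {path}: checksum mismatch")
    return data
```

A mismatch on load cannot tell you *which* side is at fault, only that the bytes read back differ from the bytes written, which is why a frequently exercised code path points more towards the disk than the code.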
In /var/snap/lxd/common/lxd/database/global/ there are basically two types of database files.
Segment files, e.g. 0000000000007683-0000000000008349. These contain a limited number of database entries, numbered from 7683 to 8349 in this example.
Snapshot files, e.g. snapshot-1-15360-3022981382. These contain all of the database entries up to a certain index, in this case 15360. Snapshot files are generally small because they are compressed.
When the database starts up, it loads the latest snapshot file, in this case snapshot-1-16384-4522128417, and then loads all entries in the segment files that overlap with or come after that snapshot. In your case it will load the entries in these segment files:
0000000000015955-0000000000016768 /* because 15955 < 16384 < 16768 */
0000000000016769-0000000000016828 /* because 16384 < 16769 */
The entries in the other segment files are already contained in the snapshot, so their information is not needed; that is why they could be deleted.
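The selection rule above (load a segment only if its last entry comes after the snapshot’s index) can be sketched in a few lines. This is illustrative Python, not dqlite’s actual startup code; the function name and tuple representation are assumptions:

```python
def segments_to_load(snapshot_index: int, segments: list[tuple[int, int]]) -> list[tuple[int, int]]:
    # Keep only segments whose last entry lies beyond the snapshot's index;
    # earlier segments are fully covered by the snapshot and can be skipped.
    return [(first, last) for first, last in segments if last > snapshot_index]

# Using the file names from this thread, with snapshot-1-16384-...:
segments = [
    (7683, 8349),    # fully covered by the snapshot -> not needed
    (15955, 16768),  # overlaps snapshot index 16384 -> loaded
    (16769, 16828),  # entirely after the snapshot  -> loaded
]
print(segments_to_load(16384, segments))
# [(15955, 16768), (16769, 16828)]
```

This also makes the later point concrete: any segment ending at or before index 16384 never needs to be read at startup, which is why the problematic segment could in principle have been ignored.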
I decided to have you delete only the segment files up to the problematic one, because that is the most conservative approach. In theory you could have deleted all segment files except 0000000000015955-0000000000016768 and 0000000000016769-0000000000016828, and you could also have deleted snapshot-1-15360-3022981382 and snapshot-1-15360-3022981382.meta.
The startup logic of the database could be improved: the problematic segment you encountered was technically not needed to start the database and could have been ignored.
If you want, you can send me the problematic segment at mathieu.bordere@canonical.com; when I have some time I could investigate it to try to find out what could have gone wrong when loading it, in case it wasn’t a disk failure.
Thanks @mbordere for the explanation. The concept is clear now.
Regarding the problematic segment, I mentioned above in the post all the information I could collect at the time. Let me know if I can share anything more for your investigation.