OK, so I stepped back through each of the segments, but then I noticed that some of them were written after the problems started yet before the last snapshot, which means the last snapshot itself was taken after the problems started.
Since I figured you were not here, I removed all the segments and the latest snapshot, and copied the previous snapshot over db.bin. I then removed db.bin-wal (not sure what it’s for) and started LXD, and the problem has gone away. I’m not sure how out of date the database is. I am sure there are some snapshots that LXD isn’t aware of, and some deleted ones that it thinks are still there, but I think we can clean that mess up.
I should have mentioned this, because it’s something folks often get confused by: db.bin and db.bin-wal are only there for debugging purposes. LXD never actually reads those files.
They’re a nice way for a normal sqlite3 client to inspect the current state, but the files are regenerated from the snapshots and segments on LXD startup.
The timestamp on that snapshot file should give you an idea of how far back you went.
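If it helps, here is a minimal Python sketch for reading that timestamp. The function name is made up for illustration, and you would pass in the path of whichever snapshot file you kept; nothing here is LXD-specific.

```python
import os
from datetime import datetime, timezone


def file_mtime_utc(path):
    """Return the file's last-modification time as an aware UTC datetime."""
    return datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
```

Comparing that value with the time the problems started gives a rough idea of how much history the older snapshot is missing.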
Look at the segments: it’s quite possible that you do have the segments needed to go from that older snapshot to a recent transaction.
We specifically keep two snapshots around just for cases like this, where one of them somehow suffers some kind of disk corruption.
So depending on how your LXD has been acting lately, it’s quite possible that you could get onto the latest state just from the older snapshot + all the segments.
(This isn’t always the case, as we have configuration in place to avoid using excessive disk space, so we will not retain a large number of segments. If your snapshot is quite old and many segments have been written since, you may not have a path all the way to the latest transaction from an older snapshot.)
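To make the "path from snapshot to latest transaction" idea concrete, here is a rough Python sketch. It assumes closed segment files are named as two zero-padded 16-digit log indexes, "<first>-<last>", and that you already know the log index your snapshot covers. Both are assumptions to verify against your own database directory, and the function names are invented for illustration. Files that don't match that pattern are simply ignored.

```python
import re

# Assumed naming for closed segments: "<first index>-<last index>",
# each zero-padded to 16 digits. Verify this against your own directory.
SEGMENT_RE = re.compile(r"^(\d{16})-(\d{16})$")


def parse_segments(filenames):
    """Return sorted (first_index, last_index) pairs for matching files."""
    ranges = []
    for name in filenames:
        match = SEGMENT_RE.match(name)
        if match:
            ranges.append((int(match.group(1)), int(match.group(2))))
    return sorted(ranges)


def reachable_index(snapshot_index, segments):
    """Highest log index reachable from the snapshot without a gap."""
    reached = snapshot_index
    for first, last in sorted(segments):
        if last <= reached:
            continue  # entirely covered by the snapshot or earlier segments
        if first > reached + 1:
            break  # gap: entries between reached and first are missing
        reached = last
    return reached
```

If the value returned equals the last index of your newest segment, the older snapshot plus the segments form an unbroken path. If it stops earlier, there is a gap of the kind described above.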
Ahh, so just removing the latest snapshot was all I needed to do? I didn’t have to copy anything over?
I have all the production containers up now, so at least the crisis is over. I don’t think there are any issues where data is missing, though I’d imagine there could be cases where a snapshot still exists and LXD isn’t aware of it. Will it clean up after itself?
Possibly, yeah, if you had a viable path from the older snapshot to current state, then just removing the latest snapshot would have worked. If not, LXD would then have complained about missing segments.
For the snapshot data, you probably should compare what you have in ZFS with what you have in LXD and then delete any ZFS snapshots that LXD isn’t aware of.
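A rough Python sketch of that comparison follows. It assumes ZFS snapshot names look like "<pool>/containers/<instance>@snapshot-<name>" and that you have collected LXD's view as "<instance>/<name>" strings; both formats are assumptions to check on your system, and the function names are hypothetical. The ZFS side could come from something like `zfs list -H -t snapshot -o name`.

```python
SNAPSHOT_PREFIX = "snapshot-"  # assumed prefix on LXD-created ZFS snapshots


def zfs_to_lxd_name(dataset):
    """Map 'pool/containers/c1@snapshot-snap0' to 'c1/snap0', else None."""
    if "@" not in dataset:
        return None
    path, snap = dataset.split("@", 1)
    if not snap.startswith(SNAPSHOT_PREFIX):
        return None  # not in the assumed LXD naming scheme, leave it alone
    instance = path.rsplit("/", 1)[-1]
    return f"{instance}/{snap[len(SNAPSHOT_PREFIX):]}"


def compare_snapshots(zfs_datasets, lxd_snapshots):
    """Return (ZFS snapshots unknown to LXD, LXD snapshots missing in ZFS)."""
    on_disk = {}
    for dataset in zfs_datasets:
        name = zfs_to_lxd_name(dataset)
        if name is not None:
            on_disk[name] = dataset
    known = set(lxd_snapshots)
    orphaned = sorted(ds for name, ds in on_disk.items() if name not in known)
    missing = sorted(known - set(on_disk))
    return orphaned, missing
```

The first list holds candidates for manual deletion from ZFS, and the second holds snapshots LXD still believes in but that no longer exist on disk. Review both by hand before deleting anything.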