New LVM behaviour in LXD?

vic-t · March 14, 2022, 11:32am

I just spent 4 days recovering lost containers and lost data…

Shortly before I opened this thread, I switched from the LXD package shipped with Ubuntu 18.04 to the much newer snap version. As mentioned above, my backup workflow stopped working and I didn’t understand why.

Per your confirmation, it should have been possible for me to continue activating the volumes and mounting them. All I needed was to add the -K flag to the lvchange command.
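For context, the activation step looked roughly like the sketch below. The volume group lxdvg, the volume containers_c1 and the mount point /mnt/lxd-backup are made-up names, not the ones from my servers. The script only prints the commands unless RUN=1 is set, so nothing gets activated by accident.

```shell
#!/bin/sh
# Hypothetical names: replace with your own volume group, volume and mount point.
VG="${VG:-lxdvg}"
LV="${LV:-containers_c1}"
MNT="${MNT:-/mnt/lxd-backup}"

# Print each command by default; set RUN=1 to actually execute it (as root).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# -ay activates the volume; -K (--ignoreactivationskip) overrides the
# "skip activation" flag set on the thin volume.
run lvchange -ay -K "$VG/$LV"

# Mount read-only so the backup run does not write to the volume.
run mount -o ro "/dev/$VG/$LV" "$MNT"
```

Without -K, lvchange leaves a volume that carries the activation skip flag inactive, which is why the flag became necessary after the switch.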

And that’s where the problems started. After I did that, I was confronted with a message saying that the transaction_id of the thin pool was not the expected one.

I had no clue that this had something to do with LXD; I really thought something was wrong on my end. When it comes to the internals of LXD, LVM or file systems, I’m really nothing but a user. Which means that at the time I was not even close to understanding where the problem lay.

An odyssey began: hours, then days, spent learning the internals of LXD, LVM and some file systems. And since my backups had stopped working, I had no choice but to try to restore the data somehow from the corrupted media.

Basically, every single time I accessed the LV from the host directly and afterwards started the container in LXD, I would suffer metadata corruption. As mentioned, I had no clue what this even meant and, like most people, I was looking for the quick fix. Which made things a lot worse. No matter which VG I restored or which metadata partition I activated, I was always confronted with errors.

I understand now, of course, how I should have acted, and maybe the whole problem could have been avoided or at least solved a lot quicker. You and Stéphane outlined the problem and its solution in this post: Resize LXD container - LXD - Linux Containers Forum. Instead, I’m still struggling to get my production servers back online.

Needless to say, a word of warning from you would have been nice. It’s probably because I didn’t even mention the word snap that I didn’t get one, so I’m blaming myself. But as mentioned, I really couldn’t even fathom that the difference between the two would be so huge that a workflow I had used for years could lead to data loss.

So, after all that I’ve been through these past few days, I do have to wonder. Considering the potential for something to go really, really wrong just by switching LXD from apt to snap, which I believe is a terribly relevant scenario, why not make lvm.external=true (as in snap set lxd lvm.external=true) the default? Don’t get me wrong, I understand the appeal of running it right from the snap, but from my perspective, I wish this were something I could have changed later, ideally after reading an official article, not just a discussion, about why to do it and what to take note of.

Anyway, it is what it is. Now, since mounting the thin LVs will break my system, what is the solution? How can I access the container from the host system in order to make a full but file-based system backup? Should I really just set lvm.external to true, or is there a better, cleaner way that you recommend?