New LVM behaviour in LXD?

Today, I noticed some rather strange behavior in the interaction between LXD and LVM that I can’t quite understand.

Before, I used to do the following to back up my containers (a rough command sketch follows the list):

  1. I created a snapshot using lxc snapshot. This resulted in a new thin volume in my LVM pool.
  2. I mounted the newly created snapshot.
  3. I backed up the mounted filesystem.
  4. I unmounted the snapshot.
  5. I deleted it using lxc delete.
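
Roughly, the commands looked like this (the VG name vgr, the snapshot LV path and the mount point are only examples from my setup; the exact names may differ):

lxc snapshot c1 backup0                                     # creates a new thin snapshot LV in the pool
mkdir -p /mnt/c1-backup
mount -o ro /dev/vgr/containers_c1-backup0 /mnt/c1-backup   # snapshot LV name is illustrative
tar -czf /srv/backups/c1.tar.gz -C /mnt/c1-backup .         # or rsync, any file-based backup tool
umount /mnt/c1-backup
lxc delete c1/backup0                                       # removes the snapshot and its LV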

For some reason, this doesn’t seem to work for me anymore. When I run lxc snapshot, the LVM volume is created but remains inactive, so I can’t mount it for a backup.
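
From the host, that state can be seen with a plain lvs call (vgr is just my VG name; the attribute characters are standard LVM):

lvs -o lv_name,lv_attr vgr   # 5th attr character: 'a' = active, '-' = inactive; 10th character: 'k' = activation skip flag set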

I also noticed that as soon as I stop a container in LXC, its LVM volume is set to inactive.

Can someone confirm if this is how it’s supposed to be, or is something broken on my server? If this is how it should be, how can I now back up the container volume?

Thanks.

Yes, that is correct; it has been that way for some time actually.
The LVM volumes are deactivated when they are not needed, in the same way ZFS volumes are, so that their devices do not show up on the system.

However, you are free to activate them and mount them if needed.
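
For example, something along these lines (LV name and mount point are illustrative; -K makes lvchange ignore the activation-skip flag set on these volumes):

lvchange -ay -K vgr/containers_c1      # activate despite the activation-skip flag
mount /dev/vgr/containers_c1 /mnt/c1   # add -o ro if you only need to copy data out
umount /mnt/c1                         # when finished
lvchange -an vgr/containers_c1         # deactivate again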

I just spent 4 days recovering lost containers and lost data…

Shortly before I opened this thread, I switched from the local LXD package shipped with Ubuntu 18.04 to the much newer snap version. As mentioned above, my backup workflow would no longer work and I didn’t understand why.

Per your confirmation, it should have been possible for me to keep activating the volumes and mounting them. All I needed was to add the -K flag to the lvchange command.

And that’s where the problems started. After I did that, I was confronted with the message that the transaction_id of the thin pool was not the one expected.

I had no clue that this had anything to do with LXD; I really thought there was something wrong on my end. When it comes to LXD, but also LVM or filesystem internals, I’m really nothing but a user, which means that at the time I was not even close to understanding where the problem lay.

An odyssey began, with hours and days spent learning the internals of LXD, LVM and some filesystems. And since my backups had stopped working, I had no choice but to try and restore the data somehow from the corrupted media.

Basically, every single time I accessed the LV from the host directly and afterwards started the container in LXD, I would suffer metadata corruption. As mentioned, I had no clue what this even meant and, like most people, I was looking for a quick fix, which made things a lot worse. No matter which VG I restored or which metadata partition I activated, I was always confronted with errors.

I understand now, of course, how I should have acted, and maybe the whole problem could have been avoided, or at least solved a lot quicker. You and Stéphane outlined the problem and its solution in this post: Resize LXD container - LXD - Linux Containers Forum. Instead, I’m still struggling to get my production servers back online.

Needless to say, a word of warning from you would have been nice. It’s probably because I didn’t even mention the word snap that I didn’t get one, so I’m blaming myself. But as mentioned, I really couldn’t fathom that the difference between the two would be so huge that a workflow I had used for years could lead to data loss.

So, after all that I’ve been through these past few days, I do have to wonder. Considering the potential for something to go really, really wrong just by switching from apt to snap for LXD, which I believe is a very relevant scenario, why not make snap set lxd lvm.external=true the default? Don’t get me wrong, I understand the appeal of running the LVM tools right from the snap, but from my perspective, I wish this were something I could have changed later, ideally after reading an official article, not just a discussion, about why to change it and what to take note of.

Anyway, it is what it is. Now, since mounting the thin LVs will break my system, what is the solution? How can I access the container from the host system in order to make a full, file-based backup? Should I really just set lvm.external to true, or is there a better, cleaner way that you recommend?

I’m afraid it didn’t come to mind, as we weren’t talking about an upgrade from a very old non-snap version to the current version. There have been many prior versions of LXD in the snap package that didn’t include the deactivation behavior, so I thought you were most likely coming from one of those.

When using snap set lxd lvm.external=true, do you still get the error when activating the LVs?

Honestly, I think it’s just too late for that. I took so many steps trying to correct the issue directly at the LV and filesystem level that I think I messed it up entirely. What I didn’t mess up, fsck did.

What I did now is create a second thin pool and add it to LXD, and I’m trying to recreate all containers from scratch, moving over the data I could salvage.

Is there any downside to this setting? Can anything happen to my new containers that were created using the LXD snap? What’s your recommendation on this whole topic anyway?

@tomp I’m further trying to understand: since LXD takes over responsibility for managing the storage pool (which is something I welcome), how are the sizes of both volumes and storage pools going to be managed?

For volumes, I understand that I can just set a new size for the root volume using lxc config device set myCont root size XGB. What about the thin pools that hold those volumes? Are they automatically resized as the volumes need more space? If not, how do I do this without using LVM tools?
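
To make the first part concrete, this is the kind of command I mean (container name and size are just examples, and this assumes the container has its own root disk device entry):

lxc config device set c1 root size=20GiB   # grow the root disk of container c1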

@tomp @stgraber Apologies for the urgency, but I’m going crazy here.

After wasting 4 days salvaging parts of my data somehow, today, on a fresh disk, I created a fresh PV, a fresh thin pool and fresh containers.

Now, I decided to resize my PV because I increased the size of the underlying partition. I don’t see any way of doing this using LXD commands, so I ran a simple 'pvresize /dev/sdc2'. This automatically increased the size of the VG, too.
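
For the record, this is what I ran and how I checked the result (device and VG names are from my setup):

pvresize /dev/sdc2   # grow the PV to fill the enlarged partition
pvs /dev/sdc2        # confirm the new PV size
vgs vgr              # the VG now shows the extra free space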

And now, when I try to start one of the containers in the thin pool, I get the following message, which already hurts just to look at:

root@sb32 ~ # lxc start c1
Error: Failed to activate LVM logical volume "/dev/vgr/containers_c1": Failed to run: lvchange --activate y --ignoreactivationskip /dev/vgr/containers_c1: Thin pool vgr-lvpool-tpool (253:23) transaction_id is 11, while expected 10.
Try `lxc info --show-log c1` for more info

I’m going nuts, really. Could you please help me solve this?

I have a backup of the VG from just before I ran pvresize, but I’m afraid that if I do a vgcfgrestore, I’ll go down that rabbit hole again.

Is this with snap set lxd lvm.external=true set?

It wasn’t turned on when I ran the pvresize. But now, whether it’s turned on or not, I get the same error.

Can you try it with snap set lxd lvm.external=true set from the outset of PV creation? It may be some incompatibility between the old host LVM tools and the ones inside the snap.

I’m sorry, I don’t understand what you mean. Can you elaborate?

Sorry, you said you had created a new LXD thin pool on a PV, so can you try creating that again, but with snap set lxd lvm.external=true applied while you do it?

OK, but what about my data? I spent a day configuring this container. I need to be able to get it back somehow…

I was thinking of creating a fresh pool to experiment with.

More than happy to experiment with you later, but I have production data on this container… :confused:

What I don’t understand: the PV was created from the host system, and so was the VG. Only the thin pool was created by LXD. Why is LXD bothered by changes to the PV? It’s strange…

With snap set lxd lvm.external=true:

pvcreate /dev/sdc4 #success
vgcreate vgtest /dev/sdc4 #success
lxc storage create testpool lvm source=vgtest lvm.vg.force_reuse=true lvm.use_thinpool=true lvm.thinpool_name=lvthintest #success
lxc launch ubuntu-minimal:20.04 c1 -s testpool #fails

Name: c1
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/03/14 17:26 CET
Last Used: 2022/03/14 17:26 CET

Log:

lxc c1 20220314162659.528 ERROR    utils - utils.c:lxc_can_use_pidfd:1792 - Kernel does not support pidfds
lxc c1 20220314162659.540 WARN     conf - conf.c:lxc_map_ids:3592 - newuidmap binary is missing
lxc c1 20220314162659.541 WARN     conf - conf.c:lxc_map_ids:3598 - newgidmap binary is missing
lxc c1 20220314162659.552 WARN     conf - conf.c:lxc_map_ids:3592 - newuidmap binary is missing
lxc c1 20220314162659.552 WARN     conf - conf.c:lxc_map_ids:3598 - newgidmap binary is missing
lxc c1 20220314162659.365 ERROR    start - start.c:start:2164 - No such file or directory - Failed to exec "/sbin/init"
lxc c1 20220314162659.365 ERROR    sync - sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 7)
lxc c1 20220314162659.371 WARN     network - network.c:lxc_delete_network_priv:3617 - Failed to rename interface with index 0 from "eth0" to its initial name "veth84606965"
lxc c1 20220314162659.371 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:877 - Received container state "ABORTING" instead of "RUNNING"
lxc c1 20220314162659.371 ERROR    start - start.c:__lxc_start:2074 - Failed to spawn container "c1"
lxc c1 20220314162659.371 WARN     start - start.c:lxc_abort:1045 - No such process - Failed to send SIGKILL to 4096
lxc c1 20220314162704.515 WARN     conf - conf.c:lxc_map_ids:3592 - newuidmap binary is missing
lxc c1 20220314162704.515 WARN     conf - conf.c:lxc_map_ids:3598 - newgidmap binary is missing
lxc 20220314162704.519 ERROR    af_unix - af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20220314162704.519 ERROR    commands - commands.c:lxc_cmd_rsp_recv_fds:127 - Failed to receive file descriptors for command "get_state"

It’s getting worse before it gets better…

I never had these issues before, by the way. I was creating containers like there’s no tomorrow today. This is a new one, related to the lvm.external=true setting, I can only guess…

I would really appreciate some support with this, I’m desperate.

OK, I spoke to @stgraber.

We suggest that you run snap set lxd lvm.external=true and then reboot (this will ensure that the LVM metadata cache is in sync between the host and the snap).

Then create your new PV and volume group again.
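
In other words, roughly this (reusing the device and names from the earlier attempt; adjust to your setup):

snap set lxd lvm.external=true   # have the snap use the host's LVM tools
reboot                           # ensures host and snap agree on the LVM metadata
pvcreate /dev/sdc4
vgcreate vgtest /dev/sdc4
lxc storage create testpool lvm source=vgtest lvm.vg.force_reuse=true lvm.use_thinpool=true lvm.thinpool_name=lvthintest
lxc launch ubuntu-minimal:20.04 c1 -s testpool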

Ok, will do that.