Error removing LVM logical volume

Can you try nsenter -t 91906 -m umount /var/snap/lxd/common/lxd/storage-pools/secondary/containers/lxcc39dbb31?

I rebooted the system as suggested and am trying to reproduce the bug right now by starting and stopping the container.
As soon as it hangs again, I'll let you know.

So far the script has restarted the container 120 times with no issue yet.
I will check whether I can find a stuck container somewhere else and post the output when I get to it.

The script restarted the container the entire night, still no issue.
Either I am unlucky or the bug needs something else to trigger it.

I think something else is triggering it, and the issue only becomes visible when you shut the container down.

The script ran for another day; I still can't reproduce the error.
I will check for another stuck container and post the debug output here shortly.

Maybe related to:

I found another container:
https://pastebin.com/BgAFK1t9

@stgraber

Can you try:

nsenter -t 1471 -m umount /var/snap/lxd/common/shmounts/storage-pools/primary/containers/lxc14fc3901

That doesn't seem to work.
https://pastebin.com/raw/dxnVfaQf

Can you show cat /proc/1471/mountinfo?

Sure
https://pastebin.com/raw/GG5UDYDj

Something caused some repeated overmounting of the shmounts directory somehow…
Can you show journalctl -u snap.lxd.daemon -n 500 as well as snap changes?
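
For context, a quick way to spot this kind of repeated overmounting in a mountinfo dump is to count duplicate mount points; a rough sketch, assuming the PID 1471 from above:

awk '{print $5}' /proc/1471/mountinfo | sort | uniq -c | sort -rn | head

Any mount point that shows up more than once has been mounted over repeatedly.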

Yea.
https://pastebin.com/raw/kYxDujPg

The journalctl output looks significantly shorter than the requested 500 lines.

That’s all it returns.

That’s annoying, as it doesn’t give us any real history to see what happened.

It shows an issue with namespace management which then caused over-mounting…
We’ve seen this on and off but haven’t yet found a reliable reproducer, which makes it nearly impossible to track down.

I have 9 nodes in this project and could find the issue on 2 of them so far.
One node was rebooted, two were not affected, and on one I found what I posted above.

I will check later whether I can find an affected container on the remaining nodes; I am pretty sure I will.

All of them run the same LVM backend setup, but according to other posts it may not be related to LVM, since one user reported it with ZFS.

Yeah, this issue has mostly been seen on ZFS, likely because we have far more ZFS users than LVM.

Our best guess is that it has to do with a system having running containers, combined with an update to some of the core snaps and a LXD snap update; some combination of this then results in a mount namespace configuration which our reshuffling tool can’t deal with, leading to the error we can see in your journal.

Unfortunately, fixing this properly will require being able to reproduce the issue at will so we can significantly increase the debugging in the reshuffling tool and take detailed dumps of all the mount tables at play.

In your case, a workaround for that one system would likely be:

  • nsenter -t 1471 -m umount -l /var/snap/lxd/common/shmounts
  • nsenter -t 1471 -m umount -l /var/snap/lxd/common/shmounts
  • nsenter -t 1471 -m umount -l /var/snap/lxd/common/shmounts/storage-pools/primary/containers/lxc14fc3901

That would undo the two levels of overmounting and then unmount the hidden mount.
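
To double-check that the cleanup worked, you can re-inspect that process’s mount table afterwards, for example:

grep shmounts /proc/1471/mountinfo

If the lazy unmounts went through, the duplicate shmounts entries should be gone.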

So basically the workaround would be: disable all LXD-related auto-updates via snapd.
Reboot after a snapd update, and subscribe to your “newsletter” so we know when and what to patch.
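
For reference, snapd has a system-wide hold option for automatic refreshes; a sketch of what that could look like (exact behaviour depends on the snapd version, and the date is just an example):

sudo snap set system refresh.hold="2021-12-31T00:00:00Z"
snap get system refresh.hold

That defers automatic refreshes until the given date (snapd caps how far ahead the hold can go), and updates can then be applied manually with snap refresh lxd when it suits you.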

I found a few more nodes:

Also, journalctl itself seems to be broken.
journalctl -u snap.lxd.daemon -S 2021-9-1
Returns data from September 2021, but not 500 lines.

journalctl -u snap.lxd.daemon -n 500
Returns data from May 2021, also not 500 lines.

There seems to be more log data, but I can’t get it printed.
I know there is data from September 29, but it doesn’t show it to me.

-r for reverse does give me the current entries, but never 500 lines; more like a hard-coded 50.
I would like to give you the data, but the system won’t give it to me.
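
For what it’s worth, a quick way to see how much journal data is actually available and to capture the full history for that unit in one go (a sketch; lxd-journal.txt is just an example filename):

journalctl --disk-usage
journalctl -u snap.lxd.daemon --no-pager > lxd-journal.txt
wc -l lxd-journal.txt

Dumping to a file bypasses any pager truncation and shows exactly how many lines journald has retained for the unit.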

I found this on way more machines than expected and have rebooted nearly everything by now.
snap is a plague; please add support for deb packages.

Found another one, despite running manual updates.
I had blocked snapd and ran manual updates every now and then, but snapd does not seem to be the cause; the issue still appears. Looks like my assumption was wrong.