lxc copy seems to be stuck

I have a small LXD server deployed on Ubuntu Server 20 with snap-installed LXD 4.0.8. It was working perfectly well, but today it started to show some strange behaviour.

The existing containers work just fine, but “lxc copy container1 container2” never finishes. It just sits there and appears to do nothing.
If I try to ^C it, I get the warning “This operation can’t be cancelled (interrupt two more times to force)”.
I stopped the process with a double ^C and attempted to reboot the system. My ssh session went down, but the system remained pingable yet inaccessible for at least 20 minutes, and I had to power-cycle it to reboot. So the problem seems to block something at the system level.

The reboot didn’t solve anything: the same never-ending “lxc copy”. Nothing suspicious in the logs either. In fact, according to the logs the container was copied successfully, but when I attempted to start it, it failed…

Again, the already-created containers are working just fine (I can apply different profiles and so on), but for some reason I can’t copy…

Has anyone observed something like this and/or knows how to fix it?
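For reference, these are the kinds of commands I used to dig into the failed start (a sketch; container2 is just the copy target name from above, and output will obviously vary):

```shell
# Instance-level log for the container that failed to start
lxc info --show-log container2 2>/dev/null || true

# Recent LXD daemon log (snap installation)
snap logs lxd -n 50 2>/dev/null || true
```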

You’d want to look at dmesg and ps fauxww to see if anything is misbehaving at the kernel level or if any of the migration sub-processes are still running.

The type of storage pool used would also be good to know.
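A sketch of those checks (the grep patterns are only illustrative, not exhaustive):

```shell
# Kernel-level trouble (hung tasks, blocked I/O) usually shows up in dmesg
dmesg 2>/dev/null | grep -iE 'hung task|blocked for more than|i/o error' | tail -n 20

# Any leftover migration helpers still running under the lxd daemon?
ps fauxww | grep -E 'zfs (send|recv)|lxc copy' | grep -v grep

# Storage pool configuration, if the lxc client is available
command -v lxc >/dev/null && lxc storage show default || true
```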

dmesg has a number of entries like

[ 5680.121956] audit: type=1400 audit(1637853252.963:51): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-proto12_</var/snap/lxd/common/lxd>" pid=46607 comm="apparmor_parser"
[ 5680.248276] physsOydz3: renamed from maceccfe2f2
[ 5680.296383] physrNgGlB: renamed from veth86c23cd1
[ 5685.394288] device vethaf8b2ae4 left promiscuous mode
[ 5685.394375] lxdbr0: port 3(vethaf8b2ae4) entered disabled state
[ 5687.076554] audit: type=1400 audit(1637853259.919:52): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="lxd-proto12_</var/snap/lxd/common/lxd>" pid=46726 comm="apparmor_parser"

ps fauxww shows nothing unusual. No zombies or dead processes. The “lxc copy” process is reported in state “Sl+”.

The storage is plain vanilla ZFS on a local file:

lxc storage show default

config:
  size: 5GB
  source: /var/snap/lxd/common/lxd/disks/default.img
  zfs.pool_name: default
description: ""
name: default
driver: zfs

No zfs send or zfs receive processes running?
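A quick way to check, as a sketch (`pgrep -af` prints the PID and full command line of every match):

```shell
# Any zfs send/receive processes still running?
pgrep -af 'zfs (send|receive)' || echo "no zfs send/receive processes found"
```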

On closer inspection this morning I noticed 4 zfs processes running under lxd’s process tree: 2 zombies (“[zfs] ”) and 2 seemingly sleeping processes like “zfs send -R default/containers/proto@copy-14f22415-154f-45d2-a610-e592cde613c3”.
This is strange, as I didn’t configure anything special for these 2 containers, and the other running containers don’t seem to have a zfs copy associated with them…

Deeper inspection showed that the ZFS pool was almost full, with barely any space left for new containers. After deleting some unused containers, lxc copy started to work as expected.
Thank you @stgraber for pointing me in the right direction!


Ah yeah, ZFS tries very hard never to fail; instead it will queue or slow down I/O significantly when the pool is getting full (the same happens if you’re getting close to hitting a quota).