Process '/usr/bin/unshare -m /usr/bin/snap auto-import....' failed with exit code 1 LXD 4.0.9 LTS snap

FireLXC · November 27, 2022, 2:37am

Hey,

I recently ran into the following error on one of my systems while starting a virtual machine:
"Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/dm-6' failed with exit code 1."

Failed to run: forklimits limit=memlock:unlimited:unlimited -- /snap/lxd/23991/bin/qemu-system-x86_64 -S -name lxc675d97aa -uuid 23249d3e-222f-400d-9bcc-cbd19138896c -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/lxc675d97aa/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/lxc675d97aa/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/lxc675d97aa/qemu.pid -D /var/snap/lxd/common/lxd/logs/lxc675d97aa/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: : Process exited with non-zero value -1

Ubuntu 22.04 with LXD 4.0.9 LTS, up to date.

So far I have not seen this bug on my other machines, even though they also run Ubuntu 22.04 with LXD 4.0 LTS.
Containers are not affected, and virtual machines that were already running were not affected either.

I googled a bit and found someone else reporting it over 6 months ago:

It's marked as high but still not solved.
The two workarounds that were suggested do not seem to work at all.

I also checked dmesg for anything that could indicate the root cause, but found nothing as far as I can tell.

Does anybody have a suggestion on this? Thanks.

tomp · November 28, 2022, 9:20am

Do you see this with a generally supported version of LXD, such as LXD 5.0 LTS (available in the 5.0/stable snap channel)?

As LXD 4.0 LTS is only in security support now, see:

FireLXC · November 28, 2022, 4:46pm

Are you certain that would fix the issue, or is it just a suggestion to try LXD 5.0 LTS?
Do you have a detailed changelog for breaking changes, especially regarding the API?

I found only a few mentions, including the TLS 1.3 change, and those should not be an issue.

“As for LXD 4.0, we’ll be releasing one last bugfix release as 4.0.10”
The current release is 4.0.9, so according to this there will still be bugfixes.

tomp · November 29, 2022, 8:30am

Yes, perhaps, although AFAIK we have no timeline for that release currently (if it does occur).
It will still only contain critical bug fixes and so will not contain all of the fixes that the 5.0 LTS series receives.

I’m not certain it will fix it, but there have been a lot of changes to the VM support in the 5.0 series so it would be my first step anyway.

FireLXC · November 30, 2022, 5:05am

The upgrade to 5.x LTS worked nearly flawlessly; however, the network seems to be broken.
Containers work again. VMs are running but have no IPv4, and even IPv6 seems to be affected despite being assigned.

I guess there have been some breaking changes.
I need to look into that.

tomp · November 30, 2022, 8:41am

Please show ip a and ip r from host and inside one of the problem VMs.
Also please show lxc config show <instance> --expanded for one of the VMs as well as lxc network show <network> for the relevant network. Thanks
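
The outputs requested above could be gathered along these lines. This is only a rough sketch: the instance name vm1 and the network name lxdbr0 are placeholders, and the function just prints the commands so the list can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch only: prints the diagnostic commands requested above for one VM.
# "vm1" and "lxdbr0" are placeholder names; substitute your own instance and network.
diag_commands() {
  instance="$1"
  network="$2"
  printf '%s\n' \
    "ip a" \
    "ip r" \
    "lxc exec $instance -- ip a" \
    "lxc exec $instance -- ip r" \
    "lxc config show $instance --expanded" \
    "lxc network show $network"
}

# Print the commands; pipe the output into "sh -x" on the LXD host to run them.
diag_commands "${1:-vm1}" "${2:-lxdbr0}"
```

Note that the lxc exec lines only work once the LXD agent inside the VM is running; if it is not, the same ip commands can be typed in via lxc console instead.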

FireLXC · December 2, 2022, 12:14am

I checked the syslog again: the same error is there, but the VMs are running now.
Regarding the IPs, for some reason it didn’t really work when I first tested it; now they are assigned nearly instantaneously. Good, but I have no idea why.

Maybe the DHCP server was not giving out leases.

FireLXC · December 3, 2022, 12:56pm

Correction: since upgrading to 5.0 LTS, something causes a freshly created VM to hang for minutes.

The VM gets an IPv6 assigned, IPv4 remains empty, IPv6 isn’t pingable either.
This wasn’t the case with 4.0, idk yet what is causing this.

Once whatever makes it hang is done, it works normally.
Trying to execute any command via the LXD agent fails because the agent is not running, so maybe the machine hasn’t even tried to boot yet?

https://pastebin.com/raw/L3esAKx7

FireLXC · December 3, 2022, 2:00pm

root@node:~# lxc console container
To detach from the console, press: <ctrl>+a q

distrobuilder-0bda836b-c141-4b0f-bfad-06af1181f93d login:
distrobuilder-0bda836b-c141-4b0f-bfad-06af1181f93d login:

Interesting.
edit: Debian 11

tomp · December 3, 2022, 5:26pm

Probably because the first time it launched, qemu didn’t start and the templates didn’t apply. Do fresh VMs start OK on 5.0.1?

tomp · December 3, 2022, 5:27pm

Btw, we now know what the issue was with the LXD 4.0 LTS snap, see: Cant start VM " Failed to run: forklimits limit=memlock:unlimited:unlimited" - #19 by tomp

FireLXC · December 3, 2022, 7:02pm

I always tested it with a fresh virtual machine.
However, I fail to reproduce this on my local LXD instance.

Plus, it seems to happen randomly on the target machine after upgrading to 5.0.
I can’t reproduce it either on the other LXD instances that I upgraded to 5.0 because 4.0.9 broke on them with the same error.

Perfect clusterfuck.

Any chance that a locally stored cloud image still needs to fetch resources from the interwebs on first boot? Something that could time out and break?

tomp · December 3, 2022, 9:27pm

What image are you using?

FireLXC · December 3, 2022, 9:54pm

I just reproduced it with Debian 11, followed by Debian 10 (x86 cloud images, fresh from today), which I thought worked before.
I don’t think this is related to the original issue here; it is a separate issue.

The reason I ask: I used to just let LXD handle the images, so it downloaded them on demand from the public mirrors. However, this used to suck, especially from remote locations such as South Africa or South America. Hence the question came up again of whether these cloud images fetch anything remotely, despite being downloaded locally.

tomp · December 3, 2022, 9:57pm

I don’t know what you mean by cloud images.

FireLXC · December 3, 2022, 10:04pm

https://uk.lxd.images.canonical.com/ Variant: cloud

tomp · December 3, 2022, 10:05pm

So you don’t have internet access? I’m not getting your point I’m afraid.

FireLXC · December 3, 2022, 10:14pm

Of course there is internet connectivity; however, the connectivity to Europe and America isn’t good.
Most of the time when I tested, I got 50-100kb/sec to the LXD image servers, sometimes more. Hence I store the images locally for a more reliable deployment.

Downloading some images can take up to 60 minutes if I am absolutely unlucky.
Hence my question: do the cloud images download anything?

Because this could explain the randomly failed deployments.
I am mainly using the cloud images because of cloud-init, to resize the disk before boot etc.

Given their size, I thought they don’t really download anything, but I could be wrong.

tomp · December 3, 2022, 10:20pm

So if the images are downloaded then instances can be created from them ok. But it sounds like something inside the image is intermittently not starting up.

If you’re wondering if that’s due to the limited connectivity, then I would check 2 things.

First use tcpdump -i lxdbr0 -nn to see if there is traffic going to or from the instance to the external network.

Also try using the non-cloud image variant to try and zero in on the issue.
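
A rough sketch of that comparison follows, under a few assumptions: the bridge is lxdbr0 as in the tcpdump command above, the images: remote provides debian/11 and debian/11/cloud aliases, and the instance names test-cloud and test-default are made up for illustration. By default the script only prints what it would run; setting DRY_RUN=0 executes the commands.

```shell
#!/bin/sh
# Sketch: launch one VM per image variant so their first boots can be compared.
# Assumptions: image aliases images:debian/11 and images:debian/11/cloud exist;
# test-cloud and test-default are hypothetical instance names.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Start the capture in a second terminal BEFORE launching, so first-boot
# traffic is not missed:
#   tcpdump -i lxdbr0 -nn

run lxc launch images:debian/11/cloud test-cloud --vm
run lxc launch images:debian/11 test-default --vm

# Check which of the two VMs gets its addresses, and how quickly.
run lxc list
```

If only the cloud variant hangs while the capture shows outbound traffic from it during boot, that would point towards something inside the image waiting on the slow link.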