Can’t start VM: "Failed to run: forklimits limit=memlock:unlimited:unlimited" (LXD 4.0.9 LTS snap)

vvolas · December 1, 2022, 8:23pm

LXD version 4.0.9 (snap) on Ubuntu 18, clustered. This originally happened with a Windows VM; I have now created a test Ubuntu VM as well.

Basically, the VM was rebooted after half a year of uptime and can no longer start. I tried updating and rebooting one server, but that didn’t help.

I had a hunch that this might be related to bridged networking, as in "Bridged networking on Ubuntu Server with systemd-networkd instead network-manager?" (#6 by gunterze, Multipass, Ubuntu Community Hub):

Dec 01 20:21:37 blazar-linux.int.o4.lt systemd-networkd[270475]: tap429e8661: Link DOWN
Dec 01 20:21:37 blazar-linux.int.o4.lt networkd-dispatcher[298143]: ERROR:Unknown interface index 104 seen even after reload
Dec 01 20:21:37 blazar-linux.int.o4.lt networkd-dispatcher[298143]: WARNING:Unknown index 104 seen, reloading interface list
Dec 01 20:21:37 blazar-linux.int.o4.lt networkd-dispatcher[298143]: ERROR:Unknown interface index 104 seen even after reload
Dec 01 20:21:37 blazar-linux.int.o4.lt networkd-dispatcher[298143]: WARNING:Unknown index 104 seen, reloading interface list
Dec 01 20:21:37 blazar-linux.int.o4.lt networkd-dispatcher[298143]: ERROR:Unknown interface index 104 seen even after reload
Dec 01 20:21:37 blazar-linux.int.o4.lt networkd-dispatcher[298143]: WARNING:Unknown index 104 seen, reloading interface list
Dec 01 20:21:37 blazar-linux.int.o4.lt networkd-dispatcher[298143]: ERROR:Unknown interface index 104 seen even after reload

So I later removed the network profile, but the problem persists.

lxc start ubuntu
Error: Failed to run: forklimits limit=memlock:unlimited:unlimited -- /snap/lxd/23991/bin/qemu-system-x86_64 -S -name ubuntu -uuid 44c9b9b0-7bca-4ccd-962d-5979b377907c -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/ubuntu/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/ubuntu/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/ubuntu/qemu.pid -D /var/snap/lxd/common/lxd/logs/ubuntu/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: : Process exited with non-zero value -1

lxc info --show-log ubuntu doesn’t show any logs.

[Thu Dec 1 19:57:25 2022] rbd10: p1 p14 p15
[Thu Dec 1 19:57:25 2022] rbd: rbd10: capacity 35000000000 features 0x1
[Thu Dec 1 19:57:25 2022] rbd: rbd11: capacity 104857600 features 0x1
[Thu Dec 1 19:57:25 2022] EXT4-fs (rbd11): mounted filesystem with ordered data mode. Opts: discard
[Thu Dec 1 19:57:25 2022] ext4 filesystem being mounted at /var/snap/lxd/common/lxd/storage-pools/ceph-lxd/virtual-machines/ubuntu supports timestamps until 2038 (0x7fffffff)
[Thu Dec 1 19:57:25 2022] audit: type=1400 audit(1669925522.472:3156): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="lxd-ubuntu_</var/snap/lxd/common/lxd>" pid=394022 comm="apparmor_parser"
[Thu Dec 1 19:57:25 2022] audit: type=1326 audit(1669925522.532:3157): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=394023 comm="qemu-system-x86" exe="/snap/lxd/23991/bin/qemu-system-x86_64" sig=31 arch=c000003e syscall=56 compat=0 ip=0x7f6ae64f7f3f code=0x80000000

Dec 01 20:18:03 blazar-linux.int.o4.lt systemd[3111092]: Started snap.lxd.lxc.236d5c2f-44bc-4b4f-b590-12720e906c2d.scope.
Dec 01 20:18:03 blazar-linux.int.o4.lt kernel:  rbd10: p1 p14 p15
Dec 01 20:18:03 blazar-linux.int.o4.lt kernel: rbd: rbd10: capacity 35000000000 features 0x1
Dec 01 20:18:03 blazar-linux.int.o4.lt kernel: rbd: rbd11: capacity 104857600 features 0x1
Dec 01 20:18:03 blazar-linux.int.o4.lt kernel: EXT4-fs (rbd11): mounted filesystem with ordered data mode. Opts: discard
Dec 01 20:18:03 blazar-linux.int.o4.lt kernel: ext4 filesystem being mounted at /var/snap/lxd/common/lxd/storage-pools/ceph-lxd/virtual-machines/ubuntu supports timestamps until 2038 (0x7fffffff)
Dec 01 20:18:04 blazar-linux.int.o4.lt audit[397773]: AVC apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="lxd-ubuntu_</var/snap/lxd/common/lxd>" pid=397773 comm="apparmor_parser"
Dec 01 20:18:04 blazar-linux.int.o4.lt kernel: audit: type=1400 audit(1669925884.044:3164): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="lxd-ubuntu_</var/snap/lxd/common/lxd>" pid=397773 comm="apparmor_parser"
Dec 01 20:18:04 blazar-linux.int.o4.lt audit[397774]: SECCOMP auid=4294967295 uid=0 gid=0 ses=4294967295 pid=397774 comm="qemu-system-x86" exe="/snap/lxd/23991/bin/qemu-system-x86_64" sig=31 arch=c000003e syscall=56 compat=0 ip=0x7f103bf32f3f code=0x80000000
Dec 01 20:18:04 blazar-linux.int.o4.lt kernel: audit: type=1326 audit(1669925884.108:3165): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=397774 comm="qemu-system-x86" exe="/snap/lxd/23991/bin/qemu-system-x86_64" sig=31 arch=c000003e syscall=56 compat=0 ip=0x7f103bf32f3f code=0x80000000

tomp · December 1, 2022, 9:03pm

Do you get the same problem if you refresh to the 5.0/stable snap channel?

Mantas · December 2, 2022, 8:56am

We can’t update LXD from 4.0.9-eb5e237 (23991 4.0/stable) to 5.0/stable because this is a production server with lots of virtual machines and containers running, and we can’t allow downtime for the other VMs/containers.
Do you have any other suggestions for fixing this important issue without refreshing LXD to 5.0 or restarting the server?

tomp · December 2, 2022, 10:41am

What does snap info lxd show?

I don’t think that LXD version has been updated for a long time, which then suggests something else has changed on that server.

But let’s check.

tomp · December 2, 2022, 10:45am

Also just as a side note, the LXD 4.0 LTS series is only receiving security bug fixes now, so for continued general bug fix/environmental change support you need to be running the LXD 5.0 LTS series.

See Managing the LXD snap for more info about the different snap channels.

tomp · December 2, 2022, 10:45am

Do you know of anything that has changed on that server recently? Any updates applied?

tomp · December 2, 2022, 10:48am

Can new VMs be launched?

tomp · December 2, 2022, 11:00am

Looking at the snap change log it seems there were some cherry picks of dependency updates 10 days ago on the 22nd Nov, and the latest 4.0/stable package was built on 25th Nov so would include those changes.

I suspect the qemu change is the most likely candidate for the breakage.

I’ll have a look and see if I can recreate.

Please can you provide lxc config show <instance> --expanded and lxc storage show <pool> and lxc network show <network> for the relevant instance, pool and network.

Thanks

tomp · December 2, 2022, 11:19am

I just recreated the same issue on the LXD 4.0/stable channel:

snap install lxd --channel=4.0/stable
lxd init --auto
lxc launch images:ubuntu/focal v1 --vm
Creating v1
Starting v1
Error: Failed to run: forklimits limit=memlock:unlimited:unlimited -- /snap/lxd/23991/bin/qemu-system-x86_64 -S -name v1 -uuid d44bab24-1a63-4f5a-b072-d5eef160a1aa -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/v1/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/v1/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/v1/qemu.pid -D /var/snap/lxd/common/lxd/logs/v1/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: : Process exited with non-zero value -1
Try `lxc info --show-log local:v1` for more info

lxc info --show-log local:v1
Name: v1
Location: none
Remote: unix://
Architecture: x86_64
Created: 2022/12/02 11:17 UTC
Status: Stopped
Type: virtual-machine
Profiles: default

Log:

tomp · December 2, 2022, 11:22am

What does sudo snap changes show for LXD? I wonder if we can get a past revision.

Mantas · December 2, 2022, 11:30am

New VMs can’t be launched either; they fail with the same error.
We have the latest 4.0.9 release from the 4.0/stable channel (updated about a week ago):
snap list |grep lxd
lxd 4.0.9-eb5e237 23991 4.0/stable canonical** in-cohort

lxc config show w10-terminal-ssd --expanded

architecture: x86_64
config:
  boot.autostart: "true"
  boot.autostart.priority: "195"
  limits.cpu: "8"
  limits.memory: 24GB
  security.secureboot: "false"
  volatile.eth0.hwaddr: 00:16:3e:8d:2a:6e
  volatile.last_state.power: STOPPED
  volatile.uuid: ccfe6235-1186-446c-9e15-a28ee1b2a21a
  volatile.vsock_id: "662"
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: ssd
    size: 240GB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

lxc storage show ssd

config: {}
description: ""
name: ssd
driver: btrfs
used_by:
- /1.0/instances/vartai2ssd
- /1.0/instances/w10-terminal-ssd
- /1.0/instances/w10-unsql-ssd
- /1.0/profiles/default
- /1.0/profiles/no_net
status: Created
locations:
- paralel-linux
- universe-linux
- cluster-linux
- blazar-linux

lxc network show br0

config: {}
description: ""
name: br0
type: bridge
used_by:
- /1.0/instances/gitlab
- /1.0/instances/nextcloud-dev
- /1.0/instances/w10-grybas
- /1.0/instances/w10-terminal-ssd
- /1.0/instances/wiki
- /1.0/instances/zabbix
- /1.0/profiles/ceph-hdd
- /1.0/profiles/ceph-ssd
- /1.0/profiles/default
- /1.0/profiles/default_hdd
managed: false
status: ""
locations: []

snap changes
no changes found

snap changes lxd
no changes found

tomp · December 2, 2022, 11:42am

What does this show?

snap list lxd  --all

Mantas · December 2, 2022, 11:43am

I have a test server (not connected to the LXD cluster) with the latest LXD 4.0.9 from the 4.0/stable channel, and VMs don’t start there either. I then refreshed LXD on the test server to the 5.0/stable channel, and the issue is fixed in the latest LXD 5.0! Here is the snap changes output:
mantas@neutron-star:/# snap changes
ID   Status  Spawn                     Ready               Summary
109  Error   8 days ago, at 08:17 UTC  today at 07:21 UTC  Auto-refresh snap "lxd"
110  Done    today at 07:21 UTC        today at 09:16 UTC  Auto-refresh snaps "lxd", "snapd"
111  Done    today at 09:17 UTC        today at 09:18 UTC  Remove "lxd" snap
113  Done    today at 09:20 UTC        today at 09:20 UTC  Install "lxd" snap from "4.0/stable" channel
114  Done    today at 11:33 UTC        today at 11:33 UTC  Refresh "lxd" snap from "5.0/stable" channel

tomp · December 2, 2022, 11:44am

Yes, I expected 5.0 LTS would work, as it’s similar to (or the same as) this issue:

Mantas · December 2, 2022, 11:44am

snap list lxd --all

Name  Version        Rev    Tracking    Publisher   Notes
lxd   4.0.9-8e2046b  22753  4.0/stable  canonical✓  disabled,in-cohort
lxd   4.0.9-eb5e237  23991  4.0/stable  canonical✓  in-cohort

tomp · December 2, 2022, 11:46am

Ah OK, so you could start by increasing the number of revisions kept so you don’t lose a working one:

sudo snap set system refresh.retain=n

Then try:

sudo snap revert lxd --revision <revision number>

As you’re running the LTS series, reverting should be possible, as we don’t include DB/API schema changes that would prevent reverting.
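
To pick the revision number for the revert, the `snap list lxd --all` output posted above can simply be read by hand; the following Python sketch does the same thing programmatically. It assumes the column layout shown in this thread (Name, Version, Rev, Tracking, Publisher, Notes) and treats a revision whose Notes contain `disabled` as an installed but inactive candidate. The function name `disabled_revisions` is hypothetical, not part of snapd.

```python
from typing import List


def disabled_revisions(snap_list_output: str) -> List[int]:
    """Return revisions marked 'disabled' in `snap list <name> --all` output.

    Assumes the six-column layout seen in this thread and numeric store
    revisions; rows that do not fit are skipped.
    """
    revisions = []
    for line in snap_list_output.strip().splitlines()[1:]:  # skip header row
        columns = line.split()
        if len(columns) < 6 or not columns[2].isdigit():
            continue
        rev, notes = columns[2], columns[5]
        if "disabled" in notes.split(","):
            revisions.append(int(rev))
    return revisions


# Sample copied from the output posted in this thread.
sample = """\
Name  Version        Rev    Tracking    Publisher   Notes
lxd   4.0.9-8e2046b  22753  4.0/stable  canonical✓  disabled,in-cohort
lxd   4.0.9-eb5e237  23991  4.0/stable  canonical✓  in-cohort
"""
print(disabled_revisions(sample))
```

On the sample from this thread, the only disabled revision is 22753, which is the value that would go into `sudo snap revert lxd --revision <revision number>`.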

Mantas · December 2, 2022, 12:09pm

I’m just asking whether the other virtual machines and containers running on the same server will be restarted when I revert LXD. AFAIK I should run this command, right?

snap revert lxd --revision 22753

tomp · December 2, 2022, 12:14pm

No, they shouldn’t be, as this is the same as the snap refresh that occurs automatically, only in the other direction.

However, as you’re running a cluster, you might need to do this on the other members too. You’ll know if you need to, because the snap refresh will pause, waiting for the other members to arrive at the same revision.

I think this shouldn’t be needed though, as it’s just a minor snap revision and not a schema or API change.

tomp · December 2, 2022, 3:28pm

@stgraber is looking into this now, but we suspect that the more recent QEMU version in the 4.0 LTS snap is causing a seccomp violation.

We think this commit also needs to be backported into the LTS 4.0 series:

github.com/lxc/lxd

lxd/instance/qemu: Set spawn=allow

committed 11:07PM - 07 Mar 22 UTC

stgraber

+1 -1

spawn=allow is required when QEMU is asked to daemonize. Previous QEMU versions… would incorrectly block `fork` when spawn=deny was passed while allowing clone to succeed. This was then making it possible for us to use daemnize thanks to most Linux distributions using the clone syscall to implement fork. Current QEMU has fixed their seccomp profile to block both fork and clone, preventing -daemonize when spawn=deny is passed. Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
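
The seccomp audit records posted earlier in the thread appear to line up with this commit message: they show sig=31 and syscall=56 for qemu-system-x86. A minimal Python sketch for decoding such a record follows. The helper name `decode_seccomp_record` and the small lookup tables are illustrative only (they are not part of LXD or any audit tooling), and the tables cover just x86_64 (arch=c000003e), where syscall 56 is clone, 57 is fork and 58 is vfork, and signal 31 is SIGSYS.

```python
import re

# Illustrative lookup tables, limited to what this thread needs (x86_64 only).
X86_64_SYSCALLS = {56: "clone", 57: "fork", 58: "vfork"}
SIGNALS = {31: "SIGSYS"}
AUDIT_ARCH_X86_64 = 0xC000003E


def decode_seccomp_record(line: str) -> dict:
    """Pull key=value fields out of a type=1326 audit line and name the syscall."""
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    arch = int(fields["arch"], 16)
    if arch != AUDIT_ARCH_X86_64:
        raise ValueError(f"unsupported arch {arch:#x}; tables only cover x86_64")
    syscall = int(fields["syscall"])
    sig = int(fields["sig"])
    return {
        "comm": fields["comm"].strip('"'),
        "signal": SIGNALS.get(sig, str(sig)),
        "syscall": X86_64_SYSCALLS.get(syscall, str(syscall)),
    }


# Record copied from the kernel log posted at the top of this thread.
record = (
    'audit: type=1326 audit(1669925884.108:3165): auid=4294967295 uid=0 gid=0 '
    'ses=4294967295 pid=397774 comm="qemu-system-x86" '
    'exe="/snap/lxd/23991/bin/qemu-system-x86_64" sig=31 arch=c000003e '
    'syscall=56 compat=0 ip=0x7f103bf32f3f code=0x80000000'
)
print(decode_seccomp_record(record))
```

For the record from this thread, the sketch reports that clone was stopped with SIGSYS, which fits the commit's explanation that current QEMU blocks both fork and clone when spawn=deny is combined with -daemonize.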

macduff23 · May 2, 2023, 8:16am

Is there a fix for this? I’m trying to follow this thread but am not sure where to go with it.
This is a pfSense LXD VM created from an .iso per the Netgate doc.
.4.0-148-generic
VERSION="20.04.1 LTS (Focal Fossa)"
lxd --version
5.13 --edge

Error: Failed to run: forklimits limit=memlock:unlimited:unlimited fd=3 -- /snap/lxd/24814/bin/qemu-system-x86_64 -S -name pfsense -uuid cfe6a9cf-52fa-42ba-a67e-e0a903b6d5fa -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/pfsense/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/pfsense/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/pfsense/qemu.pid -D /var/snap/lxd/common/lxd/logs/pfsense/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd -boot menu=on -machine pc-q35-2.6 -device virtio-vga -vnc :2 -drive file=/home/ubuntu/pfSense-CE-2.6.0-RELEASE-amd64.iso,index=0,media=cdrom,if=ide: qemu-system-x86_64: -vnc :2: VNC support is disabled
: Process exited with non-zero value 1
Try lxc info --show-log pfsense for more info N/A

thanks
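
One way to compare this failure with the one at the top of the thread is to look at the `-sandbox` options and at the message QEMU itself printed. The Python sketch below pulls both out of an LXD error string. The helper names `sandbox_options` and `qemu_message` are hypothetical, and the sample strings are shortened copies of the two errors quoted in this thread.

```python
import re
from typing import Dict, Optional


def sandbox_options(error_text: str) -> Dict[str, str]:
    """Return the key=value pairs given to QEMU's -sandbox option."""
    match = re.search(r"-sandbox\s+(\S+)", error_text)
    if match is None:
        return {}
    items = match.group(1).split(",")
    return dict(item.split("=", 1) for item in items if "=" in item)


def qemu_message(error_text: str) -> Optional[str]:
    """Return the last message printed by qemu-system-x86_64 itself, if any."""
    messages = re.findall(r"qemu-system-x86_64: (.+)", error_text)
    return messages[-1].strip() if messages else None


# Shortened copies of the errors in this thread (most arguments omitted).
old_error = (
    "Error: Failed to run: forklimits limit=memlock:unlimited:unlimited -- "
    "/snap/lxd/23991/bin/qemu-system-x86_64 -S -name ubuntu -daemonize "
    "-sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny "
    "-runas lxd: : Process exited with non-zero value -1"
)
new_error = (
    "Error: Failed to run: forklimits limit=memlock:unlimited:unlimited fd=3 -- "
    "/snap/lxd/24814/bin/qemu-system-x86_64 -S -name pfsense -daemonize "
    "-sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny "
    "-runas lxd -vnc :2 -drive file=/home/ubuntu/pfSense-CE-2.6.0-RELEASE-amd64.iso,"
    "index=0,media=cdrom,if=ide: qemu-system-x86_64: -vnc :2: VNC support is disabled\n"
    ": Process exited with non-zero value 1"
)

for label, text in (("4.0.9 error", old_error), ("5.13 error", new_error)):
    print(label, sandbox_options(text).get("spawn"), qemu_message(text))
```

On these samples, the original error has spawn=deny and no message from QEMU, while the pfSense error already has spawn=allow and QEMU reports that VNC support is disabled. That suggests this failure may have a different cause from the seccomp problem discussed earlier in the thread.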