lxc list barely responds, and each VM pins one CPU core at 100% while the VMs are up and running.
After lxc stop VM-01 --force, LXD no longer responds at all.
I have to hard-reset the server because it hangs during the reboot.
This only affects virtual machines in LXD, not containers.
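For reference, a few generic checks that can show where the CPU time is going before resetting (the PID below is a placeholder):

pgrep -fa qemu-system-x86_64            # find the QEMU process backing each VM
top -H -p <qemu-pid>                    # see which QEMU thread is spinning
grep State: /proc/<qemu-pid>/status     # D = stuck in uninterruptible I/O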
Operating system on host and VMs: Ubuntu 20.04, latest patches on both the host and the VMs.
Remaining upgradable packages (apt list --upgradable):
libnetplan0/focal-updates 0.103-0ubuntu5~20.04.6 amd64 [upgradable from: 0.103-0ubuntu5~20.04.5]
libxml2/focal-updates,focal-security 2.9.10+dfsg-5ubuntu0.20.04.2 amd64 [upgradable from: 2.9.10+dfsg-5ubuntu0.20.04.1]
netplan.io/focal-updates 0.103-0ubuntu5~20.04.6 amd64 [upgradable from: 0.103-0ubuntu5~20.04.5]
Do you have any idea what might be going on? I will try restoring old backups now.
Stopping the VMs with --force no longer works either. The host hangs on reboot; only a hard reset gets it through shutdown. The containers, in contrast, are healthy and running.
lxc info VM-01:
Name: VM-01
Status: RUNNING
Type: virtual-machine
Architecture: x86_64
PID: 3834
Created: 2022/02/01 13:01 UTC
Last Used: 2022/03/15 17:36 UTC

Resources:
  Processes: -1
  Disk usage:
    root: 11.46GiB

Log:
warning: tap: open vhost char device failed: Permission denied
snap list:
Name    Version   Rev    Tracking       Publisher   Notes
core20  20220304  1376   latest/stable  canonical✓  base
lxd     4.24      22662  latest/stable  canonical✓  -
snapd   2.54.4    15177  latest/stable  canonical✓  snapd
@stgraber @tomp Unfortunately this seems to come down to version 4.24 in latest/stable.
Could you share some information about what changed and a workaround for this bug?
+-------+--------+--------+-------------+---------+---------+
| NAME | DRIVER | SOURCE | DESCRIPTION | USED BY | STATE |
+-------+--------+--------+-------------+---------+---------+
| local | zfs | LXD | | 7 | CREATED |
+-------+--------+--------+-------------+---------+---------+
Please can you show the output of lxc config show <instance> --expanded, as I'm interested in the warning: tap: open vhost char device failed: Permission denied.
Also, please can you show the output of sudo dmesg | grep DENIED after trying to start one of the instances.
I see this message quite often with VMs; I don't think it is the problem.
Note: I did not run the grep after starting the VMs. With VMs running I have to hard-reset my node, with a risk of corruption, so I'd prefer not to boot the VMs right now.
The VMs use one core at 100% after starting. The LXD process hangs after a while and stops responding to commands. I didn't check the disk I/O once this behaviour starts. I have two enterprise NVMe SSDs in the server, but I still suspect io_uring. Can I turn it off for a VM to test this?
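For what it's worth, one way to confirm whether io_uring is in use for the VM's disk, assuming the snap layout and that LXD writes a qemu.conf for the instance (the path is an assumption and may differ):

grep -i io_uring /var/snap/lxd/common/lxd/logs/VM-01/qemu.conf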
uname -a:
Linux 5.4.0-104-generic #118-Ubuntu SMP Wed Mar 2 19:02:41 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
I can't disable the LXD snap either. The process hangs, so the server does not shut down while this issue is present. LXD therefore gets in the way of shutting down the server, because it is not possible to kill a VM with --force; LXD freezes after the lxc stop VM-01 --force command.
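(Normally that would just be sudo snap stop lxd or sudo snap disable lxd, but with this issue the stop never completes.)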
I don’t have any special configurations on my LXD servers at all.
Two NVMe SSDs, each with one ZFS partition, in a mirror. I use LUKS encryption on those partitions.
Other than that it’s just a normal simple pool with no special configurations.
zpool status
  pool: LXD
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:07:58 with 0 errors on Tue Mar 15 20:56:40 2022
config:

        NAME          STATE     READ WRITE CKSUM
        LXD           ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            LXDNVME0  ONLINE       0     0     0
            LXDNVME1  ONLINE       0     0     0

errors: No known data errors
LUKS is enabled on the ZFS partitions, not on the OS partition.
It is standard LUKS encryption on the partitions; after a reboot I have to unlock both SSDs with a passphrase. Usually I disable LXD before the reboot and enable it again after unlocking.
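In other words, the layering is roughly the following sketch (device names, partition numbers and the storage-create step are assumptions, not the exact commands):

sudo cryptsetup open /dev/nvme0n1pX LXDNVME0    # passphrase prompt, one per SSD
sudo cryptsetup open /dev/nvme1n1pX LXDNVME1
sudo zpool create LXD mirror /dev/mapper/LXDNVME0 /dev/mapper/LXDNVME1
lxc storage create local zfs source=LXD         # LXD pool "local" on the existing zpool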
Which is strange, as we run daily tests of this on Focal using the snap with various network types (and mine is failing with just the normal bridged NIC type).
OK, so the warning: tap: open vhost char device failed: Permission denied is a red herring; it shows up even when everything is working fine.
Interestingly, I've recreated the problem when using ZFS on top of LVM.
I've not tried ZFS direct to disk yet. Is this something you can try (i.e. creating a ZFS pool without LUKS)?
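For example, something along these lines (pool name, device path and instance name are just placeholders):

lxc storage create zfs-noluks zfs source=/dev/nvmeXnYpZ    # spare partition or disk, no LUKS underneath
lxc launch ubuntu:20.04 io-test --vm -s zfs-noluks         # test VM on the new pool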
I've also confirmed it is working on a loop-backed ZFS pool (we already disable io_uring in that case).
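(A loop-backed pool for comparison can be created with something like lxc storage create loop-test zfs size=30GiB; the name and size here are arbitrary.)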
Disabling io_uring fixes it (via a custom build), but I want to see if there is something else at play here.
Generally, though, io_uring seems to be very unreliable with QEMU on stacked storage layers.