Hi, one of my machines has developed a problem whereby LXD stops working: lxc list hangs indefinitely with no output. In fact, almost any command hangs (lxc launch foo, lxc stop hoge, etc.).
Rebooting fixes it, although the shutdown takes an extra 10 minutes because of this message: "A stop job is running for the Service for snap application lxd.daemon".
I don’t see anything that jumps out at me in the output of journalctl -u snap.lxd.daemon -n 3000. The last line is:
Feb 11 15:17:04 ubuttnu1 lxd.daemon: => LXD is ready
That was a couple of hours ago, right after the last reboot.
I also looked at dmesg -T (I’m typing this on the machine where LXD has hung, before rebooting it). I’m not sure what I am looking for, but I don’t see anything in there that doesn’t also appear in journalctl -b -3 (3 boots ago, when I used LXD normally and it didn’t hang).
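In case it helps anyone repeat that comparison, here is a rough sketch of how the kernel messages of the two boots could be diffed instead of compared by eye. The helper name new_kernel_lines and the /tmp paths are made up for illustration, and it assumes the journal still holds the older boot. Messages that embed PIDs or addresses will show up as "new" even when they are harmless, so treat the output as a starting point only.

```shell
# new_kernel_lines GOOD_FILE HUNG_FILE
# Print the lines of HUNG_FILE that do not appear anywhere in GOOD_FILE.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file
new_kernel_lines() {
    grep -Fvx -f "$1" "$2"
}

# Possible usage on the host (-k = kernel messages only, -o cat drops the
# timestamps so that identical messages from different boots compare equal):
#   journalctl -k -b -3 -o cat > /tmp/good-boot.txt
#   journalctl -k -b 0  -o cat > /tmp/hung-boot.txt
#   new_kernel_lines /tmp/good-boot.txt /tmp/hung-boot.txt
```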
Are there any tips on how to debug this kind of occurrence? I have 4 machines dedicated to LXD (not a cluster though, just a small lab with various VMs), and LXD is working normally on the other 3.
All these machines are running VMs doing PCI device passthrough. They all run Ubuntu 22.04 LTS and the feature release LXD snap package (5.10). They all use ZFS storage.
Unfortunately, the hardware is heterogeneous, and this machine is passing through a different USB card and GPU than the others, so it could well be some kind of voodoo problem that only happens on this hardware. There are no apparent problems with the host system itself: 0% CPU, 20+ GB free RAM out of 64 GB, plenty of space on the filesystems.
So far, the only common thing is that the hangs happen after rebooting the same VM: an archlinux VM created from the default archlinux image. My next step will be to create a similar VM based on some other Linux, do the same things to it, and see if I can get the problem to recur.
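A rough sketch of how that test could be scripted, so that a hang is noticed right after the restart that triggers it. The function names, the 30 and 300 second limits, and the VM name in the usage line are all made up for illustration. It relies only on timeout from coreutils plus the lxc restart and lxc list commands.

```shell
# responds_within SECONDS CMD [ARGS...]
# Succeeds only if CMD exits successfully on its own within SECONDS.
# If timeout(1) has to kill CMD, it returns status 124, which counts as failure.
responds_within() {
    local limit=$1
    shift
    timeout "$limit" "$@" > /dev/null 2>&1
}

# restart_and_check VM ROUNDS
# Restart VM up to ROUNDS times and stop at the first round after which
# `lxc list` no longer answers within 30 seconds.
restart_and_check() {
    local vm=$1 rounds=$2 i=1
    while [ "$i" -le "$rounds" ]; do
        responds_within 300 lxc restart "$vm" ||
            echo "round $i: restart failed or timed out"
        if ! responds_within 30 lxc list; then
            echo "round $i: lxc list did not answer, LXD looks hung"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no hang after $rounds restarts"
}

# Possible usage (test-vm is a hypothetical instance name):
#   restart_and_check test-vm 20
```

Running the same loop against the archlinux VM and against the new VM should show whether the hang follows the guest or the passed-through hardware.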
Are there any other things I should be looking at to debug why this is happening?
Somewhat ironically, one thing I am doing with these machines is making a training video for how to wipe a physical machine and set it up as an LXD host, so nuking this machine and reinstalling the host OS and LXD from scratch isn’t really a problem.
But I would like to learn what I can about why this occurs, before doing so.
Thanks for any tips!