LXD - Cannot delete VM with error

Hello guys, and once again congratulations on your wonderful work!

I’m facing an issue with my snap lxd installation. I think it started after the latest lxd update (revision 18077): “lxc list” just hung without showing the list of containers.
In the LXD log, the last message was:

lvl=info msg="Loading daemon configuration"

Tried reverting lxd to a previous release, but things got stuck so I forcibly rebooted the server.

After the reboot, lxc list successfully lists the containers (I also have 2 VMs), and I noticed that one of the VMs has an “ERROR” status.
I didn’t need it anyway, so I tried deleting it. That’s when the fun started.

root@gw:depozit/virtualwin # lxc delete vm
Error: The instance is currently running, stop it first or pass --force
root@gw:depozit/virtualwin # lxc delete vm --force
Error: Instance is running
root@gw:depozit/virtualwin # lxc delete vm --force-local --force
Error: Instance is running
root@gw:depozit/virtualwin # lxc stop vm
Error: dial unix /var/snap/lxd/common/lxd/logs/vm/qemu.monitor: connect: connection refused
root@gw:depozit/virtualwin # lxc stop --force vm
root@gw:depozit/virtualwin # lxc delete vm
Error: The instance is currently running, stop it first or pass --force

I ran out of ideas… :frowning:

Looks like a qemu issue. Run ps aux | grep qemu, find the process for that VM and kill it with kill -9 PID; after that, the delete should go through.
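
A minimal sketch of that lookup, for anyone who prefers not to read the ps output by eye. It assumes the qemu command line carries the instance name as "-name <instance>", as in the ps output posted in this thread; the function name qemu_pids_for is made up for illustration and is not part of LXD.

```shell
# Print the PID of every qemu process whose "-name" argument equals the
# given instance name. Reads "PID ARGS..." lines on stdin.
# Assumption: the instance name follows "-name" on the qemu command line.
qemu_pids_for() {
    awk -v name="$1" '
        $2 ~ /qemu-system/ {
            for (i = 3; i < NF; i++)
                if ($i == "-name" && $(i + 1) == name) { print $1; break }
        }'
}
```

It would be fed with something like ps -eo pid=,args= | qemu_pids_for vm. If it prints nothing, there is no qemu process left to kill for that instance.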

Thank you for your reply !
Unfortunately I’ve also tried that :slight_smile:

This is the other VM that is ok :

| onlyoffice | RUNNING | 192.168.0.40 (eth0) | | VIRTUAL-MACHINE | 0 |

And the output of ps -ef | grep qemu:

lxd 1406 1 21 17:28 ? 00:05:39 /snap/lxd/18077/bin/qemu-system-x86_64 -S -name onlyoffice -uuid 1335069d-0fb4-4b4f-9e24-337d51f5f12f -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-reboot -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/onlyoffice/qemu.conf -pidfile /var/snap/lxd/common/lxd/logs/onlyoffice/qemu.pid -D /var/snap/lxd/common/lxd/logs/onlyoffice/qemu.log -chroot /var/snap/lxd/common/lxd/virtual-machines/onlyoffice -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd

Only one VM running…

root@gw:~facturi/virtualwin # lxc info vm
Name: vm
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/04/26 17:45 UTC
Status: Error
Type: virtual-machine
Profiles: vm
Pid: 4498
Resources:
Processes: 0

Can you show the output of ls -la /var/snap/lxd/common/lxd/logs/vm/ please?

Sure !

drwx------ 2 root root 4096 Oct 29 22:42 .
drwx------ 14 root root 4096 Oct 30 17:27 ..
-rw-r----- 1 root root 4724 Oct 27 22:40 qemu.conf
srwxr-x--- 1 root root 0 Oct 27 22:40 qemu.monitor
-rw------- 1 root root 5 Oct 27 22:40 qemu.pid
srwxr-x--- 1 root root 0 Oct 27 22:40 qemu.spice

There’s a good chance that deleting qemu.monitor in that dir will solve the issue, but I’m a bit confused as to why LXD doesn’t treat the dead socket as the VM being stopped.

I tried moving the qemu.monitor file out of the folder, to somewhere in /tmp.
Unfortunately lxd still sees the machine as running…

root@gw:logs/vm # lxc delete vm --force
Error: Instance is running
root@gw:logs/vm # ll
total 12K
-rw-r----- 1 root root 4.7K Oct 27 22:40 qemu.conf
-rw------- 1 root root 5 Oct 27 22:40 qemu.pid
srwxr-x--- 1 root root 0 Oct 27 22:40 qemu.spice

VICTORY !

After also moving qemu.pid out of the folder (the PID inside didn’t correspond to anything running at this time), lxd deleted the VM.
Please let me know if I can offer you any other details so maybe it helps anyone in my situation or helps you debug this :slight_smile:

Can you show the output of cat qemu.pid and check that the process ID doesn’t exist, please?
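
One possible way to run that check in a single step is sketched below. The function name pid_file_state is hypothetical, and the sketch assumes it is run as root (as in the sessions above), since kill -0 also fails for a live process that the caller is not allowed to signal.

```shell
# Report whether the PID recorded in a pid file belongs to a live process.
# Prints "missing" (no pid file), "stale" (file present, process gone or
# content not a PID) or "running".
pid_file_state() (
    pid_file=$1
    [ -f "$pid_file" ] || { echo "missing"; exit 0; }
    pid=$(tr -d '[:space:]' < "$pid_file")
    case $pid in
        ''|*[!0-9]*) echo "stale"; exit 0 ;;   # empty or non-numeric content
    esac
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "stale"
    fi
)
```

For the VM in this thread the call would be pid_file_state /var/snap/lxd/common/lxd/logs/vm/qemu.pid. Note that a "running" result only says that some process has that PID, not that it is the qemu process for this VM.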

We’ve had instances where the socket is dead but the process lives on, hence the ERROR state rather than STOPPED. In this case, however, it looks like we need to give lxc stop the ability to detect a dead socket and a dead process and clean up the state files.

I suspect it’s related to https://github.com/lxc/lxd/pull/7966

I’ll try to recreate it and put up a fix for that.

Well, it seems this was the situation the other way around: the socket and the pid file existed, but the process with the corresponding PID wasn’t in fact running.
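
For anyone who lands in the same state (state files present, process gone), the manual workaround used earlier in this thread could be sketched roughly as follows. This is only an illustration of what was done by hand here, not how LXD itself cleans up; the function name and the holding directory are made up, and it assumes root privileges.

```shell
# Move qemu.pid and qemu.monitor out of a VM log directory, but only when
# the recorded PID no longer belongs to a live process. The files are moved
# rather than deleted so they can be put back if needed.
quarantine_stale_qemu_state() (
    log_dir=$1     # e.g. /var/snap/lxd/common/lxd/logs/vm
    dest_dir=$2    # hypothetical holding directory, e.g. under /tmp
    pid_file=$log_dir/qemu.pid
    [ -f "$pid_file" ] || { echo "no pid file, nothing to do"; exit 0; }
    pid=$(tr -d '[:space:]' < "$pid_file")
    case $pid in
        ''|*[!0-9]*) ;;    # unreadable PID: treat the state as stale
        *)
            if kill -0 "$pid" 2>/dev/null; then
                echo "PID $pid is still alive, leaving state files alone" >&2
                exit 1
            fi ;;
    esac
    mkdir -p "$dest_dir" || exit 1
    for f in qemu.pid qemu.monitor; do
        if [ -e "$log_dir/$f" ]; then
            mv "$log_dir/$f" "$dest_dir/" || exit 1
        fi
    done
    echo "moved stale state files to $dest_dir"
)
```

After something like quarantine_stale_qemu_state /var/snap/lxd/common/lxd/logs/vm /tmp/vm-state, lxc delete vm --force went through in the case described above. The function refuses to touch anything while the PID is alive, because in that case the qemu process should be dealt with first.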
