Hi, I have launched an Alpine 3.18 container in a cluster. It starts up fine, but I can't stop or delete it, even with --force; it just hangs. It seems I have no way of deleting it short of rebooting hosts. My hosts are Ubuntu 22.04 LTS and my LXD version is 5.19, installed via snap.
Before anyone attempts to kill it, the Alpine container seems to run OK. Doing lxc stop Alp hangs, but doing lxc exec Alp -- sh and then poweroff inside appears to work; afterwards, though, the state is ERROR and you cannot delete the container. I would like to be able to remove it without having to reboot hosts or force-remove them from the cluster.
rich@rich:~$ lxc list
+------------------+---------+------+------+-----------------+-----------+----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS | LOCATION |
+------------------+---------+------+------+-----------------+-----------+----------+
| Alp | ERROR | | | CONTAINER | 0 | matt.b |
+------------------+---------+------+------+-----------------+-----------+----------+
| z-Template-basic | STOPPED | | | CONTAINER | 0 | rich |
+------------------+---------+------+------+-----------------+-----------+----------+
| z-Template-win10 | STOPPED | | | VIRTUAL-MACHINE | 0 | rich |
+------------------+---------+------+------+-----------------+-----------+----------+
rich@rich:~$ lxc info Alp
(never comes back here)
Sounds like this may be a kernel hang due to io_uring.
You'd want to make sure that all updates are applied on your host, then reboot to get onto the latest available kernel.
If that doesn't work out, see if your distro offers newer kernel builds (HWE kernels for Ubuntu, for example).
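On Ubuntu 22.04 that advice translates to something like the following (a sketch; the HWE meta-package name assumes the stock 22.04 naming scheme):

```shell
# Apply all pending updates, then reboot onto the newest installed kernel
sudo apt update && sudo apt full-upgrade
sudo reboot

# Optionally, switch to the HWE (newer) kernel series on 22.04
sudo apt install --install-recommends linux-generic-hwe-22.04
sudo reboot
```

After rebooting, uname -r should show the newer kernel version.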
rich@rich:~$ uname -a
Linux rich 6.2.0-37-generic #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
rich@rich:~$
dmesg is massive and a lot has gone on since I raised this issue. It seems too big to paste in here; I will post it after I have rebooted.
Thanks for your help in the meantime.
Okay, can you go through ps fauxww and see if the stuck container has a [lxc monitor] type process running? If it does, can you do cat /proc/PID/stack for that particular process?
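Concretely, those two checks look something like this (a sketch; substitute the real PID from the ps output):

```shell
# Find the container's monitor process; its name shows up in brackets
ps fauxww | grep '\[lxc monitor\]'

# Dump the kernel stack of that PID (needs root); a process wedged in
# uninterruptible I/O wait will show where in the kernel it is stuck
sudo cat /proc/PID/stack
```

A stack full of io_uring or block-layer frames would support the kernel-hang theory above.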
Hi Stéphane
Sorry for the delay. It all got jammed up: heartbeats were lost and then quorum was lost. I had to reboot the cluster member matt.b (that's where the offending Alpine container last was). Before I rebooted, I got the output below. Now I've had to do a…
lxd cluster recover-from-quorum-loss
That leaves me with one machine looking like this, but still no way to get rid of the Alp (Alpine) container…
rich@rich:~$ lxc ls -c ndts4DmL
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
| NAME | DESCRIPTION | TYPE | STATE | IPV4 | DISK USAGE | MEMORY USAGE | LOCATION |
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
| Alp | | CONTAINER | ERROR | | | | matt.b |
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
| Uboo | | CONTAINER | STOPPED | | 9.19MiB | | rich |
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
| z-Template-basic | To copy. basic inc... | CONTAINER | STOPPED | | 788.67MiB | | rich |
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
| z-Template-win10 | Template Windows 10 system | VIRTUAL-MACHINE | STOPPED | | 14.49GiB | | rich |
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
rich@rich:~$ lxc delete Alp
Error: Failed checking instance exists "local:Alp": Missing event connection with target cluster member
rich@rich:~$ lxc delete Alp --force
Error: Failed checking instance exists "local:Alp": Missing event connection with target cluster member
rich@rich:~$ lxc mv Alp --target rich
Error: Migration API failure: Target cluster member is offline
rich@rich:~$
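For anyone following along, the quorum-loss recovery mentioned above roughly follows the documented LXD procedure on the surviving member (a sketch; run only when the other members are genuinely unrecoverable, as it rewrites the cluster database):

```shell
# Stop the LXD daemon on the remaining member first
sudo snap stop lxd

# Promote this member to a standalone database node
sudo lxd cluster recover-from-quorum-loss

# Bring LXD back up
sudo snap start lxd
```

Note this only restores database quorum; as the errors above show, instances still recorded on an offline member (like Alp on matt.b) remain stuck.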
Hmm, your system has a bunch of processes stuck in I/O wait, including things that really shouldn’t get stuck in I/O wait like ip link set, this definitely looks like an unhappy kernel.
Have things gotten back to normal after the reboot?
If so, please make sure that you have all kernel updates applied; with a bit of luck that will fix whatever kernel issue caused this.
I have now finally managed to remove LXD from the node that was hosting the offending Alpine container. I could then reinstall it, redo lxd init, and the cluster is now up without the offending container.
Since I couldn't stop LXD cleanly, I removed it by finding the likely process PIDs and killing them with -9; after that, 'snap remove lxd' finally worked, and I installed again. The rest of the cluster remains OK.
I reckon if I were to launch Alpine 3.18 again I would be back where I started; I've not tried other versions of Alpine. I now have a network access problem, which I will raise separately. Thanks for your help.
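For the record, the forced removal described above amounts to something like this (a last-resort sketch; destructive, and the PIDs shown by ps must be substituted by hand):

```shell
# Identify the stuck LXD and monitor processes
ps fauxww | grep -E 'lxd|lxc monitor'

# Kill them hard; normal shutdown was hanging
sudo kill -9 PID

# With the hung processes gone, the snap can finally be removed
sudo snap remove lxd

# Reinstall and reconfigure from scratch
sudo snap install lxd
sudo lxd init
```

Rebuilding via lxd init discards the node's old cluster state, which is why the stuck Alp record disappears along with it.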