Container in cluster locked up, can't delete it, get info on it, or do anything with it

Hi, I have launched an Alpine 3.18 container in a cluster. It starts up fine, but I can't stop it or delete it, even if I use --force; the command just hangs. It seems I have no way of deleting it without rebooting hosts. My hosts are Ubuntu 22.04 LTS and my LXD version is 5.19, installed via snap. Until you attempt to kill it, the Alpine container seems to run OK. Doing lxc stop Alp hangs, but doing lxc exec Alp -- sh and then poweroff inside appears to work; afterwards, though, the state is ERROR and the container cannot be deleted. I would like to be able to remove it without having to reboot hosts or force-remove them from the cluster.
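
Roughly, the sequence that gets me into this state is the one below (the launch command is from memory, so the image alias and --target flag are my best recollection rather than an exact transcript):

lxc launch images:alpine/3.18 Alp --target matt.b
lxc stop Alp              # hangs indefinitely
lxc exec Alp -- sh        # works; running poweroff inside appears to stop it
lxc delete Alp --force    # the container now shows ERROR and this hangs too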

rich@rich:~$ lxc list
+------------------+---------+------+------+-----------------+-----------+----------+
|       NAME       |  STATE  | IPV4 | IPV6 |      TYPE       | SNAPSHOTS | LOCATION |
+------------------+---------+------+------+-----------------+-----------+----------+
| Alp              | ERROR   |      |      | CONTAINER       | 0         | matt.b   |
+------------------+---------+------+------+-----------------+-----------+----------+
| z-Template-basic | STOPPED |      |      | CONTAINER       | 0         | rich     |
+------------------+---------+------+------+-----------------+-----------+----------+
| z-Template-win10 | STOPPED |      |      | VIRTUAL-MACHINE | 0         | rich     |
+------------------+---------+------+------+-----------------+-----------+----------+
rich@rich:~$ lxc info Alp
(never comes back here)

Any help much appreciated.

Can you show uname -a and the output of dmesg?

Sounds like this may be a kernel hang due to io_uring.
You’d want to make sure that all updates are applied on your host and then reboot to be on the latest kernel possible.

If that doesn’t work out, see if your distro offers newer kernel builds (HWE kernels for Ubuntu for example).
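
On Ubuntu 22.04 that boils down to something like the following (the HWE package name assumes the standard generic flavour):

sudo apt update && sudo apt full-upgrade
sudo apt install --install-recommends linux-generic-hwe-22.04   # opt into the HWE kernel series
sudo reboot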

uname looks like this:

rich@rich:~$ uname -a
Linux rich 6.2.0-37-generic #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov  2 18:01:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
rich@rich:~$

dmesg is massive, and a lot has gone on since I raised this issue; it seems to be too big to paste in here. I will drop it in here after I have rebooted.
Thanks for your help in the meantime.

Okay, can you go through ps fauxww and see if the stuck container has an [lxc monitor] type process running? If it does, can you do cat /proc/PID/stack for that particular process?
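
Something like this should do it, replacing PID with whatever the first command turns up:

ps fauxww | grep '\[lxc monitor\]'
sudo cat /proc/PID/stack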

Hi Stéphane,
Sorry for the delay. It all got jammed up: heartbeats were lost and then quorum was lost. I had to reboot the cluster member matt.b (that's where the offending Alpine container last was). Before I rebooted I got the output below. Since then I've had to do a…

lxd cluster recover-from-quorum-loss
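
For completeness, the recovery on the surviving member went roughly like this (from memory, so treat it as a sketch rather than an exact transcript):

sudo snap stop lxd
sudo lxd cluster recover-from-quorum-loss
sudo snap start lxd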

That leaves me with one machine, as shown below, but still no way to get rid of the Alp (Alpine) container…

rich@rich:~$ lxc ls -c ndts4DmL
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
|       NAME       |        DESCRIPTION         |      TYPE       |  STATE  | IPV4 | DISK USAGE | MEMORY USAGE | LOCATION |
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
| Alp              |                            | CONTAINER       | ERROR   |      |            |              | matt.b   |
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
| Uboo             |                            | CONTAINER       | STOPPED |      | 9.19MiB    |              | rich     |
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
| z-Template-basic | To copy. basic inc...      | CONTAINER       | STOPPED |      | 788.67MiB  |              | rich     |
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
| z-Template-win10 | Template Windows 10 system | VIRTUAL-MACHINE | STOPPED |      | 14.49GiB   |              | rich     |
+------------------+----------------------------+-----------------+---------+------+------------+--------------+----------+
rich@rich:~$ lxc delete Alp
Error: Failed checking instance exists "local:Alp": Missing event connection with target cluster member
rich@rich:~$ lxc delete Alp --force
Error: Failed checking instance exists "local:Alp": Missing event connection with target cluster member
rich@rich:~$ lxc mv Alp --target rich
Error: Migration API failure: Target cluster member is offline
rich@rich:~$ 

The output requested, as far as it goes…

matt@Matt-Desktop:~$ ps fauxww|head -1 && ps fauxww|tail -14
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        1470  0.0  0.0 242220  8704 ?        Ssl  Nov20   0:00 /usr/libexec/upowerd
kernoops    1568  0.0  0.0  13084  2196 ?        Ss   Nov20   0:02 /usr/sbin/kerneloops --test
kernoops    1570  0.0  0.0  13084  2072 ?        Ss   Nov20   0:02 /usr/sbin/kerneloops
root        1573  0.2  0.6 1655972 131988 ?      Sl   Nov20  54:17 /opt/teamviewer/tv_bin/teamviewerd -d
root        1624  0.0  0.2 373252 43700 ?        Dsl  Nov20   1:10 /usr/libexec/packagekitd
uuidd      22898  0.0  0.0  11796  3072 ?        Ss   Nov23   0:00 /usr/sbin/uuidd --socket-activation
root       96078  0.0  0.0   2888  1664 ?        Ss   Dec05   0:00 /bin/sh /snap/lxd/26200/commands/daemon.start
root       96262  0.6  0.6 7636140 124060 ?      Sl   Dec05  34:27  \_ lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
lxd        98387  0.0  0.0  10464  4480 ?        Ss   Dec05   0:00      \_ dnsmasq --keep-in-foreground --strict-order --bind-interfaces --except-interface=lo --pid-file= --no-ping --interface=lxdfan0 --dhcp-rapid-commit --no-negcache --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=240.10.0.1 --dhcp-no-override --dhcp-authoritative --dhcp-option-force=26,1450 --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdfan0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdfan0/dnsmasq.hosts --dhcp-range 240.10.0.2,240.10.0.254,1h -s lxd --interface-name _gateway.lxd,lxdfan0 -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdfan0/dnsmasq.raw -u lxd -g lxd
root      134403  0.0  0.0   4544  2304 ?        D    Dec07   0:00      \_ ip link set dev vethe2c5ef93 nomaster
root       96251  0.0  0.0 227920  2304 ?        Sl   Dec05   0:00 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
root      121471  0.0  0.0  73628 14464 ?        Ss   Dec07   0:00 /usr/sbin/cupsd -l
root      121478  0.0  0.0 172668 11904 ?        Dsl  Dec07   0:00 /usr/sbin/cups-browsed
root      133517  0.0  0.0 5855568 15628 ?       Ds   Dec07   0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers Alp
matt@Matt-Desktop:~$ sudo cat /proc/133517/stack
^C^C^C
matt@Matt-Desktop:~$

Hmm, your system has a bunch of processes stuck in I/O wait, including things that really shouldn't get stuck in I/O wait, like ip link set. This definitely looks like an unhappy kernel.

Have things gotten back to normal after the reboot?

If so, please make sure that you have all kernel updates applied, with a bit of luck that will fix whatever kernel issue caused this.
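
Incidentally, a quick way to spot processes stuck like that is to look for a STAT column starting with D (uninterruptible sleep); this is plain ps/awk, nothing LXD-specific:

ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'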

I have now finally managed to remove LXD from the node which was hosting the offending Alpine container. I could then reinstall it, redo lxd init, and now the cluster is up without the offending Alpine container. Since I couldn't stop LXD, I removed it by finding the likely process PIDs and killing them with -9; after that, snap remove lxd finally worked, and I then installed it again. The rest of the cluster remains OK. I reckon that if I were to install Alpine 3.18 again I would be back where I started; I've not tried other versions of Alpine. I now have a network access problem, which I will raise separately. Thanks for your help.
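
In case it helps anyone else hitting the same wall, the brute-force removal went roughly like this (PIDs are whatever ps shows on your node, and removing the snap takes that node's local LXD data with it, so this really is a last resort):

ps fauxww | grep -E '[l]xd|\[lxc monitor\]'   # find the stuck LXD and monitor processes
sudo kill -9 PID                              # repeat for each stuck PID from the listing
sudo snap remove lxd
sudo snap install lxd
sudo lxd init                                 # reconfigure / rejoin the cluster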

Glad to hear that you got things back online!

Do note that, following the recent actions from Canonical around LXD, we really can't provide support to LXD users on this forum anymore.

You may want to consider switching to Incus, or, if you'd like to stay on LXD, you should reach out on the Canonical forum instead.

Sorry about that!