After restarting Incus, the LXC container won’t start and Incus is completely unresponsive to commands. I can see multiple incusd --group incus-admin --logfile /var/log/incus/incusd.log processes on the host, each consuming 100% CPU.
I’ve reverted the changes to /etc/subuid and /etc/subgid and restarted Incus (and rebooted the host) a couple of times, but Incus immediately gets back into the same state: unresponsive and consuming all of the host’s CPU. There is no logging (/var/log/incus/incusd.log is empty).
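For reference, here’s how I’ve been confirming the reverted mappings and spotting the spinning processes (adjust the grep pattern in the first command for whichever user your Incus maps; root is typical):

# confirm the reverted /etc/subuid and /etc/subgid entries
grep -E '^root:' /etc/subuid /etc/subgid

# list the incusd processes and their CPU usage ([i]ncusd avoids matching grep itself)
ps -eo pid,pcpu,etime,args | grep '[i]ncusd'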
journalctl -u incus.service only yields the following:
Oct 19 22:39:28 <incus hostname> dnsmasq-dhcp[7603]: DHCPv6 stateless on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 19 22:39:28 <incus hostname> dnsmasq-dhcp[7603]: DHCPv4-derived IPv6 names on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 19 22:39:28 <incus hostname> dnsmasq-dhcp[7603]: router advertisement on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 19 22:39:28 <incus hostname> dnsmasq-dhcp[7603]: IPv6 router advertisement enabled
Oct 19 22:39:28 <incus hostname> dnsmasq-dhcp[7603]: DHCP, sockets bound exclusively to interface incusbr0
Oct 19 22:39:28 <incus hostname> dnsmasq[7603]: using only locally-known addresses for incus
Oct 19 22:39:28 <incus hostname> dnsmasq[7603]: reading /etc/resolv.conf
Oct 19 22:39:28 <incus hostname> dnsmasq[7603]: using nameserver 192.168.66.4#53
Oct 19 22:39:28 <incus hostname> dnsmasq[7603]: using only locally-known addresses for incus
Oct 19 22:39:28 <incus hostname> dnsmasq[7603]: read /etc/hosts - 8 names
Any ideas? How can I troubleshoot this further, given that Incus is unresponsive and there’s no logging?
I said before that Incus is entirely unresponsive, but that’s not quite right: I have a couple of instances configured to autostart on boot, and those do start successfully and appear to be running normally. However, I can’t stop, start, or remove any instances; the incus stop, incus start and incus rm commands just hang.
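To make the hang concrete, wrapping the commands in timeout shows they never return (the 10-second limit is arbitrary; timeout exits with 124 when it fires):

timeout 10 incus stop <instance>; echo "exit: $?"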
I also spotted some new logging in journalctl -u incus.service:
Oct 20 10:50:22 <incus hostname> systemd[1]: Starting incus.service - Incus - Daemon...
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: started, version 2.89 cachesize 150
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: compile time options: IPv6 GNU-getopt DBus no-UBus i18n IDN2 DHCP DHCPv6 no-Lua TFTP conntrack ipset nftset auth cr>
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCP, IP range 10.33.200.2 -- 10.33.200.254, lease time 1h
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCPv6 stateless on incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCPv4-derived IPv6 names on incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: router advertisement on incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCPv6 stateless on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCPv4-derived IPv6 names on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: router advertisement on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: IPv6 router advertisement enabled
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCP, sockets bound exclusively to interface incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: using only locally-known addresses for incus
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: reading /etc/resolv.conf
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: using nameserver 192.168.66.4#53
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: using only locally-known addresses for incus
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: read /etc/hosts - 8 names
Oct 20 11:00:22 <incus hostname> incusd[86705]: Error: Daemon still not running after 600s timeout (Daemon not ready yet)
Oct 20 11:00:22 <incus hostname> systemd[1]: incus.service: start-post operation timed out. Terminating.
Oct 20 11:00:22 <incus hostname> systemd[1]: incus.service: Control process exited, code=exited, status=1/FAILURE
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: State 'stop-sigterm' timed out. Killing.
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: Killing process 86704 (incusd) with signal SIGKILL.
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: Main process exited, code=killed, status=9/KILL
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: Failed with result 'timeout'.
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: Unit process 86769 (dnsmasq) remains running after unit stopped.
Oct 20 11:00:52 <incus hostname> systemd[1]: Failed to start incus.service - Incus - Daemon.
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: Consumed 10min 46.952s CPU time.
Oct 20 11:00:53 <incus hostname> systemd[1]: incus.service: Scheduled restart job, restart counter is at 1.
Oct 20 11:00:53 <incus hostname> systemd[1]: Stopped incus.service - Incus - Daemon.
Oct 20 11:00:53 <incus hostname> systemd[1]: incus.service: Consumed 10min 46.952s CPU time.
Oct 20 11:00:53 <incus hostname> systemd[1]: incus.service: Found left-over process 86769 (dnsmasq) in control group while starting unit. Ignoring.
Oct 20 11:00:53 <incus hostname> systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
I tried running incus admin shutdown and that resulted in some logging in /var/log/incus/incusd.log:
time="2024-10-20T11:10:00-04:00" level=error msg="Failed to unshift" device=podman-data driver=disk err="Failed getting extended attributes from \"/var/lib/incus/storage-pools/default/custom/default_podman-data/storage/overlay/554f7648b3fb694d16b7dce79b28ae1688a4263bd885bfd47c12b1e9fa36a7a8/diff/root/.cache/uv/archive-v0/QcjIrKnqp0lQVi94P-usK/apyhiveapi/hive.py\": xattr.list /var/lib/incus/storage-pools/default/custom/default_podman-data/storage/overlay/554f7648b3fb694d16b7dce79b28ae1688a4263bd885bfd47c12b1e9fa36a7a8/diff/root/.cache/uv/archive-v0/QcjIrKnqp0lQVi94P-usK/apyhiveapi/hive.py: no such file or directory" instance=podman-runner path=/var/lib/incus/storage-pools/default/custom/default_podman-data project=default
time="2024-10-20T11:10:00-04:00" level=warning msg="Failed auto start instance attempt" attempt=1 err="Failed to start device \"podman-data\": Failed mounting volume: Failed shifting custom storage volume \"podman-data\" on storage pool \"default\": Failed getting extended attributes from \"/var/lib/incus/storage-pools/default/custom/default_podman-data/storage/overlay/554f7648b3fb694d16b7dce79b28ae1688a4263bd885bfd47c12b1e9fa36a7a8/diff/root/.cache/uv/archive-v0/QcjIrKnqp0lQVi94P-usK/apyhiveapi/hive.py\": xattr.list /var/lib/incus/storage-pools/default/custom/default_podman-data/storage/overlay/554f7648b3fb694d16b7dce79b28ae1688a4263bd885bfd47c12b1e9fa36a7a8/diff/root/.cache/uv/archive-v0/QcjIrKnqp0lQVi94P-usK/apyhiveapi/hive.py: no such file or directory" instance=podman-runner maxAttempts=3 project=default
time="2024-10-20T11:10:05-04:00" level=warning msg="Failed auto start instance attempt" attempt=2 err="Failed to create mount directory \"/var/lib/incus/storage-pools/default/containers/podman-runner\": mkdir /var/lib/incus/storage-pools/default/containers/podman-runner: no such file or directory" instance=podman-runner maxAttempts=3 project=default
time="2024-10-20T11:10:10-04:00" level=warning msg="Failed auto start instance attempt" attempt=3 err="Failed to update volatile idmap: Failed to set volatile config: Failed to begin transaction: sql: database is closed" instance=podman-runner maxAttempts=3 project=default
time="2024-10-20T11:10:10-04:00" level=warning msg="Failed to create instance autostart failure warning" err="Failed to begin transaction: sql: database is closed" instance=podman-runner project=default
time="2024-10-20T11:10:10-04:00" level=error msg="Failed to auto start instance" err="Failed to update volatile idmap: Failed to set volatile config: Failed to begin transaction: sql: database is closed" instance=podman-runner project=default
time="2024-10-20T11:10:10-04:00" level=error msg="Problem loading instances list" err="Failed to begin transaction: sql: database is closed"
So it looks like the storage for the podman-runner instance got into a bad state when I changed the idmapping config, and that one instance is now causing the Incus daemon to hang on startup.
How can I fix or safely remove that instance so the daemon can start properly?
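In case it helps, these are the checks I’m planning to run next against the paths from the errors above (volume and pool names taken verbatim from the log):

# does the custom volume's storage directory still exist?
ls -la /var/lib/incus/storage-pools/default/custom/default_podman-data/

# the second error complains this parent directory is missing; confirm
ls -la /var/lib/incus/storage-pools/default/containers/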
I’d recommend looking at dmesg for any storage-related kernel messages.
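For example, something along these lines; the filter terms are only a rough starting point, so adjust them for your storage stack:

dmesg -T | grep -iE 'error|overlay|xfs|ext4|btrfs|zfs'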
You may also want to configure Incus to start with --verbose or even --debug so you can get a better idea of where it gets stuck.
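One way to do that without editing the packaged unit is a systemd drop-in. The incusd path below is a guess; check systemctl cat incus.service for your real ExecStart line and keep your existing arguments:

systemctl edit incus.service

# in the drop-in, clear ExecStart and re-add it with --debug:
[Service]
ExecStart=
ExecStart=/usr/libexec/incus/incusd --group incus-admin --logfile /var/log/incus/incusd.log --debug

systemctl restart incus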
If your Incus is somehow online enough that incus config show works, you may want to run incus config set core.debug_address=127.0.0.1:8444, which should then let you fetch a goroutine dump with curl http://127.0.0.1:8444/debug/pprof/goroutine?debug=2
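To make that concrete, the sequence would be roughly this (8444 is just an arbitrary local port, and the endpoint is Go’s standard pprof handler):

incus config set core.debug_address=127.0.0.1:8444

# dump the stack of every goroutine; blocked ones show what incusd is waiting on
curl 'http://127.0.0.1:8444/debug/pprof/goroutine?debug=2' > goroutines.txt

Goroutines stuck in syscall or mutex waits will usually point at the hung storage operation.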