Incus unresponsive with high CPU usage after changing instance config

Incus version 6.6 on Debian 12

I was attempting to mount a host directory into an LXC container with write capability. I came across this article from 2017: Custom user mappings in LXD containers | Stéphane Graber's website

Per the article, I ran the following commands to allow Incus to use my user ID (1000):

printf "incus:$(id -u):1\nroot:$(id -u):1\n" | sudo tee -a /etc/subuid
printf "incus:$(id -g):1\nroot:$(id -g):1\n" | sudo tee -a /etc/subgid

and added the following raw idmap and host directory mount to the config of my container:

raw.idmap: |-
  uid 1000 1000
  gid 1000 1000
...
files:
  path: /data/files
  readonly: "false"
  source: /host/path/files
  type: disk
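
(For reference, a disk device like that can be added with something along these lines; the container name is a placeholder:)

incus config device add <container> files disk source=/host/path/files path=/data/files readonly=false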

After restarting Incus, the LXC container won’t start and Incus is completely unresponsive to commands. I can see multiple processes on the host called incusd --group incus-admin --logfile /var/log/incus/incusd.log that are consuming 100% CPU.

I’ve reverted the changes to /etc/subuid and /etc/subgid and restarted Incus (and rebooted the host) a couple of times, but Incus immediately gets back into the same state: unresponsive and consuming all of the host’s CPU. There is no logging (/var/log/incus/incusd.log is empty).

journalctl -u incus.service only yields the following:

Oct 19 22:39:28 <incus hostname> dnsmasq-dhcp[7603]: DHCPv6 stateless on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 19 22:39:28 <incus hostname> dnsmasq-dhcp[7603]: DHCPv4-derived IPv6 names on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 19 22:39:28 <incus hostname> dnsmasq-dhcp[7603]: router advertisement on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 19 22:39:28 <incus hostname> dnsmasq-dhcp[7603]: IPv6 router advertisement enabled
Oct 19 22:39:28 <incus hostname> dnsmasq-dhcp[7603]: DHCP, sockets bound exclusively to interface incusbr0
Oct 19 22:39:28 <incus hostname> dnsmasq[7603]: using only locally-known addresses for incus
Oct 19 22:39:28 <incus hostname> dnsmasq[7603]: reading /etc/resolv.conf
Oct 19 22:39:28 <incus hostname> dnsmasq[7603]: using nameserver 192.168.66.4#53
Oct 19 22:39:28 <incus hostname> dnsmasq[7603]: using only locally-known addresses for incus
Oct 19 22:39:28 <incus hostname> dnsmasq[7603]: read /etc/hosts - 8 names

Any ideas? How can I troubleshoot this further, given that Incus is unresponsive and there’s no logging?

Some additional info:

I said before that Incus is entirely unresponsive, but that’s not quite right: I have a couple of instances configured to autostart on boot, and those do start successfully and appear to be running normally. However, I can’t stop, start, or remove any instances; the incus stop, incus start and incus rm commands just hang.
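
(In case it’s useful to anyone trying to reproduce this, wrapping commands in a timeout avoids leaving hung shells; the instance name is a placeholder:)

timeout 30 incus stop <instance> || echo "timed out"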

I also spotted some new logging in journalctl -u incus.service:

Oct 20 10:50:22 <incus hostname> systemd[1]: Starting incus.service - Incus - Daemon...
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: started, version 2.89 cachesize 150
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: compile time options: IPv6 GNU-getopt DBus no-UBus i18n IDN2 DHCP DHCPv6 no-Lua TFTP conntrack ipset nftset auth cr>
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCP, IP range 10.33.200.2 -- 10.33.200.254, lease time 1h
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCPv6 stateless on incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCPv4-derived IPv6 names on incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: router advertisement on incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCPv6 stateless on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCPv4-derived IPv6 names on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: router advertisement on fd42:8a5c:ea2b:ef67::, constructed for incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: IPv6 router advertisement enabled
Oct 20 10:50:24 <incus hostname> dnsmasq-dhcp[86769]: DHCP, sockets bound exclusively to interface incusbr0
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: using only locally-known addresses for incus
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: reading /etc/resolv.conf
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: using nameserver 192.168.66.4#53
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: using only locally-known addresses for incus
Oct 20 10:50:24 <incus hostname> dnsmasq[86769]: read /etc/hosts - 8 names
Oct 20 11:00:22 <incus hostname> incusd[86705]: Error: Daemon still not running after 600s timeout (Daemon not ready yet)
Oct 20 11:00:22 <incus hostname> systemd[1]: incus.service: start-post operation timed out. Terminating.
Oct 20 11:00:22 <incus hostname> systemd[1]: incus.service: Control process exited, code=exited, status=1/FAILURE
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: State 'stop-sigterm' timed out. Killing.
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: Killing process 86704 (incusd) with signal SIGKILL.
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: Main process exited, code=killed, status=9/KILL
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: Failed with result 'timeout'.
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: Unit process 86769 (dnsmasq) remains running after unit stopped.
Oct 20 11:00:52 <incus hostname> systemd[1]: Failed to start incus.service - Incus - Daemon.
Oct 20 11:00:52 <incus hostname> systemd[1]: incus.service: Consumed 10min 46.952s CPU time.
Oct 20 11:00:53 <incus hostname> systemd[1]: incus.service: Scheduled restart job, restart counter is at 1.
Oct 20 11:00:53 <incus hostname> systemd[1]: Stopped incus.service - Incus - Daemon.
Oct 20 11:00:53 <incus hostname> systemd[1]: incus.service: Consumed 10min 46.952s CPU time.
Oct 20 11:00:53 <incus hostname> systemd[1]: incus.service: Found left-over process 86769 (dnsmasq) in control group while starting unit. Ignoring.
Oct 20 11:00:53 <incus hostname> systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

I tried running incus admin shutdown and that resulted in some logging in /var/log/incus/incusd.log:

time="2024-10-20T11:10:00-04:00" level=error msg="Failed to unshift" device=podman-data driver=disk err="Failed getting extended attributes from \"/var/lib/incus/storage-pools/default/custom/default_podman-data/storage/overlay/554f7648b3fb694d16b7dce79b28ae1688a4263bd885bfd47c12b1e9fa36a7a8/diff/root/.cache/uv/archive-v0/QcjIrKnqp0lQVi94P-usK/apyhiveapi/hive.py\": xattr.list /var/lib/incus/storage-pools/default/custom/default_podman-data/storage/overlay/554f7648b3fb694d16b7dce79b28ae1688a4263bd885bfd47c12b1e9fa36a7a8/diff/root/.cache/uv/archive-v0/QcjIrKnqp0lQVi94P-usK/apyhiveapi/hive.py: no such file or directory" instance=podman-runner path=/var/lib/incus/storage-pools/default/custom/default_podman-data project=default
time="2024-10-20T11:10:00-04:00" level=warning msg="Failed auto start instance attempt" attempt=1 err="Failed to start device \"podman-data\": Failed mounting volume: Failed shifting custom storage volume \"podman-data\" on storage pool \"default\": Failed getting extended attributes from \"/var/lib/incus/storage-pools/default/custom/default_podman-data/storage/overlay/554f7648b3fb694d16b7dce79b28ae1688a4263bd885bfd47c12b1e9fa36a7a8/diff/root/.cache/uv/archive-v0/QcjIrKnqp0lQVi94P-usK/apyhiveapi/hive.py\": xattr.list /var/lib/incus/storage-pools/default/custom/default_podman-data/storage/overlay/554f7648b3fb694d16b7dce79b28ae1688a4263bd885bfd47c12b1e9fa36a7a8/diff/root/.cache/uv/archive-v0/QcjIrKnqp0lQVi94P-usK/apyhiveapi/hive.py: no such file or directory" instance=podman-runner maxAttempts=3 project=default
time="2024-10-20T11:10:05-04:00" level=warning msg="Failed auto start instance attempt" attempt=2 err="Failed to create mount directory \"/var/lib/incus/storage-pools/default/containers/podman-runner\": mkdir /var/lib/incus/storage-pools/default/containers/podman-runner: no such file or directory" instance=podman-runner maxAttempts=3 project=default
time="2024-10-20T11:10:10-04:00" level=warning msg="Failed auto start instance attempt" attempt=3 err="Failed to update volatile idmap: Failed to set volatile config: Failed to begin transaction: sql: database is closed" instance=podman-runner maxAttempts=3 project=default
time="2024-10-20T11:10:10-04:00" level=warning msg="Failed to create instance autostart failure warning" err="Failed to begin transaction: sql: database is closed" instance=podman-runner project=default
time="2024-10-20T11:10:10-04:00" level=error msg="Failed to auto start instance" err="Failed to update volatile idmap: Failed to set volatile config: Failed to begin transaction: sql: database is closed" instance=podman-runner project=default
time="2024-10-20T11:10:10-04:00" level=error msg="Problem loading instances list" err="Failed to begin transaction: sql: database is closed"

So it looks like the storage for the podman-runner instance (its podman-data custom volume) got into a bad state when I changed the idmap config, and that one instance is now causing the Incus daemon to hang on startup.

How can I fix or safely remove that instance so the daemon can start properly?

Hmm, yeah, something looks a bit mad.

I’d recommend looking at dmesg for any storage-related kernel messages.
You may also want to configure Incus to start with --verbose or even --debug so you can get a better idea of where it gets stuck.
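
For example, you could stop the service and run the daemon in the foreground; the incusd path varies by package, so take it from the ExecStart= line of your unit:

systemctl stop incus.service incus.socket    # incus.socket only if your packaging ships it
/path/to/incusd --debug --group incus-admin  # substitute the real binary path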

If your Incus is somehow online enough to reach incus config show, you may want to do incus config set core.debug_address=127.0.0.1:8444, which should then let you do curl http://127.0.0.1:8444/debug/pprof/goroutine?debug=2 to see where the daemon’s goroutines are stuck.
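
That is, roughly (dumping to a file makes it easier to attach the result here):

incus config set core.debug_address=127.0.0.1:8444
curl -s "http://127.0.0.1:8444/debug/pprof/goroutine?debug=2" > /tmp/incus-goroutines.txt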