Unhealthy deamon after lxd-to-incus (and going bullseye to bookworm)

I think I pushed my luck when migrating from lxd to incus :sweat_smile: I was running bullseye with lxd 5.0.x combined with an lvm thinpool, migrated to incus using the zabbly stable package with lxd-to-incus.

As this box runs my openwrt container and is headless, at some point during the migration I lost ssh access and wasn’t able to see the final messages. openwrt container didn’t come back, but after a powercycle everything was running smoothly. Migration seems to have worked flawlessly there, all containers came back up.

Feeling confident, I then triggered a bullseye to bookworm migration, after which incus seems less happy… Only 2 containers are booting (luckily openwrt being one of them), but my other containers don’t start and the daemon seems to be hanging (100% on 1 core).

Linux midge 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
root           2  0.0  0.0      0     0 ?        S    14:58   0:00 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [rcu_par_gp]
root           5  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [slub_flushwq]
root           6  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [netns]
root           8  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kworker/0:0H-events_highpri]
root           9  0.6  0.0      0     0 ?        I    14:58   0:14  \_ [kworker/u8:0-events_unbound]
root          10  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [mm_percpu_wq]
root          11  0.0  0.0      0     0 ?        I    14:58   0:00  \_ [rcu_tasks_kthread]
root          12  0.0  0.0      0     0 ?        I    14:58   0:00  \_ [rcu_tasks_rude_kthread]
root          13  0.0  0.0      0     0 ?        I    14:58   0:00  \_ [rcu_tasks_trace_kthread]
root          14  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [ksoftirqd/0]
root          15  0.1  0.0      0     0 ?        I    14:58   0:04  \_ [rcu_preempt]
root          16  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [migration/0]
root          18  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [cpuhp/0]
root          19  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [cpuhp/1]
root          20  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [migration/1]
root          21  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [ksoftirqd/1]
root          23  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kworker/1:0H-events_highpri]
root          24  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [cpuhp/2]
root          25  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [migration/2]
root          26  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [ksoftirqd/2]
root          28  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kworker/2:0H-events_highpri]
root          29  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [cpuhp/3]
root          30  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [migration/3]
root          31  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [ksoftirqd/3]
root          33  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kworker/3:0H-events_highpri]
root          35  0.1  0.0      0     0 ?        I    14:58   0:04  \_ [kworker/u8:1-dm-thin]
root          38  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [kdevtmpfs]
root          39  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [inet_frag_wq]
root          40  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [kauditd]
root          42  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [khungtaskd]
root          43  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [oom_reaper]
root          44  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [writeback]
root          45  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [kcompactd0]
root          46  0.0  0.0      0     0 ?        SN   14:58   0:00  \_ [ksmd]
root          47  0.0  0.0      0     0 ?        SN   14:58   0:00  \_ [khugepaged]
root          48  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kintegrityd]
root          49  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kblockd]
root          50  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [blkcg_punt_bio]
root          51  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [tpm_dev_wq]
root          52  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [edac-poller]
root          53  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [devfreq_wq]
root          55  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kworker/1:1H-kblockd]
root          56  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [kswapd0]
root          63  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kthrotld]
root          65  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [irq/122-aerdrv]
root          66  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [irq/122-pcie-dpc]
root          67  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [irq/123-aerdrv]
root          68  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [irq/123-pcie-dpc]
root          69  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [irq/124-aerdrv]
root          70  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [irq/124-pcie-dpc]
root          71  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [irq/125-aerdrv]
root          72  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [irq/125-pcie-dpc]
root          73  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [irq/126-aerdrv]
root          74  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [irq/126-pcie-dpc]
root          75  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [acpi_thermal_pm]
root          76  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [mld]
root          77  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kworker/2:1H-kblockd]
root          78  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [ipv6_addrconf]
root          83  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kstrp]
root          88  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [zswap-shrink]
root          89  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kworker/u9:0]
root         142  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kworker/0:1H-kblockd]
root         143  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kworker/3:1H-kblockd]
root         164  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [ata_sff]
root         169  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [scsi_eh_0]
root         170  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [scsi_tmf_0]
root         172  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [scsi_eh_1]
root         173  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [scsi_tmf_1]
root         177  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [nvme-wq]
root         178  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [nvme-reset-wq]
root         179  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [nvme-delete-wq]
root         180  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [spi0]
root         185  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kdmflush/254:0]
root         186  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kdmflush/254:1]
root         193  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [dm_bufio_cache]
root         196  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kdmflush/254:2]
root         197  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kdmflush/254:3]
root         205  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kdmflush/254:4]
root         206  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kcopyd]
root         207  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [dm-thin]
root         208  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kdmflush/254:5]
root         239  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [jbd2/dm-0-8]
root         240  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [ext4-rsv-conver]
root         390  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [watchdogd]
root         391  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [cryptd]
root         454  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [card0-crtc0]
root         455  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [card0-crtc1]
root         456  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [card0-crtc2]
root         459  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [ext4-rsv-conver]
root         622  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [dio/dm-0]
root         642  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kdmflush/254:6]
root         647  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [jbd2/dm-6-8]
root         648  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [ext4-rsv-conver]
root         717  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kdmflush/254:7]
root         776  0.0  0.0      0     0 ?        S    14:58   0:00  \_ [jbd2/dm-7-8]
root         777  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [ext4-rsv-conver]
root         941  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [kdmflush/254:8]
root         986  0.3  0.0      0     0 ?        S    14:58   0:07  \_ [jbd2/dm-8-8]
root         987  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [ext4-rsv-conver]
root        1023  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [cfg80211]
root        2321  0.0  0.0      0     0 ?        I<   14:58   0:00  \_ [wg-crypt-wg0]
root        3604  0.1  0.0      0     0 ?        I    15:05   0:02  \_ [kworker/0:3-events]
root        4216  0.2  0.0      0     0 ?        I    15:15   0:03  \_ [kworker/1:0-events]
root        4218  0.0  0.0      0     0 ?        I    15:15   0:00  \_ [kworker/1:3-kdmflush/254:0]
root        4286  0.0  0.0      0     0 ?        I    15:15   0:01  \_ [kworker/3:2-mm_percpu_wq]
root        4364  0.3  0.0      0     0 ?        I    15:20   0:04  \_ [kworker/2:1-events]
root        4479  0.1  0.0      0     0 ?        I    15:31   0:00  \_ [kworker/0:1-kdmflush/254:0]
root        4509  0.0  0.0      0     0 ?        I    15:31   0:00  \_ [kworker/u8:6-dm-thin]
root        4512  0.0  0.0      0     0 ?        I    15:31   0:00  \_ [kworker/3:1-kdmflush/254:0]
root        4527  0.0  0.0      0     0 ?        I    15:31   0:00  \_ [kworker/2:2]
root        4545  0.0  0.0      0     0 ?        I    15:36   0:00  \_ [kworker/u8:2-ext4-rsv-conversion]
root           1  0.1  0.0 169136 13824 ?        Ss   14:58   0:03 /sbin/init
root         284  0.0  0.0  49376 16100 ?        Ss   14:58   0:00 /lib/systemd/systemd-journald
root         292  0.0  0.1  80580 25288 ?        SLsl 14:58   0:00 /sbin/dmeventd -f
root         308  0.0  0.0  26528  6756 ?        Ss   14:58   0:00 /lib/systemd/systemd-udevd
systemd+     483  0.0  0.0  90056  6716 ?        Ssl  14:58   0:00 /lib/systemd/systemd-timesyncd
root         509  0.0  0.0   6608  2728 ?        Ss   14:58   0:00 /usr/sbin/cron -f
message+     512  0.0  0.0   7912  3828 ?        Ss   14:58   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root         525  0.0  0.0 221776  6048 ?        Ssl  14:58   0:00 /usr/sbin/rsyslogd -n -iNONE
root         527  0.0  0.0  24844  7972 ?        Ss   14:58   0:00 /lib/systemd/systemd-logind
root         601  0.0  0.0   5872  1072 tty1     Ss+  14:58   0:00 /sbin/agetty -o -p -- \u --noclear - linux
root         611  0.0  0.0  15412  8796 ?        Ss   14:58   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root        3148  0.0  0.0  17392 10668 ?        Ss   14:59   0:00  \_ sshd: bas [priv]
bas         3171  0.0  0.0  17652  6556 ?        S    14:59   0:00      \_ sshd: bas@pts/0
bas         3172  0.0  0.0   7196  3920 pts/0    Ss   14:59   0:00          \_ -bash
bas         3171  0.0  0.0  17652  6556 ?        S    14:59   0:00      \_ sshd: bas@pts/0
bas         3172  0.0  0.0   7196  3920 pts/0    Ss   14:59   0:00          \_ -bash
root        3175  0.0  0.0   9788  4660 pts/0    S+   14:59   0:00              \_ sudo -i
root        3176  0.0  0.0   9788   500 pts/1    Ss   14:59   0:00                  \_ sudo -i
root        3177  0.0  0.0   7196  4056 pts/1    S    14:59   0:00                      \_ -bash
root        3510  0.0  0.0   8004  3788 pts/1    S+   15:02   0:00                          \_ tmux
root         677  0.0  0.1 5933364 16836 ?       Ss   14:58   0:00 [lxc monitor] /var/lib/incus/containers caddy
root         855  0.0  0.1 5933364 19044 ?       Ss   14:58   0:00 [lxc monitor] /var/lib/incus/containers lumon
bas         3151  0.0  0.0  19072 10868 ?        Ss   14:59   0:00 /lib/systemd/systemd --user
bas         3152  0.0  0.0 168460  3148 ?        S    14:59   0:00  \_ (sd-pam)
root        3512  0.0  0.0  10112  5620 ?        Ss   15:02   0:01 tmux
root        3513  0.0  0.0   7336  4168 pts/2    Ss   15:02   0:00  \_ -bash
root        4548  0.0  0.0  11148  4384 pts/2    R+   15:37   0:00  |   \_ ps fauxww
root        3595  0.0  0.0   7308  4212 pts/3    Ss+  15:04   0:00  \_ -bash
root        4330  0.0  0.0 152788  1632 ?        Ssl  15:20   0:00 /opt/incus/bin/lxcfs /var/lib/incus-lxcfs
root        4480 99.8  1.1 6578200 188968 ?      Ssl  15:31   6:31 incusd --group incus-admin --logfile /var/log/incus/incusd.log
root        4481  0.3  0.3 5935028 54260 ?       Ssl  15:31   0:01 incusd waitready --timeout=600
● incus.service - Incus - Daemon
     Loaded: loaded (/lib/systemd/system/incus.service; indirect; preset: enabled)
     Active: activating (start-post) since Sat 2024-03-02 15:31:24 CET; 9min ago
TriggeredBy: ● incus.socket
   Main PID: 4480 (incusd); Control PID: 4481 (incusd)
      Tasks: 25
     Memory: 193.9M
        CPU: 9min 53.958s
     CGroup: /system.slice/incus.service
             ├─4480 incusd --group incus-admin --logfile /var/log/incus/incusd.log
             └─.control
               └─4481 incusd waitready --timeout=600

Mar 02 15:31:24 midge systemd[1]: Starting incus.service - Incus - Daemon...

Any suggestions for what I could try? triggered a reinstall of incus and incus-base, but to no avail.

(Really enjoyed the FOSDEM Incus session!)

Anything useful in journactl -u incus -n 300 or in /var/log/incus/incusd.log?
Maybe also check dmesg for good measure?

The two containers being seemingly stuck very early in boot definitely looks quite odd.

Is your lvs output looking as expected?

Ah, my initial post was a bit misleading I see, sorry about that - I stripped the child processes from the 2 running containers in the ps output, as they were behaving fine. The issue was the deamon never finishing starting the other containers and not being responsive to most commands.

lvs output showed all expected volumes, lsblk -pf made me realize the volume for a third container (minio) was mounted, yet the container itself never booted.

It made me suspect that this was the container which the daemon was choking on. A quick unmount and fsck showed no issues with the file system (daemon did remount it instantly).

As this was a volume with many separate files (minio instance with mastodon media cache), being a big difference from the other containers that did boot, I decided to purge the cache (eg: way fewer remaining files in volume).

That seems to have done the trick? :sweat_smile: All containers are now running flawlessly, daemon also happy.

Seems that it broke the cycle of the daemon getting killed after 600s timeout, it also yielded some sputtering about idmapped storage issues (this seems related to the parallel removing of files in the fs, done outside of incus) in the journalctl -u incus -n 300 output.

No meaningful log lines from before the file purging - could it be that it was simply taking longer than 600s to do the initial idmapping? (incusd had 1 core at 100%, did not see I/O blocking though)

Mar 03 09:12:14 midge incusd[10839]: Error: Daemon still not running after 600s timeout (Daemon not ready yet)
Mar 03 09:12:14 midge systemd[1]: incus.service: Control process exited, code=exited, status=1/FAILURE
Mar 03 09:12:44 midge systemd[1]: incus.service: State 'stop-sigterm' timed out. Killing.
Mar 03 09:12:44 midge systemd[1]: incus.service: Killing process 10838 (incusd) with signal SIGKILL.
Mar 03 09:12:44 midge systemd[1]: incus.service: Main process exited, code=killed, status=9/KILL
Mar 03 09:12:44 midge systemd[1]: incus.service: Failed with result 'exit-code'.
Mar 03 09:12:44 midge systemd[1]: Failed to start incus.service - Incus - Daemon.
Mar 03 09:12:44 midge systemd[1]: incus.service: Consumed 10min 33.353s CPU time.
Mar 03 09:12:45 midge systemd[1]: incus.service: Scheduled restart job, restart counter is at 102.
Mar 03 09:12:45 midge systemd[1]: Stopped incus.service - Incus - Daemon.
Mar 03 09:12:45 midge systemd[1]: incus.service: Consumed 10min 33.353s CPU time.
Mar 03 09:12:45 midge systemd[1]: Starting incus.service - Incus - Daemon...
Mar 03 09:19:52 midge incusd[10924]: time="2024-03-03T09:19:52+01:00" level=warning msg="Failed auto start instance attempt" attempt=1 err="Failed to handle idmapped storage: Failed getting extended attributes f>
Mar 03 09:20:37 midge incusd[10924]: time="2024-03-03T09:20:37+01:00" level=warning msg="Failed auto start instance attempt" attempt=2 err="Failed to handle idmapped storage: Failed getting extended attributes f>
Mar 03 09:22:45 midge incusd[10925]: Error: Daemon still not running after 600s timeout (Daemon not ready yet)
Mar 03 09:22:45 midge systemd[1]: incus.service: start-post operation timed out. Terminating.
Mar 03 09:22:45 midge systemd[1]: incus.service: Control process exited, code=exited, status=1/FAILURE
Mar 03 09:23:15 midge systemd[1]: incus.service: State 'stop-sigterm' timed out. Killing.
Mar 03 09:23:15 midge systemd[1]: incus.service: Killing process 10924 (incusd) with signal SIGKILL.
Mar 03 09:23:15 midge systemd[1]: incus.service: Main process exited, code=killed, status=9/KILL
Mar 03 09:23:15 midge systemd[1]: incus.service: Failed with result 'timeout'.
Mar 03 09:23:15 midge systemd[1]: Failed to start incus.service - Incus - Daemon.
Mar 03 09:23:15 midge systemd[1]: incus.service: Consumed 8min 48.409s CPU time.
Mar 03 09:23:15 midge systemd[1]: incus.service: Scheduled restart job, restart counter is at 103.
Mar 03 09:23:15 midge systemd[1]: Stopped incus.service - Incus - Daemon.
Mar 03 09:23:15 midge systemd[1]: incus.service: Consumed 8min 48.409s CPU time.
Mar 03 09:23:15 midge systemd[1]: Starting incus.service - Incus - Daemon...
Mar 03 09:24:04 midge incusd[11076]: time="2024-03-03T09:24:04+01:00" level=warning msg="Failed auto start instance attempt" attempt=1 err="Failed to handle idmapped storage: lstat /var/lib/incus/storage-pools/p>
Mar 03 09:24:44 midge incusd[11076]: time="2024-03-03T09:24:44+01:00" level=warning msg="Failed auto start instance attempt" attempt=2 err="Failed to handle idmapped storage: lstat /var/lib/incus/storage-pools/p>
Mar 03 09:25:29 midge incusd[11076]: time="2024-03-03T09:25:29+01:00" level=warning msg="Failed auto start instance attempt" attempt=3 err="Failed to handle idmapped storage: lstat /var/lib/incus/storage-pools/p>
Mar 03 09:25:29 midge incusd[11076]: time="2024-03-03T09:25:29+01:00" level=error msg="Failed to auto start instance" err="Failed to handle idmapped storage: lstat /var/lib/incus/storage-pools/pool/containers/mi>
Mar 03 09:25:34 midge systemd[1]: Started incus.service - Incus - Daemon.