IncusOS hangs on starting Incus (API unavailable)

Hello,

I have (had!) an IncusOS system running with a few instances. In the past two months, the server crashed or became unresponsive a few times. I suspect a VM with a demanding job is the cause (though I thought the hypervisor would not go down if a VM went crazy :sweat_smile:), and I have not yet found a way to discover what happened (through the Incus client).

Today, IncusOS boots and its WireGuard VPN connects, but the API is not available (unable to connect over TCP). On the display attached to the server, it hangs on ‘Starting the incus service’:

When rebooting with Ctrl-Alt-Del, there’s a brief error; I think it says something about lxc-fs, but I could not read further.

I tried booting from a live USB, but most live systems lack ZFS support. On Lubuntu 25.10 I could install ZFS, but it looked like it could not open the ZFS dataset created by IncusOS:

Also, I thought the data was encrypted, but it looks like only the system partitions are encrypted; is this correct?

I really have no idea what to do now; any help is much appreciated,

thanks,

slt

ZFS is encrypted through its native encryption, so you can see the zpool and list the datasets, but accessing any data requires the key.

It looks like Incus is unable to start on your server, and since Incus is what provides the API, you’re kinda stuck there.

From that live USB, you should be able to unlock the root disk, which is partition 10 on the disk. You can use cryptsetup luksOpen /dev/nvme0n1p10 root-crypt, then mount /dev/mapper/root-crypt /mnt to access the root disk.

If you’re able to do that, it may be useful to extract the data from /var/log/journal. With that extracted and parsed, you should be able to get the actual Incus startup error, and then we can figure out the best way to recover from it.
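A minimal sketch of those steps as a shell function, assuming the disk layout described above (/dev/nvme0n1p10 as the encrypted root partition) and a hypothetical mount point; adjust both for your hardware:

```shell
#!/bin/sh
# Sketch of the recovery steps above; the device path and mount point
# are assumptions from this thread, adjust for your system.

recover_incusos_root() {
    disk=${1:-/dev/nvme0n1p10}    # partition 10 holds the encrypted root
    mnt=${2:-/mnt/incusos-root}

    if [ ! -b "$disk" ]; then
        echo "no such block device: $disk" >&2
        return 1
    fi

    # Unlock the LUKS partition (prompts for the recovery passphrase)
    cryptsetup luksOpen "$disk" root-crypt

    # Mount the decrypted root filesystem
    mkdir -p "$mnt"
    mount /dev/mapper/root-crypt "$mnt"

    # Read the persisted journal to find the actual Incus startup error
    journalctl -D "$mnt/var/log/journal" --boot | grep -i incus | tail -50
}

# Usage: recover_incusos_root /dev/nvme0n1p10
```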

Thanks for your answer.

Ok for the encryption, that’s good to know.

Now, my problem is that the different live systems I tried all lacked recent-enough ZFS support, so I’m stuck, unable to access the root disk.

I’ll inspect the journal tomorrow and report back.

But right now, I’m a bit concerned by the fact that I have no way to remotely access and debug the system when the service goes down.

Don’t take this the wrong way, I’m just sharing my thoughts and hoping for more inputs:

→ Is it actually better/safer to run Incus from a somewhat minimal system (e.g. Debian with Incus from the Zabbly repository)?
→ Also, I read that ZFS is more battle-tested, but struggling to find a live system able to access ZFS pools (when something goes wrong) makes me wonder if it’s really a good idea for me :white_question_mark: .

=> If I wanted to migrate from IncusOS to Debian + Incus on btrfs: what would be the key steps to “extract” images and instances (I can easily re-create profiles and ACLs if need be) so I can re-import them on the new system?

Thanks in advance.

PS: I’m asking several things at once because I think we live/work in different timezones and I need to get the instances back up and running as soon as possible, so I’m trying to consider my options.

We’re using the latest stable upstream ZFS release.
I produce packages of that version for both Ubuntu LTS and Debian, available at GitHub - zabbly/zfs: OpenZFS builds · GitHub

We also have opened an issue about having a recovery mechanism for such situations per Emergency HTTPS listener to handle primary application startup failure · Issue #920 · lxc/incus-os · GitHub

Of course you could use a regular distribution and just run the Incus package on it, though IncusOS is Debian 13 with the Zabbly Incus packages, so whatever is preventing Incus from starting on IncusOS would likely have happened there too, just with an easier way to get to a terminal to look at logs :slight_smile:

If you could get to a working Incus system, you could use incus export to get an export that can be imported to a different storage driver like btrfs.

Since that’s currently not really an option, getting ZFS 2.4.1 running in a live environment should let you access the data by using the decryption key from /var/lib/incus-os/ on the LUKS ext4 partition. Once the ZFS pool is unlocked, you can manually mount the datasets somewhere and copy the data out.

But there’s no trivial way to get it back into Incus from there. Your best bet would be to create empty instances in your new environment and then move the old data onto them.
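A rough sketch of that recovery path, assuming ZFS 2.4.1 is available in the live environment; the pool name and the exact key file under /var/lib/incus-os/ are placeholders here, not confirmed paths:

```shell
#!/bin/sh
# Sketch: unlock an IncusOS ZFS pool from a live environment.
# The pool name and key file location are placeholders; the real key
# lives somewhere under /var/lib/incus-os/ on the mounted LUKS root.

unlock_pool() {
    pool=${1:-}
    keyfile=${2:-}
    if [ -z "$pool" ] || [ -z "$keyfile" ]; then
        echo "usage: unlock_pool <pool> <keyfile>" >&2
        return 1
    fi
    if [ ! -f "$keyfile" ]; then
        echo "key file not found: $keyfile" >&2
        return 1
    fi

    # Import without mounting, load the native-encryption key, then mount
    zpool import -N "$pool"
    zfs load-key -L "file://$keyfile" "$pool"
    zfs mount -a

    # See what's there
    zfs list -o name,used,available,mountpoint
}

# Usage: unlock_pool <poolname> /mnt/incusos-root/var/lib/incus-os/<keyfile>
```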

Hello again,

Inspecting the output of sudo journalctl -D path/to/root-crypt/var/log/journal --boot revealed these important error messages (/var/lib/incus/backups/custom: no space left on device):

Mar 09 14:12:10 983b2896-fd82-0b8e-7569-0168e3d65c2c incusd[1458]: time="2026-03-09T14:12:10Z" level=error msg="Failed to start the daemon" err="Failed to chmod storage dir \"/var/lib/incus/backups/custom\": chmod /var/lib/incus/backups/custom: no space left on device"
Mar 09 14:12:10 983b2896-fd82-0b8e-7569-0168e3d65c2c incusd[1458]: Error: Failed to chmod storage dir "/var/lib/incus/backups/custom": chmod /var/lib/incus/backups/custom: no space left on device

I then wanted to install ZFS from the Zabbly repository on the live Ubuntu 25.10 system but faced another error; it looks like the repository is not compatible with this version of Ubuntu (‘questing’):

:sweat_smile: why is everything so complicated!? :face_with_peeking_eye:

I’m now preparing a Debian live USB system which, I hope, will allow me to delete some unimportant instances from the ZFS data partition:

=> any advice before I do?

Thanks!

EDIT: I’m on the Debian live USB system. I still find the process a bit complicated, so here are my notes in case someone else needs to do this:

  • add the Zabbly ZFS repository (adapt the DISTRO and ARCH keywords in the source file)
  • install ZFS: apt-get install openzfs-zfsutils openzfs-zfs-dkms openzfs-zfs-initramfs
  • install kernel headers: sudo apt-get install linux-headers-$(uname -r)
  • prepare the zfs module: sudo apt-get install --reinstall zfs-dkms
  • load the module and import the zfs pool: sudo modprobe zfs && sudo zpool import <poolname>
  • inspect data:
sudo zfs list -o name,used,available,referenced,mountpoint
sudo zfs list -t snapshot -o name,used,referenced,creation
  • delete unneeded snapshots (and possibly also containers): sudo zfs destroy <poolname/dataset@snapname>
  • export the zfs pool: sudo zpool export <poolname>
  • reboot, enable Secure Boot, reboot, and check!
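The notes above, condensed into a rough function (the pool name is a placeholder, and the package names are the Zabbly ones used above; this is a sketch, not a tested script):

```shell
#!/bin/sh
# Rough consolidation of the notes above; run from a Debian live system
# with the Zabbly ZFS repository already configured. The pool name is a
# placeholder for your actual pool.

inspect_pool() {
    pool=${1:-}
    if [ -z "$pool" ]; then
        echo "usage: inspect_pool <poolname>" >&2
        return 1
    fi

    sudo apt-get install -y openzfs-zfsutils openzfs-zfs-dkms \
        openzfs-zfs-initramfs linux-headers-"$(uname -r)"
    sudo apt-get install -y --reinstall openzfs-zfs-dkms
    sudo modprobe zfs
    sudo zpool import "$pool"

    # Inspect space usage before deleting anything
    sudo zfs list -o name,used,available,referenced,mountpoint
    sudo zfs list -t snapshot -o name,used,referenced,creation

    # After freeing space with `sudo zfs destroy <dataset@snap>`,
    # export cleanly with `sudo zpool export <poolname>` before rebooting.
}
```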

Ah, that’s an annoying one to run into.
We have a bunch of logic to handle the root disk getting full, but clearly a ZFS pool getting full can be just as problematic (especially the local one).

@gibmat Should we add a background check to incus-osd, say every 5 minutes or so, and then issue a warning + modal if we find a pool with less than 1GiB of space left?
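For reference, a tiny sketch of what such a check could look like (the 1 GiB threshold from above; in practice incus-osd would do this natively rather than shelling out, so this is illustrative only):

```shell
#!/bin/sh
# Sketch of the proposed low-space check. check_pools reads
# "name free_bytes" pairs on stdin and warns below a 1 GiB threshold;
# in practice the input would come from `zpool list -Hp -o name,free`.

check_pools() {
    threshold=$((1024 * 1024 * 1024))   # 1 GiB in bytes
    while read -r name free; do
        if [ "$free" -lt "$threshold" ]; then
            echo "WARNING: pool $name has only $free bytes free"
        fi
    done
}

# Usage: zpool list -Hp -o name,free | check_pools
```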

Hi
I hope it’s fine to reply here. I can start a new thread if necessary.

We ran into (sort of) the same issue, with only 0.02 GB of free space.


This happened while trying to test USB passthrough, which somehow caused the VM to go into an error state. After a reboot of the physical server, the issue appeared.

Error state steps (from memory):

  1. Insert usb

  2. change instance config to add device as usb

    • device is visible without needing a reboot
  3. mount usb as type ntfs → works

  4. start copying files → copy hangs (can still run commands, but copy is frozen)

  5. stop copy operation

  6. reboot VM (rationale: maybe passthrough was not working correctly)

  7. reboot VM → VM in error state → cannot boot (no clear error message)

  8. delete instance

  9. Error while recreating instance ‘inserting volume’ → see below

    Failed creating instance from image: Error inserting volume "e9b82eedf3ea689a0c9c8363c7ceb844af9fb85fe17e647f30e2d3ab9e825f5e" for project "default" in pool "main_pool" of type "images" into database "UNIQUE constraint failed: index 'storage_volumes_unique_storage_pool_id_node_id_project_id_name_type'"
    
  10. Try to create instance after removing problematic image → ‘zfs suspend’ error while creating instance:

    Failed creating instance from image: Failed to run: zfs create -s -V 10737418240 -o volmode=none -o sync=disabled main_pool/images/e9b82eedf3ea689a0c9c8363c7ceb844af9fb85fe17e647f30e2d3ab9e825f5e.block: exit status 1 (cannot create 'main_pool/images/e9b82eedf3ea689a0c9c8363c7ceb844af9fb85fe17e647f30e2d3ab9e825f5e.block': pool I/O is currently suspended)
    
  11. reboot OS → no error

  12. Still cannot create instance → pool still in suspend

  13. Since no shell access I decide to try rebooting the physical server. After reboot:

    • IncusOS no longer manageable

      • Web: inaccessible
      • CLI client:
        • (first): ‘Error: Get “https://xxx.xxx.xxx.xxx:8443/1.0”: EOF’
        • (afterwards): Unable to connect to: xxx.xxx.xxx.xxx:8443 ([dial tcp xxx.xxx.xxx.xxx:8443: connectex: No connection could be made because the target machine actively refused it.])
    • Server still pingable

I ran into a few issues here, but the point is that I decided to reboot the physical server, and afterwards I ran into the free-space issue and was unable to manage Incus.

The IncusOS server is more or less a fresh install with two storage pools, and free space shouldn’t really be an issue:

  • local_zfs (default) for images, logs, etc. → had more than 200 GB of free space
  • main_pool for instances → has ca. 3 TiB of free space

Some additional notes:

  • Before I rebooted the server I was running 202603030349
  • After the reboot it upgraded to: 202603081756 (same as OP)
  • Booting to the older version doesn’t fix the issue

I will probably reinstall the server, since there was not much on it and we are still testing IncusOS as a whole. However, I suspect that maybe a bug caused the disk space to fill up (maybe the upgrade? logs?), since there was more than enough space and I don’t see how else the local_zfs pool could be saturated so quickly.

Any ideas what could have caused this?

Kind regards
be3p9

Would definitely be good to boot a live media and use the recovery key to look around. From the screenshot, you seem to be running out of disk space on the root partition. That’s very unusual, as we don’t store much data there, even less now that we’re putting the Incus logs on the local pool.

It would be interesting to track down the space usage and attempt to clear things.
Most likely it’s either /var/lib/incus or /var/log/incus taking up the space.
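One way to track that down from the live environment, assuming the root partition is mounted at a hypothetical /mnt/incusos-root:

```shell
#!/bin/sh
# Sketch: list the biggest directories under a mounted root so the
# space hog stands out; the mount point is an assumption, adjust as
# needed (e.g. to check /var/lib/incus and /var/log/incus).

biggest_dirs() {
    root=${1:-/mnt/incusos-root}
    # -x stays on one filesystem, -k reports KiB; largest entries last
    du -x -d 2 -k "$root" 2>/dev/null | sort -n | tail -15
}

# Usage: biggest_dirs /mnt/incusos-root
```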

On the ZFS side of things, a zfs list -t all from a live media with ZFS 2.4.1 (see previous post) would give you a good idea of what’s going on there.

The first error you showed could be explained by the root disk being out of space, but the suspend and lack of recovery are more confusing, so it would be good to know what’s actually on there and how much space there currently is.

Alright. So I decided not to reinstall and to troubleshoot instead.

I managed to boot into a live media, unlock the root partition with cryptsetup luksOpen, and mount it. Interestingly, the partition is only 25G and most of it seems to be used by the main_pool.img disk file (?):

I’m still looking at the logs, but I can already say with high certainty that whatever caused the space to fill up happened before the reboot/upgrade. Here is what I’ve found so far:

This is when I attached the USB device to the VM. I’m unsure if the mlx4_core messages are relevant. The USB was attached at 13:39:14, and some 20 minutes later I get the error Mar 10 13:53:56 redactedhost.intern kernel: WARNING: Pool 'main_pool' has encountered an uncorrectable I/O failure and has been suspended.

Mar 10 13:39:14 redactedhost.intern kernel: usb 4-2: new SuperSpeed USB device number 4 using xhci_hcd
Mar 10 13:39:14 redactedhost.intern kernel: usb 4-2: New USB device found, idVendor=2174, idProduct=2100, bcdDevice= 1.00
Mar 10 13:39:14 redactedhost.intern kernel: usb 4-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Mar 10 13:39:14 redactedhost.intern kernel: usb 4-2: Product: ESD310C
Mar 10 13:39:14 redactedhost.intern kernel: usb 4-2: Manufacturer: Transcend
Mar 10 13:39:14 redactedhost.intern kernel: usb 4-2: SerialNumber: F6326339I99725870024
Mar 10 13:39:14 redactedhost.intern kernel: scsi host17: uas
Mar 10 13:39:14 redactedhost.intern kernel: scsi 17:0:0:0: Direct-Access     ESD310C  TS256GESD310C    1000 PQ: 0 ANSI: 6
Mar 10 13:39:14 redactedhost.intern kernel: sd 17:0:0:0: Attached scsi generic sg5 type 0
Mar 10 13:39:14 redactedhost.intern kernel: sd 17:0:0:0: [sde] 500118192 512-byte logical blocks: (256 GB/238 GiB)
Mar 10 13:39:14 redactedhost.intern kernel: sd 17:0:0:0: [sde] Write Protect is off
Mar 10 13:39:14 redactedhost.intern kernel: sd 17:0:0:0: [sde] Mode Sense: 43 00 00 00
Mar 10 13:39:14 redactedhost.intern kernel: sd 17:0:0:0: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Mar 10 13:39:14 redactedhost.intern kernel:  sde: sde1
Mar 10 13:39:14 redactedhost.intern kernel: sd 17:0:0:0: [sde] Attached SCSI disk
Mar 10 13:40:48 redactedhost.intern incusd[259051]: time="2026-03-10T14:40:48+01:00" level=warning msg="Failed getting exec control websocket reader, killing command" PID=0 err="websocket: close 1005 (no status)" instance=IncusManageVM interactive=true project=default
Mar 10 13:41:07 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
Mar 10 13:41:24 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
Mar 10 13:41:47 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
Mar 10 13:43:24 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
Mar 10 13:44:16 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
Mar 10 13:44:32 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
Mar 10 13:44:44 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
Mar 10 13:45:40 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
Mar 10 13:46:27 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
Mar 10 13:46:57 redactedhost.intern kernel: kauditd_printk_skb: 2 callbacks suppressed
Mar 10 13:46:57 redactedhost.intern kernel: audit: type=1400 audit(1773150417.876:1802): apparmor="DENIED" operation="open" class="file" profile="incus-IncusManageVM_</var/lib/incus>" name="/dev/bus/usb/" pid=309384 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=985 ouid=0
Mar 10 13:46:57 redactedhost.intern kernel: audit: type=1300 audit(1773150417.876:1802): arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=7f6a0d38237b a2=90800 a3=0 items=0 ppid=1 pid=309384 auid=4294967295 uid=985 gid=985 euid=985 suid=985 fsuid=985 egid=985 sgid=985 fsgid=985 tty=(none) ses=4294967295 comm="qemu-system-x86" exe="/opt/incus/bin/qemu-system-x86_64" subj=incus-IncusManageVM_</var/lib/incus> key=(null)
Mar 10 13:46:57 redactedhost.intern kernel: audit: type=1327 audit(1773150417.876:1802): proctitle=2F6F70742F696E6375732F62696E2F71656D752D73797374656D2D7838365F3634002D53002D6E616D6500496E6375734D616E616765564D002D757569640030633032626663652D343232382D343961342D383430352D353938323062323237303061002D6461656D6F6E697A65002D63707500686F73742C68765F70617373
Mar 10 13:46:57 redactedhost.intern kernel: audit: type=1400 audit(1773150417.877:1803): apparmor="DENIED" operation="open" class="file" profile="incus-IncusManageVM_</var/lib/incus>" name="/dev/" pid=309384 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=985 ouid=0
Mar 10 13:46:57 redactedhost.intern kernel: audit: type=1300 audit(1773150417.877:1803): arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=7f6a0d382388 a2=90800 a3=0 items=0 ppid=1 pid=309384 auid=4294967295 uid=985 gid=985 euid=985 suid=985 fsuid=985 egid=985 sgid=985 fsgid=985 tty=(none) ses=4294967295 comm="qemu-system-x86" exe="/opt/incus/bin/qemu-system-x86_64" subj=incus-IncusManageVM_</var/lib/incus> key=(null)
Mar 10 13:46:57 redactedhost.intern kernel: audit: type=1327 audit(1773150417.877:1803): proctitle=2F6F70742F696E6375732F62696E2F71656D752D73797374656D2D7838365F3634002D53002D6E616D6500496E6375734D616E616765564D002D757569640030633032626663652D343232382D343961342D383430352D353938323062323237303061002D6461656D6F6E697A65002D63707500686F73742C68765F70617373
Mar 10 13:47:10 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
Mar 10 13:47:26 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
Mar 10 13:50:09 redactedhost.intern kernel: mlx4_core 0000:81:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update

## I/O Error and call trace
Mar 10 13:53:56 redactedhost.intern kernel: WARNING: Pool 'main_pool' has encountered an uncorrectable I/O failure and has been suspended.
Mar 10 13:57:20 redactedhost.intern kernel: INFO: task zvol_tq-1:3455 blocked for more than 122 seconds.
Mar 10 13:57:20 redactedhost.intern kernel:       Tainted: P           O        6.18.14-zabbly+ #debian13
Mar 10 13:57:20 redactedhost.intern kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 10 13:57:20 redactedhost.intern kernel: task:zvol_tq-1       state:D stack:0     pid:3455  tgid:3455  ppid:2      task_flags:0x288040 flags:0x00080000
Mar 10 13:57:20 redactedhost.intern kernel: Call Trace:
Mar 10 13:57:20 redactedhost.intern kernel:  <TASK>
Mar 10 13:57:20 redactedhost.intern kernel:  __schedule+0x468/0x1310
Mar 10 13:57:20 redactedhost.intern kernel:  schedule+0x27/0xf0
Mar 10 13:57:20 redactedhost.intern kernel:  io_schedule+0x4c/0x80
Mar 10 13:57:20 redactedhost.intern kernel:  cv_wait_common+0xb0/0x140 [spl]
Mar 10 13:57:20 redactedhost.intern kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
Mar 10 13:57:20 redactedhost.intern kernel:  __cv_wait_io+0x18/0x30 [spl]
Mar 10 13:57:20 redactedhost.intern kernel:  txg_wait_synced_flags+0xd9/0x160 [zfs]
Mar 10 13:57:20 redactedhost.intern kernel:  dmu_tx_wait+0x249/0x460 [zfs]
Mar 10 13:57:20 redactedhost.intern kernel:  dmu_tx_assign+0x3ce/0x490 [zfs]
Mar 10 13:57:20 redactedhost.intern kernel:  zvol_write+0x212/0xa00 [zfs]
Mar 10 13:57:20 redactedhost.intern kernel:  zvol_write_task+0x12/0x30 [zfs]
Mar 10 13:57:20 redactedhost.intern kernel:  taskq_thread+0x34c/0x720 [spl]
Mar 10 13:57:20 redactedhost.intern kernel:  ? srso_return_thunk+0x5/0x5f
Mar 10 13:57:20 redactedhost.intern kernel:  ? __pfx_default_wake_function+0x10/0x10
Mar 10 13:57:20 redactedhost.intern kernel:  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
Mar 10 13:57:20 redactedhost.intern kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
Mar 10 13:57:20 redactedhost.intern kernel:  kthread+0x10b/0x220
Mar 10 13:57:20 redactedhost.intern kernel:  ? __pfx_kthread+0x10/0x10
Mar 10 13:57:20 redactedhost.intern kernel:  ret_from_fork+0x1ec/0x220
Mar 10 13:57:20 redactedhost.intern kernel:  ? __pfx_kthread+0x10/0x10
Mar 10 13:57:20 redactedhost.intern kernel:  ret_from_fork_asm+0x1a/0x30
Mar 10 13:57:20 redactedhost.intern kernel:  </TASK>

After that I get a lot of ZFS errors, because the pool is suspended and IncusOS cannot run zfs commands. That’s when I decided to restart the OS through the Web UI (not the physical server).

## start of os shutdown 
Mar 10 14:57:29 redactedhost.intern incus-osd[1865]: 2026-03-10 15:57:29 INFO System is shutting down version=202603030349
Mar 10 14:57:29 redactedhost.intern incus-osd[1865]: 2026-03-10 15:57:29 INFO Stopping application name=incus version=202603081756
Mar 10 14:57:29 redactedhost.intern systemd[1]: Stopping incus-startup.service - Incus - Startup check...

## free space error during os shutdown 
Mar 10 14:57:31 redactedhost.intern systemd-networkd[15997]: uplink.124: Lost carrier
Mar 10 14:57:31 redactedhost.intern incusd[259051]: time="2026-03-10T15:57:31+01:00" level=warning msg="Failed to dump database file db.bin-wal: write /var/lib/incus/database/global/db.bin-wal: no space left on device"
Mar 10 14:57:31 redactedhost.intern systemd[1]: var-lib-incus-devices-IncusManageVM-config.mount.mount: Deactivated successfully.


If I remember correctly, it was during those 20 minutes that I started the copy operation between the USB device and the VM IncusManageVM. I should mention that the file I was trying to copy was around 30 GB (USB → VM), so bigger than the 25 GB root partition.

IncusManageVM was on the main_pool, so it shouldn’t take any space on root, right?
Besides, I remember that the copy operation transferred around 59 MB (checked with ls -la and du) and then froze (presumably because of the suspend).

This was the device config for the instance:

devices:
  root:
    type: disk
    path: /
    pool: main_pool
    size: 500GiB
  eth0:
    type: nic
    network: uplink.666
  usb1:
    attached: 'true'
    busnum: '4'
    devnum: '4'
    type: usb

Okay.
So after looking a bit more into it, talking to a colleague about ZFS, and thinking about it...

What most likely happened is that I erroneously created main_pool on root (I probably forgot to add a source), and while testing I filled the space with different images, instances, and lastly the copy operation.

If I had set up the pool correctly (on the 3.5 TiB disk), there shouldn’t even be a main_pool.img file under /var/lib/incus/disk.

So... user error.

Apologies for the spam and hijacking the thread.