Incus migration and failover

We have the following problem with Incus in cluster mode.

We have 4 servers: 3 workers + 1 TrueNAS box as shared storage.

If one of the workers goes down, its VMs stay in the “Error” state and are never migrated to another server automatically. If we manually migrate an errored VM to another server, the VM runs without errors, but it then becomes impossible to migrate that VM to any other host.

We have two questions:

  1. Why doesn’t Incus migrate VMs automatically when a host goes down?
  2. Why, after manually migrating a VM, is it impossible to migrate it again to another host?

Update: I can’t find any logs about migration problems, failover, or operation errors after one of the workers goes down.

Oh, okay, I forgot to set cluster.healing_threshold to a value greater than zero. But the problem with the manually moved VM still exists.
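For anyone hitting the same thing: automatic evacuation of instances from offline cluster members is controlled by the cluster.healing_threshold server option (the number of seconds a member can be offline before its instances are healed; 0 disables healing). The value 30 below is just an example:

```shell
# Enable cluster healing: move instances off a member once it
# has been offline for 30 seconds (0 = healing disabled).
incus config set cluster.healing_threshold 30

# Verify the setting.
incus config get cluster.healing_threshold
```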

New symptoms:

Jan 23 15:03:59 m1-srv3 incusd[1098932]: time="2026-01-23T15:03:59+03:00" level=error msg="Failed migration on target" clusterMoveSourceName=demo3 err="Failed to run: truenas_incus_ctl share iscsi locate --target-prefix=incus --create --parsable zpool-shared/truenas_shared/virtual-machines/demo3.block: exit status 1 (Error: \niscsiadm: This command will remove the record [iface: default, target: iqn.2026-10.org.truenas.ctl:incus:zpool-shared:truenas-shared:virtual-machines:demo-block, portal: 10.11.10.249,3260], but a session is using it. Logout session then rerun command to remove record.\n\nThe remote iscsitarget service is running. It may need to be restarted with:\nservice restart iscsitarget)" instance=demo3 live=true project=default push=false

This is what the broken migration operation looks like:

id: 056659e9-95b1-4677-890d-1645e4d8bde7
class: task
description: Migrating instance
created_at: 2026-01-23T16:47:37.065246426+03:00
updated_at: 2026-01-23T16:50:51.880347451+03:00
status: Running
status_code: 103
resources:
  instances:
  - /1.0/instances/demo3
metadata:
  live_migrate_instance_progress: 'Live migration: 1.08GB remaining (0B/s) (0% CPU
    throttle)'
  progress:
    percent: "0"
    processed: "806886"
    speed: "0"
    stage: live_migrate_instance
may_cancel: false
err: ""
location: m1-srv2

And we’re stuck: nothing gets migrated, and the operation is impossible to cancel.
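For reference, this is roughly how we inspected the stuck operation (the UUID is the one from the dump above; with may_cancel: false the cancel attempt is likely to be refused, and restarting incusd on the affected member may be the only way to clear it):

```shell
# List in-flight operations across the cluster.
incus operation list

# Inspect the stuck migration by its UUID.
incus operation show 056659e9-95b1-4677-890d-1645e4d8bde7

# Attempt to cancel it (refused when may_cancel is false).
incus operation delete 056659e9-95b1-4677-890d-1645e4d8bde7
```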

So that’s TrueNAS using the native TrueNAS storage driver?

yes

After a failed migration in the failover scenario, it’s also impossible to create new VMs.

So… it looks like iSCSI + TrueNAS is broken.

@stgraber, a small postmortem:

open-iscsi has a long-standing bug (iscsid: lost sessions and unable to logout · Issue #228 · open-iscsi/open-iscsi · GitHub). If one server in the cluster fails (instant shutdown, broken network), some iSCSI sessions are lost while others shut down correctly.

When the broken server comes back up, Incus can’t do anything with the VMs that have these invalid “ghost” iSCSI sessions.
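As a workaround sketch (outside of Incus itself), a ghost session can usually be cleaned up with the standard iscsiadm commands; the target IQN and portal below are the ones from the error log earlier and would need adjusting for other volumes:

```shell
# List current iSCSI sessions to spot the stale one.
iscsiadm -m session

# Log out of the ghost session for the affected target.
iscsiadm -m node \
  -T iqn.2026-10.org.truenas.ctl:incus:zpool-shared:truenas-shared:virtual-machines:demo-block \
  -p 10.11.10.249:3260 --logout

# Remove the now-unused node record so it can be recreated cleanly.
iscsiadm -m node -o delete \
  -T iqn.2026-10.org.truenas.ctl:incus:zpool-shared:truenas-shared:virtual-machines:demo-block \
  -p 10.11.10.249:3260
```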

Ah, interesting. I wonder if allowing concurrent access to all the iscsi LUNs would help deal with that somehow. Though that’s all logic that’s outside of Incus and in the TrueNAS bridge code instead (GitHub - truenas/truenas_incus_ctl: TrueNAS CLI Admin tool for Incus and Consumers)

TrueNAS also supports NVME over TCP, so that could be an alternative.

But Incus doesn’t support NVMe over TCP, right?

It doesn’t matter here. The project I linked to above is what interacts with TrueNAS, TrueNAS can export over either iSCSI or NVME-over-TCP. All Linux systems can connect to either iSCSI or NVME-over-TCP, so the project I linked to could be updated to use NVME-over-TCP without needing any actual change to the TrueNAS driver in Incus.
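For illustration, any Linux initiator can reach an NVMe-over-TCP export with the standard nvme-cli tool; the address, port, and subsystem NQN below are placeholders, not values from this setup:

```shell
# Load the NVMe-over-TCP transport module.
modprobe nvme-tcp

# Discover subsystems exported by the target (placeholder address).
nvme discover -t tcp -a 10.11.10.249 -s 4420

# Connect to a discovered subsystem by its NQN (placeholder NQN).
nvme connect -t tcp -a 10.11.10.249 -s 4420 \
  -n nqn.2026-10.org.truenas.ctl:example-subsystem
```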


> without needing any actual change to the TrueNAS driver in Incus

Yes and no.

Without forking Incus, it’s impossible to implement NVMe over TCP in the truenas_incus_ctl project alone.

Ah, interesting, I thought that they had put all that stuff in the separate binary.

Anyway, most of the work would need to be done in the external tool, then once that’s done, the Incus change would be pretty trivial.

Okay, but what about the current status of TrueNAS + iSCSI in the Incus project? Is it stable? For sure?