Incus migration and failover

We have the following problem with Incus in cluster mode.

We have 4 servers: 3 workers + 1 TrueNAS box as shared storage.

If one of the workers goes down, its VMs stay in the “Error” state and are never migrated to another server automatically. If we manually migrate an errored VM to another server, the VM runs without errors, but it then becomes impossible to migrate that VM to any other host.

We have two questions:

  1. Why doesn’t Incus migrate VMs automatically when a host goes down?
  2. Why, after manually migrating a VM, is it impossible to migrate it again to another host?

Update: I can’t find any logs about migration problems, failover, or operation errors after one of the workers goes down.

Oh, okay, I forgot to set cluster.healing_threshold to a value greater than zero. But the problem with the manually moved VM still exists.
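For anyone hitting the same thing: automatic evacuation of instances from offline cluster members is controlled by the cluster.healing_threshold server option (the number of seconds a member can be offline before its instances are healed; 0 disables healing). The value 30 below is just an example:

```shell
# Enable cluster healing: move instances off a member once it
# has been offline for 30 seconds (0 = healing disabled).
incus config set cluster.healing_threshold 30

# Verify the setting.
incus config get cluster.healing_threshold
```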

New symptoms:

Jan 23 15:03:59 m1-srv3 incusd[1098932]: time="2026-01-23T15:03:59+03:00" level=error msg="Failed migration on target" clusterMoveSourceName=demo3 err="Failed to run: truenas_incus_ctl share iscsi locate --target-prefix=incus --create --parsable zpool-shared/truenas_shared/virtual-machines/demo3.block: exit status 1 (Error: \niscsiadm: This command will remove the record [iface: default, target: iqn.2026-10.org.truenas.ctl:incus:zpool-shared:truenas-shared:virtual-machines:demo-block, portal: 10.11.10.249,3260], but a session is using it. Logout session then rerun command to remove record.\n\nThe remote iscsitarget service is running. It may need to be restarted with:\nservice restart iscsitarget)" instance=demo3 live=true project=default push=false

This is what the broken migration operation looks like:

id: 056659e9-95b1-4677-890d-1645e4d8bde7
class: task
description: Migrating instance
created_at: 2026-01-23T16:47:37.065246426+03:00
updated_at: 2026-01-23T16:50:51.880347451+03:00
status: Running
status_code: 103
resources:
  instances:
  - /1.0/instances/demo3
metadata:
  live_migrate_instance_progress: 'Live migration: 1.08GB remaining (0B/s) (0% CPU
    throttle)'
  progress:
    percent: "0"
    processed: "806886"
    speed: "0"
    stage: live_migrate_instance
may_cancel: false
err: ""
location: m1-srv2

And we’re stuck: nothing gets migrated, and the operation is impossible to cancel.
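For reference, this is roughly how we inspected the stuck operation (the UUID is the one from the dump above; with may_cancel: false the cancel attempt is likely to be refused, and restarting incusd on the affected member may be the only way to clear it):

```shell
# List in-flight operations across the cluster.
incus operation list

# Inspect the stuck migration by its UUID.
incus operation show 056659e9-95b1-4677-890d-1645e4d8bde7

# Attempt to cancel it (refused when may_cancel is false).
incus operation delete 056659e9-95b1-4677-890d-1645e4d8bde7
```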

So that’s TrueNAS using the native TrueNAS storage driver?

yes

After a failed migration in the failover scenario, it’s also impossible to create new VMs.

So… it looks like iSCSI + TrueNAS is broken.

@stgraber, a small postmortem:

open-iscsi has a long-standing bug (iscsid: lost sessions and unable to logout · Issue #228 · open-iscsi/open-iscsi · GitHub). If one server in the cluster fails (instant shutdown, broken network), some iSCSI sessions are lost while others shut down correctly.

When the broken server comes back up, Incus can’t do anything with the VMs that have these invalid “ghost” iSCSI sessions.
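As a workaround sketch (outside of Incus itself), a ghost session can usually be cleaned up with the standard iscsiadm commands; the target IQN and portal below are the ones from the error log earlier and would need adjusting for other volumes:

```shell
# List current iSCSI sessions to spot the stale one.
iscsiadm -m session

# Log out of the ghost session for the affected target.
iscsiadm -m node \
  -T iqn.2026-10.org.truenas.ctl:incus:zpool-shared:truenas-shared:virtual-machines:demo-block \
  -p 10.11.10.249:3260 --logout

# Remove the now-unused node record so it can be recreated cleanly.
iscsiadm -m node -o delete \
  -T iqn.2026-10.org.truenas.ctl:incus:zpool-shared:truenas-shared:virtual-machines:demo-block \
  -p 10.11.10.249:3260
```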

Ah, interesting. I wonder if allowing concurrent access to all the iscsi LUNs would help deal with that somehow. Though that’s all logic that’s outside of Incus and in the TrueNAS bridge code instead (GitHub - truenas/truenas_incus_ctl: TrueNAS CLI Admin tool for Incus and Consumers)

TrueNAS also supports NVME over TCP, so that could be an alternative.

But Incus doesn’t support NVMe over TCP, right?

It doesn’t matter here. The project I linked to above is what interacts with TrueNAS, TrueNAS can export over either iSCSI or NVME-over-TCP. All Linux systems can connect to either iSCSI or NVME-over-TCP, so the project I linked to could be updated to use NVME-over-TCP without needing any actual change to the TrueNAS driver in Incus.
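For illustration, any Linux initiator can reach an NVMe-over-TCP export with the standard nvme-cli tool; the address, port, and subsystem NQN below are placeholders, not values from this setup:

```shell
# Load the NVMe-over-TCP transport module.
modprobe nvme-tcp

# Discover subsystems exported by the target (placeholder address).
nvme discover -t tcp -a 10.11.10.249 -s 4420

# Connect to a discovered subsystem by its NQN (placeholder NQN).
nvme connect -t tcp -a 10.11.10.249 -s 4420 \
  -n nqn.2026-10.org.truenas.ctl:example-subsystem
```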


> without needing any actual change to the TrueNAS driver in Incus

Yes and no.

Without forking Incus, it’s impossible to implement NVMe over TCP in the truenas_incus_ctl project alone.

Ah, interesting, I thought that they had put all that stuff in the separate binary.

Anyway, most of the work would need to be done in the external tool, then once that’s done, the Incus change would be pretty trivial.

Okay, but what about the current status of TrueNAS + iSCSI in the Incus project? Is it stable? For sure?