max107
(Maxim Falaleev)
January 23, 2026, 11:38am
1
We have the following problem with Incus in cluster mode.
We have four servers: three workers plus one TrueNAS box as shared storage.
If one of the workers goes down, its VMs stay in the "Error" state and are not migrated to another server automatically. However, if we manually migrate an errored VM to another server, the VM runs without errors, but it then becomes impossible to migrate that VM to any other host.
We have two questions:
Why does Incus not migrate VMs automatically when a host system goes down?
Why is it impossible to migrate a VM to another host again after migrating it manually?
Update: I can't find any logs about migration problems, failover, or operation errors after one of the workers goes down.
max107
(Maxim Falaleev)
January 23, 2026, 12:01pm
2
Oh, okay, I forgot to set cluster.healing_threshold to a value greater than zero. But the problem with the manually moved VM still exists.
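For reference, a minimal sketch of enabling automatic cluster healing. The threshold is the number of seconds a cluster member can be offline before its instances are evacuated; the value of 30 here is illustrative, not a recommendation:

```shell
# Evacuate instances from a cluster member after it has been offline
# for 30 seconds (0, the default, disables automatic healing).
incus config set cluster.healing_threshold 30

# Verify the setting.
incus config get cluster.healing_threshold
```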
max107
(Maxim Falaleev)
January 23, 2026, 12:10pm
3
New symptoms:
Jan 23 15:03:59 m1-srv3 incusd[1098932]: time="2026-01-23T15:03:59+03:00" level=error msg="Failed migration on target" clusterMoveSourceName=demo3 err="Failed to run: truenas_incus_ctl share iscsi locate --target-prefix=incus --create --parsable zpool-shared/truenas_shared/virtual-machines/demo3.block: exit status 1 (Error: \niscsiadm: This command will remove the record [iface: default, target: iqn.2026-10.org.truenas.ctl:incus:zpool-shared:truenas-shared:virtual-machines:demo-block, portal: 10.11.10.249,3260], but a session is using it. Logout session then rerun command to remove record.\n\nThe remote iscsitarget service is running. It may need to be restarted with:\nservice restart iscsitarget)" instance=demo3 live=true project=default push=false
max107
(Maxim Falaleev)
January 23, 2026, 1:51pm
4
Here is what the broken migration looks like:
id: 056659e9-95b1-4677-890d-1645e4d8bde7
class: task
description: Migrating instance
created_at: 2026-01-23T16:47:37.065246426+03:00
updated_at: 2026-01-23T16:50:51.880347451+03:00
status: Running
status_code: 103
resources:
  instances:
  - /1.0/instances/demo3
metadata:
  live_migrate_instance_progress: 'Live migration: 1.08GB remaining (0B/s) (0% CPU throttle)'
  progress:
    percent: "0"
    processed: "806886"
    speed: "0"
    stage: live_migrate_instance
may_cancel: false
err: ""
location: m1-srv2
And there we're stuck: nothing gets migrated, and the operation is impossible to cancel.
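For anyone hitting the same state, the stuck operation can at least be inspected (and a cancel attempted) from the Incus CLI; the UUID below is the one from the dump above:

```shell
# List running operations across the cluster.
incus operation list

# Show the full state of the stuck migration.
incus operation show 056659e9-95b1-4677-890d-1645e4d8bde7

# Attempt to cancel it (this fails while may_cancel is false).
incus operation delete 056659e9-95b1-4677-890d-1645e4d8bde7
```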
stgraber
(Stéphane Graber)
January 23, 2026, 7:47pm
5
So that’s TrueNAS using the native TrueNAS storage driver?
max107
(Maxim Falaleev)
January 26, 2026, 5:30pm
7
After a failed migration in the failover scenario, it is impossible to create a new VM.
So... it looks like iSCSI + TrueNAS is broken.
max107
(Maxim Falaleev)
January 27, 2026, 2:42pm
8
@stgraber, a small postmortem:
open-iscsi has a long-standing bug ("iscsid: lost sessions and unable to logout", Issue #228 in open-iscsi/open-iscsi on GitHub). If one server in the cluster fails (instant shutdown, broken network), some iSCSI sessions are lost while others shut down correctly.
When the broken server comes back up, Incus can't do anything with a VM that has an invalid "ghost" iSCSI session.
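A rough sketch of how such ghost sessions can be inspected and force-cleaned with open-iscsi's standard tooling. The target IQN and portal below are the ones from the error earlier in this thread; adjust them to your setup, and note that forcing a logout on a session backing a running VM will break its disk:

```shell
# List current iSCSI sessions; a "ghost" session still shows up here
# even though the backing connection is gone.
iscsiadm -m session

# Log out of the stale session for a specific target/portal.
iscsiadm -m node \
  -T iqn.2026-10.org.truenas.ctl:incus:zpool-shared:truenas-shared:virtual-machines:demo-block \
  -p 10.11.10.249:3260 --logout

# Remove the node record so the target can be re-created cleanly.
iscsiadm -m node \
  -T iqn.2026-10.org.truenas.ctl:incus:zpool-shared:truenas-shared:virtual-machines:demo-block \
  -p 10.11.10.249:3260 -o delete
```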
stgraber
(Stéphane Graber)
January 27, 2026, 4:03pm
9
Ah, interesting. I wonder if allowing concurrent access to all the iscsi LUNs would help deal with that somehow. Though that's all logic that's outside of Incus and in the TrueNAS bridge code instead (truenas/truenas_incus_ctl on GitHub: TrueNAS CLI Admin tool for Incus and Consumers).
TrueNAS also supports NVME over TCP, so that could be an alternative.
max107
(Maxim Falaleev)
January 27, 2026, 7:50pm
10
stgraber:
TrueNAS also supports NVME over TCP, so that could be an alternative.
But Incus does not support NVMe over TCP, right?
stgraber
(Stéphane Graber)
January 27, 2026, 11:20pm
11
It doesn’t matter here. The project I linked to above is what interacts with TrueNAS, TrueNAS can export over either iSCSI or NVME-over-TCP. All Linux systems can connect to either iSCSI or NVME-over-TCP, so the project I linked to could be updated to use NVME-over-TCP without needing any actual change to the TrueNAS driver in Incus.
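For illustration, connecting to an NVMe-over-TCP export from a Linux initiator uses nvme-cli. The address below is the portal IP from this thread and the NQN is a placeholder; neither is anything TrueNAS or truenas_incus_ctl currently sets up:

```shell
# Load the NVMe/TCP transport module.
modprobe nvme-tcp

# Discover subsystems exported by the target.
nvme discover -t tcp -a 10.11.10.249 -s 4420

# Connect to a specific subsystem by its NQN.
nvme connect -t tcp -a 10.11.10.249 -s 4420 \
  -n nqn.2026-10.org.truenas.ctl:example-subsystem

# The namespace then appears as a regular block device.
nvme list
```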
max107
(Maxim Falaleev)
January 28, 2026, 7:23am
12
stgraber:
without needing any actual change to the TrueNAS driver in Incus
    out, err := d.runTool(args...)
    _ = out
    if err != nil {
        return err
    }

    return nil
}

func (d *truenas) verifyIscsiFunctionality(ensureSetup bool) error {
    args := []string{"--parsable"}

    if ensureSetup {
        args = append(args, "--setup")
    }

    _, err := d.runIscsiCmd("test", args...)
    if err != nil {
        return err
    }
And yes, and no:
        // the daemon *should've* re-opened the connection, but as of 0.7.2 it doesn't, re-trying should force the connection to be re-opened.
        d.logger.Error("TrueNAS Tool POST failed with socket EOF, will retry", logger.Ctx{"err": err})
        out, err = subprocess.RunCommand(tnToolName, args...)
    }

    // will allow us to prepend args
    return out, err
}

// runIscsiCmd runs the supplied args against the tools `share iscsi` command whilst applying the appropriate iscsi global flags.
func (d *truenas) runIscsiCmd(cmd string, args ...string) (string, error) {
    baseArgs := []string{"share", "iscsi", cmd}
    baseArgs = append(baseArgs, "--target-prefix=incus")

    if d.config["truenas.portal"] != "" {
        baseArgs = append(baseArgs, "--portal", d.config["truenas.portal"])
    }

    if d.config["truenas.initiator"] != "" {
        baseArgs = append(baseArgs, "--initiator", d.config["truenas.initiator"])
Without forking Incus, it is impossible to implement NVMe over TCP only in the truenas_incus_ctl project.
stgraber
(Stéphane Graber)
January 28, 2026, 7:29am
13
Ah, interesting, I thought that they had put all that stuff in the separate binary.
Anyway, most of the work would need to be done in the external tool, then once that’s done, the Incus change would be pretty trivial.
max107
(Maxim Falaleev)
January 28, 2026, 8:11am
14
Okay, but what about the current status of TrueNAS + iSCSI in the Incus project? Is it stable? For sure?