VM is not automatically restarted on another node after node failure

Is there any dedicated config option controlling the behavior of a VM after a node failure?
Currently, after a node failure the VM is moved to another node but is not started automatically; it is just left in a stopped state.
What needs to be done to make it start automatically, like in classical HA?
PS: Migration of this VM between hosts works as expected; the VM is moved live.

These are the two VM and cluster settings I have:
incus config get vm1 migration.stateful
true
incus config get cluster.healing_threshold
1
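
For reference, a minimal sketch of how those two values are typically set (vm1 is the instance above):

# per-instance setting: allow stateful (live) migration of the VM
incus config set vm1 migration.stateful=true
# cluster-wide setting: enable automatic healing of instances from offline members
incus config set cluster.healing_threshold=1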

Thanks
Mat

cluster.healing_threshold is definitely the one you need for that.

So it looks like part of the automated recovery happened but not quite all of it.
Can you check /var/log/incus/incusd.log on the target server to see if there’s an error related to starting the instance back up?
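
Something along these lines should surface any start errors for that instance (default log location; vm1 is the instance name from above):

grep -i vm1 /var/log/incus/incusd.log
tail -n 50 /var/log/incus/incusd.log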

Hi Stephane

There seems to be nothing in the logs related to the instance, just warnings about lost communication with the host at IP x.x.x.x (that host was powered off to simulate a host failure).
y.y.y.y is the other host, the database leader.

time="2024-07-10T10:20:55Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:20:59Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:21:12Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:21:22Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp y.y.y.y:8443->x.x.x.x:38176: read: connection timed out" local="y.y.y.y:8443" name=dqlite remote="x.x.x.x:38176"
time="2024-07-10T10:21:22Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp y.y.y.y:8443->x.x.x.x:38192: read: connection timed out" local="y.y.y.y:8443" name=dqlite remote="x.x.x.x:38192"
time="2024-07-10T10:21:23Z" level=warning msg="Dqlite proxy failed" err="first: local -> remote: write tcp y.y.y.y:51882->x.x.x.x:8443: write: connection timed out" local="y.y.y.y:51882" name=raft remote="x.x.x.x:8443"
time="2024-07-10T10:21:23Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:21:30Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:21:41Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:21:49Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:22:05Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:22:08Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"
time="2024-07-10T10:22:17Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"
time="2024-07-10T10:22:33Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"
time="2024-07-10T10:22:42Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"
time="2024-07-10T10:22:54Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:23:03Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"
time="2024-07-10T10:23:15Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:23:19Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"

Regards
Mat

Okay, I’ll try to reproduce the issue here.
What storage are you using?

Thanks Stephane

Using iSCSI configured on a Linux box.
I actually have not one but two separate clusters with separate storage showing the same behavior (both using iSCSI).

So it’s iSCSI with the clustered LVM driver?

Yes.

incus storage list
+------+------------+-------------+---------+---------+
| NAME | DRIVER     | DESCRIPTION | USED BY | STATE   |
+------+------------+-------------+---------+---------+
| clvm | lvmcluster |             | 4       | CREATED |
+------+------------+-------------+---------+---------+
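
For completeness, a rough sketch of how a pool like this is typically created in a cluster; the device path and member names below are examples only, not the exact commands used here:

# pending pool entry on each cluster member, pointing at the shared iSCSI-backed device (example value)
incus storage create clvm lvmcluster source=/dev/sdb --target node1
incus storage create clvm lvmcluster source=/dev/sdb --target node2
# finalize the pool once every member has a pending entry
incus storage create clvm lvmcluster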

Managed to reproduce a failure here, looking into it now.

Incus 6.3 will have a fix. I confirmed that my LVM cluster test environment does properly recover instances now. You will most likely need to restart or live-migrate your instances once before they can properly auto-recover, as the LVM lock mode needs to be changed and that is typically done during initial startup or migration.
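
For anyone hitting the same thing, that one-time restart or live migration would look roughly like this (vm1 and node2 are example names):

# either restart the VM once
incus restart vm1
# or live-migrate it to another cluster member
incus move vm1 --target node2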

Thanks, Stephane!