VM is not automatically restarted on another node after node failure

Is there any dedicated config option controlling the behavior of a VM after a node failure?
Currently, after a node failure the VM is moved to another node but is not started automatically; it is just left in a stopped state.
What needs to be done to make it start automatically, like in classical HA?
PS: Migration of this VM between hosts works as expected; the VM is moved live.

These are the two VM and cluster settings I have:
incus config get vm1 migration.stateful
true
incus config get cluster.healing_threshold
1
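
For reference, a minimal sketch of how those two values are typically set (vm1 is the instance above):

# per-instance setting: allow stateful (live) migration of the VM
incus config set vm1 migration.stateful=true
# cluster-wide setting: enable automatic healing of instances from offline members
incus config set cluster.healing_threshold=1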

Thanks
Mat

cluster.healing_threshold is definitely the one you need for that.

So it looks like part of the automated recovery happened but not quite all of it.
Can you check /var/log/incus/incusd.log on the target server to see if there’s an error related to starting the instance back up?
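
Something along these lines should surface any start errors for that instance (default log location; vm1 is the instance name from above):

grep -i vm1 /var/log/incus/incusd.log
tail -n 50 /var/log/incus/incusd.log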

Hi Stephane

There seems to be nothing in the logs related to the instance, just warnings about lost communication with the host at IP x.x.x.x (that host was powered off to simulate a host failure).
y.y.y.y is the other host, the database leader.

time="2024-07-10T10:20:55Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:20:59Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:21:12Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:21:22Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp y.y.y.y:8443->x.x.x.x:38176: read: connection timed out" local="y.y.y.y:8443" name=dqlite remote="x.x.x.x:38176"
time="2024-07-10T10:21:22Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp y.y.y.y:8443->x.x.x.x:38192: read: connection timed out" local="y.y.y.y:8443" name=dqlite remote="x.x.x.x:38192"
time="2024-07-10T10:21:23Z" level=warning msg="Dqlite proxy failed" err="first: local -> remote: write tcp y.y.y.y:51882->x.x.x.x:8443: write: connection timed out" local="y.y.y.y:51882" name=raft remote="x.x.x.x:8443"
time="2024-07-10T10:21:23Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:21:30Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:21:41Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:21:49Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:22:05Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:22:08Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"
time="2024-07-10T10:22:17Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"
time="2024-07-10T10:22:33Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"
time="2024-07-10T10:22:42Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"
time="2024-07-10T10:22:54Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:23:03Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"
time="2024-07-10T10:23:15Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="x.x.x.x:8443"
time="2024-07-10T10:23:19Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://x.x.x.x:8443/internal/database\": dial tcp x.x.x.x:8443: connect: no route to host" remote="x.x.x.x:8443"

Regards
Mat

Okay, I’ll try to reproduce the issue here.
What storage are you using?

Thanks Stephane

Using iSCSI configured on a Linux box.
I actually have not one but two separate clusters with separate storage showing the same behavior (both using iSCSI).

So it’s iSCSI with the clustered LVM driver?

Yes.

incus storage list
+------+------------+-------------+---------+---------+
| NAME | DRIVER     | DESCRIPTION | USED BY | STATE   |
+------+------------+-------------+---------+---------+
| clvm | lvmcluster |             | 4       | CREATED |
+------+------------+-------------+---------+---------+
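
For completeness, a rough sketch of how a pool like this is typically created in a cluster; the device path and member names below are examples only, not the exact commands used here:

# pending pool entry on each cluster member, pointing at the shared iSCSI-backed device (example value)
incus storage create clvm lvmcluster source=/dev/sdb --target node1
incus storage create clvm lvmcluster source=/dev/sdb --target node2
# finalize the pool once every member has a pending entry
incus storage create clvm lvmcluster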

Managed to reproduce a failure here, looking into it now.

Incus 6.3 will have a fix. I confirmed that my LVM cluster test environment does properly recover instances now. You will most likely need to restart or live-migrate your instances once before they can properly auto-recover, as the LVM lock mode needs to be changed and that is typically done during initial startup or migration.
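
For anyone hitting the same thing, that one-time restart or live migration would look roughly like this (vm1 and node2 are example names):

# either restart the VM once
incus restart vm1
# or live-migrate it to another cluster member
incus move vm1 --target node2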

Thanks, Stephane!