VM instances not failing over when cluster node is evacuated

I have a 3-node cluster with the following parameter set:

cluster.healing_threshold=10
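
For reference, the setting was applied with the usual server config command (roughly):

```
# enable automatic healing: how long a member can be offline before its
# instances are considered for recovery on the remaining members
incus config set cluster.healing_threshold=10
```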

When I reboot the node where a VM is running, the node is evacuated; however, the VM instance is stopped and does not start on the remaining cluster nodes.
Am I missing something?

Can you show the output of `incus config show --expanded NAME` for the instance?

Automatic recovery requires that the instance disk, network and all attached devices be available across the entire cluster. So if it’s using any local resource, it won’t come back up elsewhere.
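
If it helps, a quick way to sanity-check which resources Incus treats as cluster-wide is to list the managed networks and storage pools, for example:

```
# networks known to Incus; an external host bridge shows up as unmanaged
incus network list

# storage pools; a remote driver (such as ceph) is usable from any member
incus storage list
```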

Here it is:
```
architecture: x86_64
config:
  limits.cpu: "4"
  limits.memory: 32GiB
  migration.stateful: "true"
  volatile.cloud-init.instance-id: 38259100-a8fc-4cd6-9d08-0a97f5edd1d4
  volatile.eth0.hwaddr: 10:66:6a:21:ce:46
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 8e78584b-5bcd-4972-be60-18606923c75d
  volatile.uuid.generation: 8e78584b-5bcd-4972-be60-18606923c75d
  volatile.vm.definition: pc-q35-9.0
  volatile.vsock_id: "293501229"
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: vmbr0
    type: nic
  iso-volume:
    boot.priority: "10"
    pool: remote
    source: talos-pve.iso
    type: disk
  root:
    path: /
    pool: remote
    size: 128GiB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
```

I suspect `nictype=bridged` is the issue. This isn't an Incus-managed network, so Incus doesn't know that it will be available on all servers, which makes the instance unsuitable for automatic relocation both during evacuation and on failure.
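
If you want the NIC to be something Incus can reason about, one option (a rough sketch, using placeholder member names and a new managed bridge rather than your existing vmbr0) is to create a managed network on every member and re-attach the NIC to it:

```
# define the network on each cluster member, then instantiate it cluster-wide
incus network create incusbr0 --target node1
incus network create incusbr0 --target node2
incus network create incusbr0 --target node3
incus network create incusbr0

# swap the instance NIC from the unmanaged bridge to the managed network
incus config device remove NAME eth0
incus config device add NAME eth0 nic network=incusbr0 name=eth0
```

Note that this puts the VM on a different bridge than vmbr0, so it only makes sense if you don't specifically need that existing bridge.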

You could try setting `cluster.evacuate` to `live-migrate`, which overrides this behavior at least during normal `incus cluster evacuate` runs, and may also apply to the automatic recovery code path (which would then just turn into a regular migration).
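
Concretely, with NAME standing in for your instance:

```
# prefer live migration when the member is evacuated
incus config set NAME cluster.evacuate=live-migrate
```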