How to achieve container autorestart on server failure

Andrei_Goldchleger · July 31, 2025, 9:16pm

My setup is the following:

Three physical servers running Ubuntu 24.04.2 LTS
Incus 6.14 installed from the Zabbly repository

Storage:

Ceph RBD for instances/images
Three CephFS subvolumes, each attached to a different container

Instances:

Three ubuntu/24.04 containers, each running Samba and serving one of the aforementioned CephFS subvolumes

Networking:

OVN for Incus networking
Keepalived for container virtual IPs
Dummy interfaces for the virtual IPs, so Incus can start even if the virtual IP is not yet available
Incus proxies mapping port 445 (SMB) between the OVN IP and the virtual IP

What I would like to achieve: in case of server failure, I would like to automatically “move” the instances that were running on the affected server to another server. I set cluster.healing_threshold to 20 and yanked the network cable from one of the servers. However, the container did not move.

What am I missing? Do I need an extra piece of software to accomplish my goal?

Thanks.

stgraber · July 31, 2025, 10:21pm

I think the issue would be the proxy devices as those are treated as machine-specific and so make the instance be excluded from cross-server migrations.

What happens if you do incus cluster evacuate SERVER?
I’d expect the instances to just get stopped, not moved and restarted.

If that’s the case, then that’s the problem.
You should be able to force the behavior you want by setting the cluster.evacuate config key on the instances (or through a profile). Setting that to migrate should then cause the instances to be stopped, moved and restarted elsewhere during evacuation.
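
As a rough sketch of that, assuming a container called smb1 and a profile called samba (both placeholder names, not taken from this thread), the key can be set per instance or once in a profile. The INCUS variable is an illustrative addition so the commands can be dry-run with echo first:

```shell
#!/bin/sh
# Sketch only: instance and profile names passed in are placeholders.
# Override INCUS (e.g. INCUS="echo incus") to print commands instead of running them.
INCUS="${INCUS:-incus}"

# Set cluster.evacuate=migrate on a single instance.
evacuate_migrate_instance() {
    $INCUS config set "$1" cluster.evacuate=migrate
}

# Set it in a profile so every instance using that profile inherits it.
evacuate_migrate_profile() {
    $INCUS profile set "$1" cluster.evacuate=migrate
}
```

Call it as evacuate_migrate_instance smb1 or evacuate_migrate_profile samba. Afterwards, incus config show smb1 --expanded should list cluster.evacuate: migrate whichever way it was set.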

Auto-healing is basically just a special case of evacuation as far as we’re concerned: effectively an automatic evacuation of any instance whose storage is available throughout the cluster and which doesn’t depend on server-local resources. Setting cluster.evacuate to migrate should bypass most of the checks (except for storage) and have those instances relocated on failure.
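
Since healing reuses the evacuation logic, one way to check the outcome without pulling cables is to evacuate a server by hand and look at where the instances end up. A minimal sketch, with the member name as a placeholder and the INCUS variable again only there to allow a dry run:

```shell
#!/bin/sh
# Sketch only: the cluster member name passed in is a placeholder.
# Override INCUS (e.g. INCUS="echo incus") to print commands instead of running them.
INCUS="${INCUS:-incus}"

# Evacuate a member, show where each instance now lives, then bring the member back.
test_evacuation() {
    member="$1"
    $INCUS cluster evacuate "$member" --force   # --force skips the confirmation prompt
    $INCUS list -c nsL                          # columns: name, state, location
    $INCUS cluster restore "$member" --force    # brings the evacuated instances back
}
```

Running test_evacuation server1 on a cluster where the instances have cluster.evacuate set to migrate should show them in the location column of another member between the evacuate and restore steps.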

Andrei_Goldchleger · July 31, 2025, 10:51pm

Yes, you are correct. I just tested incus cluster evacuate and that is exactly what happened: the instance was stopped, not moved.

I then set cluster.evacuate to migrate, and the container migrated to another server. Super cool, thank you!

So does cluster.healing_threshold play any role in the migration, or is it used for other purposes?

stgraber · August 1, 2025, 1:38am

cluster.healing_threshold controls when Incus will trigger an automatic evacuation of an offline server.

So setting it to 120 should have Incus recognize a server as properly dead after 2 minutes and then use the evacuation logic to relocate as many of the instances as possible.
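
For reference, a small sketch of setting that threshold and reading it back. The value is in seconds and applies cluster-wide; the INCUS variable is an illustrative addition for dry runs, not something Incus itself requires:

```shell
#!/bin/sh
# Sketch only. Override INCUS (e.g. INCUS="echo incus") to print commands instead of running them.
INCUS="${INCUS:-incus}"

# Set the auto-healing threshold (in seconds) and read the value back.
set_healing_threshold() {
    $INCUS config set cluster.healing_threshold "$1"
    $INCUS config get cluster.healing_threshold
}
```

For example, set_healing_threshold 120 would match the two-minute case described above.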