[LXD] Automated cluster healing

Project: LXD
Status: Implemented
Author(s): @monstermunchkin
Approver(s): @stgraber @tomp
Release: 5.14
Internal ID: LX040

Abstract

This adds automated cluster member evacuation which migrates remote-backed instances if a cluster member is offline for a certain amount of time.

Rationale

Currently, offline cluster members are not automatically evacuated. Automatic evacuation would be beneficial when the offline member hosts remote-backed instances, since these can be migrated even while the member is offline.

Specification

Design

The following new configuration will be added:

  • cluster.healing_threshold

Automated cluster member evacuation is enabled by setting this configuration key to a positive integer. The value is the time in seconds after which an offline cluster member may be evacuated automatically. If the value is lower than cluster.offline_threshold, the value of cluster.offline_threshold is used internally instead.
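The interaction between the two thresholds can be illustrated with a minimal Python sketch. This is not LXD's actual code: the function name effective_healing_threshold is hypothetical, and treating any non-positive value as "disabled" is an assumption based on the statement that a positive integer enables the feature.

```python
from typing import Optional


def effective_healing_threshold(healing_threshold: int,
                                offline_threshold: int) -> Optional[int]:
    """Return the number of seconds a member must be offline before it
    may be evacuated, or None when automated healing is disabled.

    Hypothetical helper for illustration only.
    """
    # Assumption: a value that is not a positive integer disables healing.
    if healing_threshold <= 0:
        return None
    # A healing threshold below cluster.offline_threshold is replaced
    # by cluster.offline_threshold.
    return max(healing_threshold, offline_threshold)
```

For example, with cluster.offline_threshold at 20 and cluster.healing_threshold at 10, the sketch yields 20; with cluster.healing_threshold at 300 it yields 300.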

If enabled, the cluster leader checks for offline members every minute, and evacuates those members if needed. Remote-backed instances are then migrated, and local instances are ignored as those cannot be migrated.
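The leader's periodic check described above could look roughly like the following Python sketch. The Member and Instance types, their field names, and the plan_healing function are all hypothetical, and whether the comparison against the threshold is inclusive is an assumption; the sketch only shows the selection logic, not how LXD performs the migration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Instance:
    name: str
    remote_backed: bool  # True if the root disk lives on remote storage


@dataclass
class Member:
    name: str
    offline_seconds: int  # 0 while the member is online
    instances: List[Instance] = field(default_factory=list)


def plan_healing(members: List[Member],
                 threshold_seconds: int) -> List[Tuple[str, List[str]]]:
    """Return (member name, instances to migrate) for each member that
    has been offline for at least threshold_seconds."""
    plan = []
    for member in members:
        if member.offline_seconds < threshold_seconds:
            continue  # online, or not offline long enough yet
        # Only remote-backed instances can be moved; local ones are ignored.
        movable = [i.name for i in member.instances if i.remote_backed]
        plan.append((member.name, movable))
    return plan
```

In this sketch, a member that has been offline for 400 seconds with one remote-backed and one local instance, checked against a 300-second threshold, yields a plan containing only the remote-backed instance.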

Once the offline member comes back online, it won’t be restored automatically. This needs to be done manually.

API changes

No API changes.

CLI changes

No CLI changes.

Database changes

No database changes.

Upgrade handling

No upgrade handling.

This works great, thanks for this new feature.

I have a question: is there an option to make the automatically evacuated instances start on their own? The boot.autostart parameter does not appear to start automatically evacuated instances.

Thanks

If the instances were previously running, they should be started once they are healed onto another cluster member.

Hello, thanks for the answer. I tested it in a lab, and after evacuation the instances stay in the STOPPED state.
Cluster:


Instances:

When I force stop node 2:


After evacuation, the instances stay in the STOPPED state.

Is this a bug, or is there a setting that I am missing?

Are you running the snap package? What is the version shown in snap info lxd?

Hello, I am now on version 5.15, and in this version the instances start correctly after evacuation. In the previous version, 5.14, the instances were evacuated but stayed in the STOPPED state. Thanks for the help, it is now working really well.
