[LXD] Automated cluster healing

Project LXD
Status Implemented
Author(s) @monstermunchkin
Approver(s) @stgraber @tomp
Release 5.14
Internal ID LX040

Abstract

This specification adds automated cluster member evacuation: if a cluster member stays offline for a configurable amount of time, its remote-backed instances are migrated.

Rationale

Currently, offline cluster members are not evacuated automatically. Automatic evacuation would be beneficial when the offline member hosts remote-backed instances, because those can be migrated even while the member is offline.

Specification

Design

The following new configuration will be added:

  • cluster.healing_threshold

Automated cluster member evacuation is enabled by setting this configuration key to a positive integer. The value is the time in seconds after which an offline cluster member may be evacuated automatically. If it is lower than cluster.offline_threshold, the value of cluster.offline_threshold is used internally instead.
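The threshold rule can be illustrated with a minimal Go sketch. The function name effectiveHealingThreshold and the example values are assumptions made for illustration; this is not LXD's actual code.

```go
package main

import "fmt"

// effectiveHealingThreshold is a hypothetical helper (not LXD's real code).
// Both arguments are in seconds. It returns the delay that would be applied
// before an offline member may be evacuated, and whether healing is enabled.
func effectiveHealingThreshold(healingThreshold, offlineThreshold int) (int, bool) {
	// Only a positive value enables automated evacuation.
	if healingThreshold <= 0 {
		return 0, false
	}

	// A value below cluster.offline_threshold is replaced by that threshold.
	if healingThreshold < offlineThreshold {
		return offlineThreshold, true
	}

	return healingThreshold, true
}

func main() {
	// Example values only: healing threshold 10s, offline threshold 20s.
	seconds, enabled := effectiveHealingThreshold(10, 20)
	fmt.Println(seconds, enabled)
}
```

With these example values the offline threshold wins, so the sketch reports 20 seconds with healing enabled.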

If enabled, the cluster leader checks for offline members every minute and evacuates them if needed. Remote-backed instances are then migrated; local instances are ignored because they cannot be migrated.
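One pass of the leader's periodic check could look roughly like the Go sketch below. The member and instance types, their fields, and both function names are hypothetical simplifications. Skipping already-evacuated members and treating "offline for at least the threshold" as the trigger are also assumptions, not behaviour confirmed by this specification.

```go
package main

import "fmt"

// member is a hypothetical, simplified view of a cluster member.
type member struct {
	Name           string
	SecondsOffline int  // 0 means the member is online
	Evacuated      bool // assumed flag: member was already evacuated
}

// instance is a hypothetical, simplified view of an instance.
type instance struct {
	Name         string
	Member       string // name of the member hosting the instance
	RemoteBacked bool   // true if the instance uses remote storage
}

// membersToHeal returns the members that have been offline for at least
// threshold seconds and have not been evacuated yet.
func membersToHeal(members []member, threshold int) []string {
	var names []string
	for _, m := range members {
		if m.Evacuated || m.SecondsOffline == 0 {
			continue
		}
		if m.SecondsOffline >= threshold {
			names = append(names, m.Name)
		}
	}
	return names
}

// instancesToMigrate returns the remote-backed instances hosted on the given
// member. Local instances are skipped because they cannot be migrated.
func instancesToMigrate(instances []instance, memberName string) []string {
	var names []string
	for _, inst := range instances {
		if inst.Member == memberName && inst.RemoteBacked {
			names = append(names, inst.Name)
		}
	}
	return names
}

func main() {
	members := []member{
		{Name: "node1"},
		{Name: "node2", SecondsOffline: 120},
	}
	instances := []instance{
		{Name: "c1", Member: "node2", RemoteBacked: true},
		{Name: "c2", Member: "node2", RemoteBacked: false},
	}

	// One pass of the check; a leader would repeat this once a minute.
	for _, name := range membersToHeal(members, 60) {
		fmt.Println(name, instancesToMigrate(instances, name))
	}
}
```

In this example only node2 exceeds the 60-second threshold, and only its remote-backed instance c1 is selected for migration.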

Once the offline member comes back online, it won’t be restored automatically. This needs to be done manually.

API changes

No API changes.

CLI changes

No CLI changes.

Database changes

No database changes.

Upgrade handling

No upgrade handling.