Hello everyone,
I have a small LXD cluster with 3 nodes and local btrfs storage, set up as an experiment. I just wanted to simulate what happens when one of the nodes fails. I don’t need distributed storage; container backups and manual recovery are good enough for our purposes.
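For context, each node carries its own btrfs pool. In a cluster such a pool is defined per member first and then finalized once; it was set up roughly like this (the pool name “local” is just an example, /mnt/lxd is the mount point shown further below):

lxc storage create local btrfs source=/mnt/lxd --target m2cluster-node01
lxc storage create local btrfs source=/mnt/lxd --target m2cluster-node02
lxc storage create local btrfs source=/mnt/lxd --target m2cluster-node03
lxc storage create local btrfs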
So I halted node02:
+------------------+---------------------------+-----------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------+
| NAME | URL | ROLES | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE | MESSAGE |
+------------------+---------------------------+-----------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------+
| m2cluster-node01 | https://192.168.64.6:8443 | database-leader | aarch64 | default | | ONLINE | Fully operational |
| | | database | | | | | |
+------------------+---------------------------+-----------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------+
| m2cluster-node02 | https://192.168.64.7:8443 | database | aarch64 | default | | OFFLINE | No heartbeat for 4m31.479229085s (2023-04-22 17:39:42.305293643 +0000 UTC) |
+------------------+---------------------------+-----------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------+
| m2cluster-node03 | https://192.168.64.8:8443 | database | aarch64 | default | | ONLINE | Fully operational |
+------------------+---------------------------+-----------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------+
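As far as I know, a member is flagged OFFLINE once it misses heartbeats for longer than the cluster.offline_threshold server option (20 seconds by default), which can be checked or raised with:

lxc config get cluster.offline_threshold
lxc config set cluster.offline_threshold 60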
The container “deploy”, which was running on node02, now reports ERROR status:
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS | LOCATION |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| admin | RUNNING | 192.168.64.14 (eth0) | fdad:f3da:86ea:f4b3:216:3eff:fe9e:c48 (eth0) | CONTAINER | 0 | m2cluster-node03 |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| backend | RUNNING | 192.168.64.15 (eth0) | fdad:f3da:86ea:f4b3:216:3eff:fe29:c383 (eth0) | CONTAINER | 0 | m2cluster-node01 |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| deploy | ERROR | | | CONTAINER | 0 | m2cluster-node02 |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| public | RUNNING | 192.168.64.13 (eth0) | fdad:f3da:86ea:f4b3:216:3eff:fe1c:e20 (eth0) | CONTAINER | 0 | m2cluster-node03 |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
My intention here was to restore “deploy” from a container backup.
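For completeness: the deploy.img file imported further below would have been produced earlier, while the container was still healthy, with an ordinary instance export, along these lines:

lxc export deploy deploy.img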
Before doing that, though, I tried a few other things:
- Evacuate node02:
[ rc0 ]-[root@m2cluster-node01]-[~] # lxc cluster evacuate m2cluster-node02
Are you sure you want to evacuate cluster member "m2cluster-node02"? (yes/no) [default=no]: yes
Error: Failed to update cluster member state: Missing event connection with target cluster member
- Move the unavailable container to node01:
[ rc0 ]-[root@m2cluster-node01]-[~] # lxc move deploy --target m2cluster-node01
Error: Failed loading instance storage pool: Failed getting instance storage pool name: Instance storage pool not found
- Delete the unavailable container from the cluster:
[ rc0 ]-[root@m2cluster-node01]-[~] # lxc delete deploy
Error: Failed checking instance exists "local:deploy": Missing event connection with target cluster member
So I finally restored the container to node01 from backup:
[ rc0 ]-[root@m2cluster-node01]-[~] # lxc import deploy.img
[ rc0 ]-[root@m2cluster-node01]-[~] # lxc list
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS | LOCATION |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| deploy | STOPPED | | | CONTAINER | 0 | m2cluster-node01 |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
[ rc0 ]-[root@m2cluster-node01]-[~] # lxc start deploy
[ rc0 ]-[root@m2cluster-node01]-[~] # lxc list
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS | LOCATION |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| deploy | RUNNING | 192.168.64.12 (eth0) | fdad:f3da:86ea:f4b3:216:3eff:fef3:b64e (eth0) | CONTAINER | 0 | m2cluster-node01 |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
So far, so good. Then I brought node02 back up. The cluster member state recovered, and the restored container is running in its new location on node01.
[ rc0 ]-[root@m2cluster-node01]-[~] # lxc cluster list
+------------------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| NAME | URL | ROLES | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE | MESSAGE |
+------------------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| m2cluster-node01 | https://192.168.64.6:8443 | database-leader | aarch64 | default | | ONLINE | Fully operational |
| | | database | | | | | |
+------------------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| m2cluster-node02 | https://192.168.64.7:8443 | database | aarch64 | default | | ONLINE | Fully operational |
+------------------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| m2cluster-node03 | https://192.168.64.8:8443 | database | aarch64 | default | | ONLINE | Fully operational |
+------------------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
[ rc0 ]-[root@m2cluster-node01]-[~] # lxc list
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS | LOCATION |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| admin | RUNNING | 192.168.64.14 (eth0) | fdad:f3da:86ea:f4b3:216:3eff:fe9e:c48 (eth0) | CONTAINER | 0 | m2cluster-node03 |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| backend | RUNNING | 192.168.64.15 (eth0) | fdad:f3da:86ea:f4b3:216:3eff:fe29:c383 (eth0) | CONTAINER | 0 | m2cluster-node01 |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| deploy | RUNNING | 192.168.64.12 (eth0) | fdad:f3da:86ea:f4b3:216:3eff:fef3:b64e (eth0) | CONTAINER | 0 | m2cluster-node01 |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| public | RUNNING | 192.168.64.13 (eth0) | fdad:f3da:86ea:f4b3:216:3eff:fe1c:e20 (eth0) | CONTAINER | 0 | m2cluster-node03 |
+---------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
Meanwhile, node02 still has the old storage subvolume for the “deploy” container:
[ rc0 ]-[root@m2cluster-node02]-[~] # btrfs su li /mnt/lxd
ID 257 gen 51 top level 5 path images/4ba589a5d05a4cc...
ID 271 gen 112 top level 5 path containers/web-dev_deploy
Next, I tried to move the container back to its original location, node02:
[ rc0 ]-[root@m2cluster-node01]-[~] # lxc move deploy --target m2cluster-node02
Error: Rename instance operation failed: Rename instance: UNIQUE constraint failed: storage_volumes.storage_pool_id, storage_volumes.node_id, storage_volumes.project_id, storage_volumes.name, storage_volumes.type
[ rc1 ]-[root@m2cluster-node01]-[~] # lxc list
+--------------------------------------------------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS | LOCATION |
+--------------------------------------------------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| admin | RUNNING | 192.168.64.14 (eth0) | fdad:f3da:86ea:f4b3:216:3eff:fe9e:c48 (eth0) | CONTAINER | 0 | m2cluster-node03 |
+--------------------------------------------------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| backend | RUNNING | 192.168.64.15 (eth0) | fdad:f3da:86ea:f4b3:216:3eff:fe29:c383 (eth0) | CONTAINER | 0 | m2cluster-node01 |
+--------------------------------------------------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| lxd-move-of-869f0171-403e-4950-85d5-624016f6faf7 | STOPPED | | | CONTAINER | 0 | m2cluster-node02 |
+--------------------------------------------------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
| public | RUNNING | 192.168.64.13 (eth0) | fdad:f3da:86ea:f4b3:216:3eff:fe1c:e20 (eth0) | CONTAINER | 0 | m2cluster-node03 |
+--------------------------------------------------+---------+----------------------+-----------------------------------------------+-----------+-----------+------------------+
[ rc0 ]-[root@m2cluster-node02]-[~] # btrfs su li /mnt/lxd
ID 257 gen 51 top level 5 path images/4ba589a5d05a4cc...
ID 271 gen 112 top level 5 path containers/web-dev_deploy
ID 272 gen 120 top level 5 path containers/web-dev_lxd-move-of-869f0171-403e-4950-85d5-624016f6faf7
I can’t rename the container back to its original name, even after deleting the storage subvolume. It seems the database must be cleaned up somehow.
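At least the conflicting record can be inspected through LXD’s internal database (the table and column names come straight from the UNIQUE constraint error above); what I don’t know is whether deleting such a row by hand is a supported way out:

lxd sql global "SELECT * FROM storage_volumes WHERE name = 'deploy';"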
Is there a better procedure for node failover? Should I simply wipe all the LXD content, create a new node, and join it to the cluster?
Thank you very much - and sorry for the long description.