Recovering cluster after update

I am running a 6-node cluster. About a month ago, one of our nodes needed repairs and went offline. This was not an issue until today, when I noticed that my cluster was in a blocked state following the latest snap update to LXD 5.21; the offline node was the only one still on 5.20. After reading the docs more closely (How to manage a cluster - LXD documentation), I now see that I should have removed that node from the cluster.

I am unsure how to go about recovering my cluster. Is it possible to remove the offline node from the cluster without access to lxc cluster remove, given that all the lxc commands hang?

Here are the nodes in question (glf-science-5 is the offline one):

stoyelq@glf-science-1:~$ lxd sql global "SELECT id, name, schema, api_extensions, heartbeat, state, arch FROM nodes"
+----+---------------+--------+----------------+-------------------------------------+-------+------+
| id |     name      | schema | api_extensions |              heartbeat              | state | arch |
+----+---------------+--------+----------------+-------------------------------------+-------+------+
| 1  | glf-science-1 | 73     | 382            | 2024-04-12T14:36:44.902403155-03:00 | 0     | 2    |
| 2  | glf-science-3 | 73     | 382            | 2024-04-12T14:36:48.828686825-03:00 | 0     | 2    |
| 4  | glf-science-2 | 73     | 382            | 2024-04-12T14:36:48.483196507-03:00 | 0     | 2    |
| 6  | glf-science-0 | 73     | 382            | 2024-04-12T14:36:47.26775565-03:00  | 0     | 2    |
| 7  | glf-science-4 | 73     | 382            | 2024-04-12T14:36:45.948198612-03:00 | 0     | 2    |
| 8  | glf-science-5 | 69     | 370            | 2024-03-01T10:30:03.816786445-04:00 | 0     | 2    |
+----+---------------+--------+----------------+-------------------------------------+-------+------+
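The mismatch in the table (schema 69 and api_extensions 370 on glf-science-5, versus 73 and 382 everywhere else) can also be picked out mechanically. The following is only a rough sketch: find_lagging_nodes is a hypothetical helper name, and it assumes the table reaches stdin in the same layout as above, with id, name, schema and api_extensions as the first four columns. The sample rows are trimmed from the table in this post.

```shell
#!/bin/sh
# Sketch only: find_lagging_nodes is a hypothetical helper, not part of LXD.
# It reads an ASCII table like the one above on stdin and prints every node
# whose schema is lower than the highest schema found in the table.
find_lagging_nodes() {
    awk -F'|' '
        # Data rows are the ones whose first column (the id) is a number;
        # border lines and the header row are skipped.
        $2 ~ /^ *[0-9]+ *$/ {
            gsub(/ /, "", $3); gsub(/ /, "", $4); gsub(/ /, "", $5)
            n++
            name[n] = $3; schema[n] = $4 + 0; ext[n] = $5 + 0
            if (schema[n] > max) { max = schema[n] }
        }
        END {
            for (i = 1; i <= n; i++) {
                if (schema[i] < max) {
                    printf "%s schema=%d api_extensions=%d (cluster max schema=%d)\n", name[i], schema[i], ext[i], max
                }
            }
        }
    '
}

# Demo with two rows from the table in this post (columns trimmed).
find_lagging_nodes <<'EOF'
| id |     name      | schema | api_extensions |
| 1  | glf-science-1 | 73     | 382            |
| 8  | glf-science-5 | 69     | 370            |
EOF
# prints: glf-science-5 schema=69 api_extensions=370 (cluster max schema=73)

# On a live node the input would presumably come from the same query, e.g.:
#   lxd sql global "SELECT id, name, schema, api_extensions FROM nodes" | find_lagging_nodes
```

Any node it lists is one that the rest of the cluster is waiting on.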

So far, I have also tried downgrading all the nodes back to 5.20 to see if that would let me remove the node, but all lxc commands failed with:

stoyelq@glf-science-3:~$ lxc ls
Error: Get "http://unix.socket/1.0": EOF

The daemon's status also showed a stack of messages saying: Error: Failed to initialize global database: failed to ensure schema: this node's version is behind, please upgrade

I ended up solving this by manually updating the offline node's row in the global database so that its schema and api_extensions matched the other nodes, and then removing the offline node once the cluster came out of the blocked state:

lxd sql global "UPDATE nodes SET schema=73, api_extensions=382 WHERE id=8"
lxc cluster remove --force glf-science-5
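
For anyone repeating the two commands above on their own cluster, it may help to wrap them in a small dry-run script so the exact statements can be reviewed before anything touches the database. This is a sketch under assumptions: recover_cluster, run and the DRY_RUN variable are hypothetical names introduced here, and the values passed in (id 8, schema 73, api_extensions 382) are the ones from this thread and will differ elsewhere.

```shell
#!/bin/sh
# Sketch only: run, recover_cluster and DRY_RUN are hypothetical names.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

# Arguments: node id, node name, target schema, target api_extensions.
recover_cluster() {
    node_id="$1"; node_name="$2"; schema="$3"; api_ext="$4"

    # Refuse anything that is not a plain non-negative integer, since these
    # values are interpolated into the SQL statement below.
    for v in "$node_id" "$schema" "$api_ext"; do
        case "$v" in
            ''|*[!0-9]*) echo "not a number: '$v'" >&2; return 1 ;;
        esac
    done

    # Step 1: make the offline node's row match the rest of the cluster.
    run lxd sql global "UPDATE nodes SET schema=$schema, api_extensions=$api_ext WHERE id=$node_id"
    # Step 2: once the cluster is unblocked, drop the offline node.
    run lxc cluster remove --force "$node_name"
}

# Values taken from this thread; substitute your own.
recover_cluster 8 glf-science-5 73 382
```

Setting DRY_RUN=0 in the environment makes the script actually execute the two commands, in the same order as above.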