Recovering cluster after update

stoyelq · April 12, 2024, 5:40pm

I am running a 6-node cluster. About a month ago, one of our nodes needed repairs and went offline. This was not an issue until today, when I noticed that my cluster was in a blocked state following the latest snap update to LXD 5.21: the offline node was the only one still on 5.20. After reading the docs more closely (How to manage a cluster - LXD documentation), I now see I should have removed that node from the cluster.

I am unsure how to go about recovering my cluster. Is it possible to remove the offline node from the cluster without access to lxc cluster remove, given that all the lxc commands hang?

Here are the nodes in question (glf-science-5 is the offline one):

stoyelq@glf-science-1:~$ lxd sql global "SELECT id, name, schema, api_extensions, heartbeat, state, arch FROM nodes"
+----+---------------+--------+----------------+-------------------------------------+-------+------+
| id |     name      | schema | api_extensions |              heartbeat              | state | arch |
+----+---------------+--------+----------------+-------------------------------------+-------+------+
| 1  | glf-science-1 | 73     | 382            | 2024-04-12T14:36:44.902403155-03:00 | 0     | 2    |
| 2  | glf-science-3 | 73     | 382            | 2024-04-12T14:36:48.828686825-03:00 | 0     | 2    |
| 4  | glf-science-2 | 73     | 382            | 2024-04-12T14:36:48.483196507-03:00 | 0     | 2    |
| 6  | glf-science-0 | 73     | 382            | 2024-04-12T14:36:47.26775565-03:00  | 0     | 2    |
| 7  | glf-science-4 | 73     | 382            | 2024-04-12T14:36:45.948198612-03:00 | 0     | 2    |
| 8  | glf-science-5 | 69     | 370            | 2024-03-01T10:30:03.816786445-04:00 | 0     | 2    |
+----+---------------+--------+----------------+-------------------------------------+-------+------+
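
For anyone landing here with the same symptom, the stale member can be picked out of a table like this mechanically. The following is a minimal sketch, not an LXD feature: lagging_members is a hypothetical helper name, and it assumes the query's first three columns are id, name and schema, printed in the table layout shown above.

```shell
# Hypothetical helper (not part of LXD): reads the table printed by
# `lxd sql global "SELECT id, name, schema ... FROM nodes"` on stdin and
# prints the name of every member whose schema is below the highest one seen.
lagging_members() {
  awk -F'|' '
    # Data rows are the ones with a numeric id in the first column.
    $2 ~ /^[ ]*[0-9]+[ ]*$/ {
      name = $3
      gsub(/ /, "", name)          # strip the column padding
      schema[name] = $4 + 0        # force numeric interpretation
      if (schema[name] > max) max = schema[name]
    }
    END {
      for (n in schema) if (schema[n] < max) print n
    }
  '
}
```

It would be used by piping the query output into it, for example `lxd sql global "SELECT id, name, schema FROM nodes" | lagging_members`, which for the table above should report glf-science-5.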

So far, I have also tried downgrading all the nodes back to 5.20 to see if that would let me remove the node, but all lxc commands failed with:

stoyelq@glf-science-3:~$ lxc ls
Error: Get "http://unix.socket/1.0": EOF

The daemon status also showed a stack of messages reading: Error: Failed to initialize global database: failed to ensure schema: this node's version is behind, please upgrade

stoyelq · April 15, 2024, 12:44pm

I ended up solving this by manually updating the database and then removing the offline node once the cluster came out of the locked state:

lxd sql global "UPDATE nodes SET schema=73, api_extensions=382 WHERE id=8"
lxc cluster remove --force glf-science-5
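
The same fix can be parameterised for a different member id or different version numbers. This is only a sketch under assumptions: build_fix_sql is a hypothetical helper name, and the schema and api_extensions values have to be read from an up-to-date member in your own nodes table (73 and 382 are simply the values from this thread).

```shell
# Hypothetical helper (not part of LXD): prints the UPDATE statement that
# marks one stale member as matching the rest of the cluster.
# Arguments: member id, target schema, target api_extensions (all integers).
build_fix_sql() {
  # Refuse anything that is not a plain non-negative integer, so that a
  # typo cannot end up inside the SQL statement.
  for v in "$1" "$2" "$3"; do
    case "$v" in
      ''|*[!0-9]*) echo "expected three integers: id schema api_extensions" >&2
                   return 1 ;;
    esac
  done
  printf 'UPDATE nodes SET schema=%s, api_extensions=%s WHERE id=%s\n' "$2" "$3" "$1"
}
```

With the values above, `lxd sql global "$(build_fix_sql 8 73 382)"` runs the same statement as the first command, and the lxc cluster remove --force step follows unchanged.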
