Recovering cluster after upgrade failure due to some members permanently down

root@kube01:~# sudo snap list lxd
Name Version Rev Tracking Publisher Notes
lxd 5.7-c62733b 23893 latest/stable canonical✓ -

All nodes are on 5.7-c62733b, except kube02, which is down because its hardware is broken. I would like to take it out of the cluster, but lxc cluster remove --force kube02 just keeps running without any feedback.

root@kube01:~# lxd sql global "SELECT * FROM nodes"
+----+--------+-------------+---------------------+--------+----------------+--------------------------------+-------+------+-------------------+
| id |  name  | description |       address       | schema | api_extensions |           heartbeat            | state | arch | failure_domain_id |
+----+--------+-------------+---------------------+--------+----------------+--------------------------------+-------+------+-------------------+
| 1  | kube01 |             | 192.168.178.70:8443 | 66     | 332            | 2022-10-30T08:51:26.255385898Z | 0     | 4    | <nil>             |
| 2  | kube02 |             | 192.168.178.78:8443 | 66     | 327            | 2022-10-17T21:04:55.763750682Z | 0     | 4    | <nil>             |
| 3  | kube05 |             | 192.168.178.74:8443 | 66     | 332            | 2022-10-30T08:51:23.891626379Z | 0     | 4    | <nil>             |
| 4  | kube06 |             | 192.168.178.69:8443 | 66     | 332            | 2022-10-30T08:04:56.88908946Z  | 0     | 4    | <nil>             |
| 5  | kube04 |             | 192.168.178.73:8443 | 66     | 332            | 2022-10-30T08:51:22.58399337Z  | 0     | 4    | <nil>             |
| 6  | kube03 |             | 192.168.178.79:8443 | 66     | 332            | 2022-10-30T08:51:26.313618077Z | 0     | 4    | <nil>             |
+----+--------+-------------+---------------------+--------+----------------+--------------------------------+-------+------+-------------------+

On all nodes, /var/snap/lxd/common/lxd/logs/lxd.log shows:

time="2022-10-30T08:30:07Z" level=warning msg=" - Couldn't find the CGroup network priority controller, network priority will be ignored"
time="2022-10-30T08:30:16Z" level=warning msg="Wait for other cluster nodes to upgrade their versions, cluster not started yet"
time="2022-10-30T08:31:20Z" level=warning msg="Wait for other cluster nodes to upgrade their versions, cluster not started yet"
time="2022-10-30T08:32:23Z" level=warning msg="Wait for other cluster nodes to upgrade their versions, cluster not started yet"
time="2022-10-30T08:49:14Z" level=warning msg="Wait for other cluster nodes to upgrade their versions, cluster not started yet"
root@kube01:~# ps aux | grep lxd
root         914  0.0  0.0   2060  1372 ?        Ss   08:29   0:00 /bin/sh /snap/lxd/23893/commands/daemon.activate
root        1068  0.1  1.1 1354104 44228 ?       Sl   08:29   0:03 lxd activateifneeded
root        1102  0.0  0.0   2060  1388 ?        Ss   08:30   0:00 /bin/sh /snap/lxd/23893/commands/daemon.start
root        1279  0.0  0.0 152728  1508 ?        Sl   08:30   0:00 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
root        1292  1.2  2.8 1575876 109304 ?      Sl   08:30   0:19 lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
root        1293  0.0  0.9 1352796 36824 ?       Sl   08:30   0:00 lxd waitready
root        1294  0.1  0.0   2060   696 ?        S    08:30   0:02 /bin/sh /snap/lxd/23893/commands/daemon.start
root        3186  0.0  0.0   6420  1840 pts/1    S+   08:56   0:00 grep --color=auto lxd

My problem is that I cannot do anything useful right now. I can reboot all nodes, but as soon as I run something like “lxc list” or “lxc network list”, the command just becomes unresponsive.

Try running:

sudo snap refresh lxd --cohort="+"

Run this on each cluster member. It should ensure they are all running the same version, which is required for cluster operation.
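
A minimal sketch of applying the refresh to every reachable member in one pass. The host list comes from the nodes table earlier in this thread (kube02 is left out because it is offline); the use of ssh, the MEMBERS variable, and the DRY_RUN switch are assumptions for illustration, not part of LXD or snap.

```shell
# Hypothetical helper: run the cohort refresh on each reachable cluster member.
# MEMBERS and DRY_RUN are illustrative names; adjust the host list to your cluster.
MEMBERS="${MEMBERS:-kube01 kube03 kube04 kube05 kube06}"

refresh_all() {
    cmd='sudo snap refresh lxd --cohort="+"'
    for host in $MEMBERS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only print what would be executed.
            printf 'ssh %s %s\n' "$host" "$cmd"
        else
            # Assumes passwordless ssh and sudo on each member.
            ssh "$host" "$cmd" || echo "refresh failed on $host" >&2
        fi
    done
}
```

Calling refresh_all prints the commands without running them; DRY_RUN=0 refresh_all would execute them over ssh.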

See also Bug #1990954 “snap info <package> sometimes shows conflicting ve...” : Bugs : Snap Store Server

I applied it to all nodes and rebooted, but the problem remains:

installed:          5.7-c62733b              (23893) 138MB in-cohort

time="2022-10-30T15:17:16Z" level=warning msg="Wait for other cluster nodes to upgrade their versions, cluster not started yet"

I suspect a network configuration problem, but since I can’t run anything like lxc network …, I am out of ideas.

Please show the full output of “snap info lxd” on every member.

kube01:~$ snap info lxd

snap-id:      J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:     latest/stable
refresh-date: yesterday at 13:39 UTC
channels:
  latest/stable:    5.7-c62733b   2022-10-29 (23893) 138MB -
  latest/candidate: 5.7-c62733b   2022-10-28 (23893) 138MB -
  latest/beta:      ↑
  latest/edge:      git-ff53106   2022-10-28 (23902) 138MB -
  5.7/stable:       5.7-c62733b   2022-10-29 (23893) 138MB -
  5.7/candidate:    5.7-c62733b   2022-10-28 (23893) 138MB -
  5.7/beta:         ↑
  5.7/edge:         ↑
  5.6/stable:       5.6-794016a   2022-09-28 (23687) 137MB -
  5.6/candidate:    5.6-794016a   2022-09-23 (23687) 137MB -
  5.6/beta:         ↑
  5.6/edge:         ↑
  5.5/stable:       5.5-37534be   2022-08-27 (23543) 112MB -
  5.5/candidate:    5.5-37534be   2022-08-19 (23543) 112MB -
  5.5/beta:         ↑
  5.5/edge:         ↑
  5.4/stable:       5.4-1ff8d34   2022-08-13 (23371) 106MB -
  5.4/candidate:    5.4-3bf11b7   2022-08-12 (23456) 107MB -
  5.4/beta:         ↑
  5.4/edge:         ↑
  5.3/stable:       5.3-91e042b   2022-07-06 (23274) 106MB -
  5.3/candidate:    5.3-91e042b   2022-07-03 (23274) 106MB -
  5.3/beta:         ↑
  5.3/edge:         ↑
  5.0/stable:       5.0.1-9dcf35b 2022-08-24 (23545) 106MB -
  5.0/candidate:    5.0.1-9dcf35b 2022-08-19 (23545) 106MB -
  5.0/beta:         ↑
  5.0/edge:         git-13e1e53   2022-08-21 (23567) 111MB -
  4.0/stable:       4.0.9-8e2046b 2022-03-26 (22761)  63MB -
  4.0/candidate:    4.0.9-dea944b 2022-09-27 (23697)  65MB -
  4.0/beta:         ↑
  4.0/edge:         git-407205d   2022-03-31 (22805)  65MB -
  3.0/stable:       3.0.4         2019-10-10 (11376)  49MB -
  3.0/candidate:    3.0.4         2019-10-10 (11376)  49MB -
  3.0/beta:         ↑
  3.0/edge:         git-81b81b9   2019-10-10 (11378)  49MB -
installed:          5.7-c62733b              (23893) 138MB in-cohort
kube02 is down (hardware defect)
kube03:~$ snap info lxd
snap-id:      J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:     latest/stable
refresh-date: yesterday at 13:39 UTC
channels:
  latest/stable:    5.7-c62733b   2022-10-29 (23893) 138MB -
  latest/candidate: 5.7-c62733b   2022-10-28 (23893) 138MB -
  latest/beta:      ↑
  latest/edge:      git-ff53106   2022-10-28 (23902) 138MB -
  5.7/stable:       5.7-c62733b   2022-10-29 (23893) 138MB -
  5.7/candidate:    5.7-c62733b   2022-10-28 (23893) 138MB -
  5.7/beta:         ↑
  5.7/edge:         ↑
  5.6/stable:       5.6-794016a   2022-09-28 (23687) 137MB -
  5.6/candidate:    5.6-794016a   2022-09-23 (23687) 137MB -
  5.6/beta:         ↑
  5.6/edge:         ↑
  5.5/stable:       5.5-37534be   2022-08-27 (23543) 112MB -
  5.5/candidate:    5.5-37534be   2022-08-19 (23543) 112MB -
  5.5/beta:         ↑
  5.5/edge:         ↑
  5.4/stable:       5.4-1ff8d34   2022-08-13 (23371) 106MB -
  5.4/candidate:    5.4-3bf11b7   2022-08-12 (23456) 107MB -
  5.4/beta:         ↑
  5.4/edge:         ↑
  5.3/stable:       5.3-91e042b   2022-07-06 (23274) 106MB -
  5.3/candidate:    5.3-91e042b   2022-07-03 (23274) 106MB -
  5.3/beta:         ↑
  5.3/edge:         ↑
  5.0/stable:       5.0.1-9dcf35b 2022-08-24 (23545) 106MB -
  5.0/candidate:    5.0.1-9dcf35b 2022-08-19 (23545) 106MB -
  5.0/beta:         ↑
  5.0/edge:         git-13e1e53   2022-08-21 (23567) 111MB -
  4.0/stable:       4.0.9-8e2046b 2022-03-26 (22761)  63MB -
  4.0/candidate:    4.0.9-dea944b 2022-09-27 (23697)  65MB -
  4.0/beta:         ↑
  4.0/edge:         git-407205d   2022-03-31 (22805)  65MB -
  3.0/stable:       3.0.4         2019-10-10 (11376)  49MB -
  3.0/candidate:    3.0.4         2019-10-10 (11376)  49MB -
  3.0/beta:         ↑
  3.0/edge:         git-81b81b9   2019-10-10 (11378)  49MB -
installed:          5.7-c62733b              (23893) 138MB in-cohort
kube04:~$ snap info lxd
snap-id:      J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:     latest/stable
refresh-date: yesterday at 13:39 UTC
channels:
  latest/stable:    5.7-c62733b   2022-10-29 (23893) 138MB -
  latest/candidate: 5.7-c62733b   2022-10-28 (23893) 138MB -
  latest/beta:      ↑
  latest/edge:      git-ff53106   2022-10-28 (23902) 138MB -
  5.7/stable:       5.7-c62733b   2022-10-29 (23893) 138MB -
  5.7/candidate:    5.7-c62733b   2022-10-28 (23893) 138MB -
  5.7/beta:         ↑
  5.7/edge:         ↑
  5.6/stable:       5.6-794016a   2022-09-28 (23687) 137MB -
  5.6/candidate:    5.6-794016a   2022-09-23 (23687) 137MB -
  5.6/beta:         ↑
  5.6/edge:         ↑
  5.5/stable:       5.5-37534be   2022-08-27 (23543) 112MB -
  5.5/candidate:    5.5-37534be   2022-08-19 (23543) 112MB -
  5.5/beta:         ↑
  5.5/edge:         ↑
  5.4/stable:       5.4-1ff8d34   2022-08-13 (23371) 106MB -
  5.4/candidate:    5.4-3bf11b7   2022-08-12 (23456) 107MB -
  5.4/beta:         ↑
  5.4/edge:         ↑
  5.3/stable:       5.3-91e042b   2022-07-06 (23274) 106MB -
  5.3/candidate:    5.3-91e042b   2022-07-03 (23274) 106MB -
  5.3/beta:         ↑
  5.3/edge:         ↑
  5.0/stable:       5.0.1-9dcf35b 2022-08-24 (23545) 106MB -
  5.0/candidate:    5.0.1-9dcf35b 2022-08-19 (23545) 106MB -
  5.0/beta:         ↑
  5.0/edge:         git-13e1e53   2022-08-21 (23567) 111MB -
  4.0/stable:       4.0.9-8e2046b 2022-03-26 (22761)  63MB -
  4.0/candidate:    4.0.9-dea944b 2022-09-27 (23697)  65MB -
  4.0/beta:         ↑
  4.0/edge:         git-407205d   2022-03-31 (22805)  65MB -
  3.0/stable:       3.0.4         2019-10-10 (11376)  49MB -
  3.0/candidate:    3.0.4         2019-10-10 (11376)  49MB -
  3.0/beta:         ↑
  3.0/edge:         git-81b81b9   2019-10-10 (11378)  49MB -
installed:          5.7-c62733b              (23893) 138MB in-cohort
kube05:~$ snap info lxd
snap-id:      J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:     latest/stable
refresh-date: yesterday at 13:40 UTC
channels:
  latest/stable:    5.7-c62733b   2022-10-29 (23893) 138MB -
  latest/candidate: 5.7-c62733b   2022-10-28 (23893) 138MB -
  latest/beta:      ↑
  latest/edge:      git-ff53106   2022-10-28 (23902) 138MB -
  5.7/stable:       5.7-c62733b   2022-10-29 (23893) 138MB -
  5.7/candidate:    5.7-c62733b   2022-10-28 (23893) 138MB -
  5.7/beta:         ↑
  5.7/edge:         ↑
  5.6/stable:       5.6-794016a   2022-09-28 (23687) 137MB -
  5.6/candidate:    5.6-794016a   2022-09-23 (23687) 137MB -
  5.6/beta:         ↑
  5.6/edge:         ↑
  5.5/stable:       5.5-37534be   2022-08-27 (23543) 112MB -
  5.5/candidate:    5.5-37534be   2022-08-19 (23543) 112MB -
  5.5/beta:         ↑
  5.5/edge:         ↑
  5.4/stable:       5.4-1ff8d34   2022-08-13 (23371) 106MB -
  5.4/candidate:    5.4-3bf11b7   2022-08-12 (23456) 107MB -
  5.4/beta:         ↑
  5.4/edge:         ↑
  5.3/stable:       5.3-91e042b   2022-07-06 (23274) 106MB -
  5.3/candidate:    5.3-91e042b   2022-07-03 (23274) 106MB -
  5.3/beta:         ↑
  5.3/edge:         ↑
  5.0/stable:       5.0.1-9dcf35b 2022-08-24 (23545) 106MB -
  5.0/candidate:    5.0.1-9dcf35b 2022-08-19 (23545) 106MB -
  5.0/beta:         ↑
  5.0/edge:         git-13e1e53   2022-08-21 (23567) 111MB -
  4.0/stable:       4.0.9-8e2046b 2022-03-26 (22761)  63MB -
  4.0/candidate:    4.0.9-dea944b 2022-09-27 (23697)  65MB -
  4.0/beta:         ↑
  4.0/edge:         git-407205d   2022-03-31 (22805)  65MB -
  3.0/stable:       3.0.4         2019-10-10 (11376)  49MB -
  3.0/candidate:    3.0.4         2019-10-10 (11376)  49MB -
  3.0/beta:         ↑
  3.0/edge:         git-81b81b9   2019-10-10 (11378)  49MB -
installed:          5.7-c62733b              (23893) 138MB in-cohort
kube06:~$ snap info lxd
snap-id:      J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:     latest/stable
refresh-date: yesterday at 16:35 CEST
channels:
  latest/stable:    5.7-c62733b   2022-10-29 (23893) 138MB -
  latest/candidate: 5.7-c62733b   2022-10-28 (23893) 138MB -
  latest/beta:      ↑
  latest/edge:      git-ff53106   2022-10-28 (23902) 138MB -
  5.7/stable:       5.7-c62733b   2022-10-29 (23893) 138MB -
  5.7/candidate:    5.7-c62733b   2022-10-28 (23893) 138MB -
  5.7/beta:         ↑
  5.7/edge:         ↑
  5.6/stable:       5.6-794016a   2022-09-28 (23687) 137MB -
  5.6/candidate:    5.6-794016a   2022-09-23 (23687) 137MB -
  5.6/beta:         ↑
  5.6/edge:         ↑
  5.5/stable:       5.5-37534be   2022-08-27 (23543) 112MB -
  5.5/candidate:    5.5-37534be   2022-08-19 (23543) 112MB -
  5.5/beta:         ↑
  5.5/edge:         ↑
  5.4/stable:       5.4-1ff8d34   2022-08-13 (23371) 106MB -
  5.4/candidate:    5.4-3bf11b7   2022-08-12 (23456) 107MB -
  5.4/beta:         ↑
  5.4/edge:         ↑
  5.3/stable:       5.3-91e042b   2022-07-06 (23274) 106MB -
  5.3/candidate:    5.3-91e042b   2022-07-03 (23274) 106MB -
  5.3/beta:         ↑
  5.3/edge:         ↑
  5.0/stable:       5.0.1-9dcf35b 2022-08-24 (23545) 106MB -
  5.0/candidate:    5.0.1-9dcf35b 2022-08-19 (23545) 106MB -
  5.0/beta:         ↑
  5.0/edge:         git-13e1e53   2022-08-21 (23567) 111MB -
  4.0/stable:       4.0.9-8e2046b 2022-03-26 (22761)  63MB -
  4.0/candidate:    4.0.9-dea944b 2022-09-27 (23697)  65MB -
  4.0/beta:         ↑
  4.0/edge:         git-407205d   2022-03-31 (22805)  65MB -
  3.0/stable:       3.0.4         2019-10-10 (11376)  49MB -
  3.0/candidate:    3.0.4         2019-10-10 (11376)  49MB -
  3.0/beta:         ↑
  3.0/edge:         git-81b81b9   2019-10-10 (11378)  49MB -
installed:          5.7-c62733b              (23893) 138MB in-cohort

Ah ok that missing machine will be the problem then.

Are you happy to remove it and its instances permanently from the cluster?

If so see How to manage a cluster - LXD documentation

Yes, but “lxc cluster remove --force kube02” just keeps running without any output.

I see, yes that makes sense as LXD isn’t able to start properly.

Please can you show the output of sudo lxd sql global 'select * from nodes'?

+----+--------+-------------+---------------------+--------+----------------+--------------------------------+-------+------+-------------------+
| id |  name  | description |       address       | schema | api_extensions |           heartbeat            | state | arch | failure_domain_id |
+----+--------+-------------+---------------------+--------+----------------+--------------------------------+-------+------+-------------------+
| 1  | kube01 |             | 192.168.178.70:8443 | 66     | 332            | 2022-10-31T17:36:22.632751814Z | 0     | 4    |                   |
| 2  | kube02 |             | 192.168.178.78:8443 | 66     | 327            | 2022-10-17T21:04:55.763750682Z | 0     | 4    |                   |
| 3  | kube05 |             | 192.168.178.74:8443 | 66     | 332            | 2022-10-31T17:36:24.060297351Z | 0     | 4    |                   |
| 4  | kube06 |             | 192.168.178.69:8443 | 66     | 332            | 2022-10-30T18:41:40.399802969Z | 0     | 4    |                   |
| 5  | kube04 |             | 192.168.178.73:8443 | 66     | 332            | 2022-10-31T17:36:23.553059306Z | 0     | 4    |                   |
| 6  | kube03 |             | 192.168.178.79:8443 | 66     | 332            | 2022-10-31T17:36:25.58072139Z  | 0     | 4    |                   |
+----+--------+-------------+---------------------+--------+----------------+--------------------------------+-------+------+-------------------+
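
In the table above, kube02 still reports api_extensions 327 while every other member reports 332. As far as I understand it, that stale row is what the “Wait for other cluster nodes to upgrade their versions” warning refers to, since the dead member can never catch up. Below is a rough sketch for spotting such rows; the function name find_lagging_members is made up for this example, and it only parses the text table that lxd sql global prints.

```shell
# Hypothetical helper: read the `lxd sql global "SELECT * FROM nodes"` table on stdin
# and print members whose api_extensions value is below the highest value seen.
find_lagging_members() {
    awk -F'|' '
        # Data rows start with "|" and carry a numeric id in the first column.
        /^\|/ && $2 ~ /^ *[0-9]+ *$/ {
            name = $3; ext = $7
            gsub(/ /, "", name); gsub(/ /, "", ext)
            names[++n] = name; exts[n] = ext + 0
            if (exts[n] > max) max = exts[n]
        }
        END {
            for (i = 1; i <= n; i++)
                if (exts[i] < max)
                    printf "%s api_extensions=%d (cluster max %d)\n", names[i], exts[i], max
        }
    '
}
```

Usage would be something like sudo lxd sql global "SELECT * FROM nodes" | find_lagging_members. For the table in this thread it prints a single line for kube02.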

Thanks, can you try running:

sudo lxd sql global 'DELETE FROM nodes WHERE name = "kube02"'

This should then allow the cluster to recover.
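
Since deleting a row from the global database is hard to undo, it may be worth looking at the row before removing it. The following is only a sketch of that idea: remove_dead_member and DRY_RUN are hypothetical names, and the only real commands involved are the lxd sql global calls already used in this thread.

```shell
# Hypothetical wrapper: show the member's row, then delete it from the nodes table.
# By default it only prints the commands (DRY_RUN=1).
remove_dead_member() {
    member="$1"
    # Accept only simple hostnames so the value can be embedded in the SQL text.
    case "$member" in
        ''|*[!A-Za-z0-9._-]*) echo "invalid member name: $member" >&2; return 1 ;;
    esac
    select_sql="SELECT id, name, address, heartbeat FROM nodes WHERE name = '$member'"
    delete_sql="DELETE FROM nodes WHERE name = '$member'"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # printf reuses the format once per argument, so this prints two lines.
        printf 'sudo lxd sql global "%s"\n' "$select_sql" "$delete_sql"
    else
        sudo lxd sql global "$select_sql" && sudo lxd sql global "$delete_sql"
    fi
}
```

remove_dead_member kube02 prints the two statements for review; DRY_RUN=0 remove_dead_member kube02 would run them.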

Great, thanks a lot! That recovered the cluster!

If I want to change the IP addresses of some cluster members, would running the following for each node

sudo lxd sql global "UPDATE nodes SET address='10.100.100.2' WHERE id=1"

be the only config I would need to change?

No, address changes are quite a bit more difficult to do as they’re not just stored in the LXD database but also in the local database and in the dqlite headers.

How to recover a cluster - LXD documentation covers what needs to be done for that part.
