LXD cluster failed

Hello everyone,

Yesterday I created a cluster from 3 machines. Today I logged in and the cluster is not responsive.

Only test02 is somewhat responsive; I managed to run lxc cluster list on it.

Looks like the test02 server has not been upgraded to the same version as the rest of them.
The other members will be waiting for it to reach the same version.

Can you show snap info lxd on all 3 servers?

One server failed and the whole cluster is down. How can we prevent this?

test01:

snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: today at 06:21 UTC
channels:
latest/stable: 4.19 2021-10-06 (21624) 76MB -
latest/candidate: 4.19 2021-10-14 (21723) 76MB -
latest/beta: ↑
latest/edge: git-1fe2751 2021-10-15 (21730) 76MB -
4.19/stable: 4.19 2021-10-06 (21624) 76MB -
4.19/candidate: 4.19 2021-10-14 (21723) 76MB -
4.19/beta: ↑
4.19/edge: ↑
4.18/stable: 4.18 2021-09-13 (21497) 75MB -
4.18/candidate: 4.18 2021-09-15 (21554) 75MB -
4.18/beta: ↑
4.18/edge: ↑
4.0/stable: 4.0.7 2021-10-04 (21545) 70MB -
4.0/candidate: 4.0.7 2021-10-04 (21545) 70MB -
4.0/beta: ↑
4.0/edge: git-298853f 2021-10-12 (21700) 70MB -
3.0/stable: 3.0.4 2019-10-10 (11348) 55MB -
3.0/candidate: 3.0.4 2019-10-10 (11348) 55MB -
3.0/beta: ↑
3.0/edge: git-81b81b9 2019-10-10 (11362) 55MB -
2.0/stable: 2.0.12 2020-08-18 (16879) 38MB -
2.0/candidate: 2.0.12 2021-03-22 (19859) 39MB -
2.0/beta: ↑
2.0/edge: git-82c7d62 2021-03-22 (19857) 39MB -
installed: 4.19 (21723) 76MB -

test02:

snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: yesterday at 08:40 UTC
channels:
latest/stable: 4.19 2021-10-06 (21624) 76MB -
latest/candidate: 4.19 2021-10-14 (21723) 76MB -
latest/beta: ↑
latest/edge: git-1fe2751 2021-10-15 (21730) 76MB -
4.19/stable: 4.19 2021-10-06 (21624) 76MB -
4.19/candidate: 4.19 2021-10-14 (21723) 76MB -
4.19/beta: ↑
4.19/edge: ↑
4.18/stable: 4.18 2021-09-13 (21497) 75MB -
4.18/candidate: 4.18 2021-09-15 (21554) 75MB -
4.18/beta: ↑
4.18/edge: ↑
4.0/stable: 4.0.7 2021-10-04 (21545) 70MB -
4.0/candidate: 4.0.7 2021-10-04 (21545) 70MB -
4.0/beta: ↑
4.0/edge: git-298853f 2021-10-12 (21700) 70MB -
3.0/stable: 3.0.4 2019-10-10 (11348) 55MB -
3.0/candidate: 3.0.4 2019-10-10 (11348) 55MB -
3.0/beta: ↑
3.0/edge: git-81b81b9 2019-10-10 (11362) 55MB -
2.0/stable: 2.0.12 2020-08-18 (16879) 38MB -
2.0/candidate: 2.0.12 2021-03-22 (19859) 39MB -
2.0/beta: ↑
2.0/edge: git-82c7d62 2021-03-22 (19857) 39MB -
installed: 4.19 (21624) 76MB -

test03:

snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: today at 06:21 UTC
channels:
latest/stable: 4.19 2021-10-06 (21624) 76MB -
latest/candidate: 4.19 2021-10-14 (21723) 76MB -
latest/beta: ↑
latest/edge: git-1fe2751 2021-10-15 (21730) 76MB -
4.19/stable: 4.19 2021-10-06 (21624) 76MB -
4.19/candidate: 4.19 2021-10-14 (21723) 76MB -
4.19/beta: ↑
4.19/edge: ↑
4.18/stable: 4.18 2021-09-13 (21497) 75MB -
4.18/candidate: 4.18 2021-09-15 (21554) 75MB -
4.18/beta: ↑
4.18/edge: ↑
4.0/stable: 4.0.7 2021-10-04 (21545) 70MB -
4.0/candidate: 4.0.7 2021-10-04 (21545) 70MB -
4.0/beta: ↑
4.0/edge: git-298853f 2021-10-12 (21700) 70MB -
3.0/stable: 3.0.4 2019-10-10 (11348) 55MB -
3.0/candidate: 3.0.4 2019-10-10 (11348) 55MB -
3.0/beta: ↑
3.0/edge: git-81b81b9 2019-10-10 (11362) 55MB -
2.0/stable: 2.0.12 2020-08-18 (16879) 38MB -
2.0/candidate: 2.0.12 2021-03-22 (19859) 39MB -
2.0/beta: ↑
2.0/edge: git-82c7d62 2021-03-22 (19857) 39MB -
installed: 4.19 (21723) 76MB -

Looks like test02 isn’t running the latest revision of the snap (installed 4.19 (21624), while test01 and test03 are on (21723)), and I understand that all members of the cluster should be running the same revision.

@stgraber is this correct, and if so, shouldn’t the snap service now deliver the same revision to all cluster members when they refresh?

test01:

type: snapd
snap-id: PMrrV4ml8uWuEUDBT8dSGnKUYbevVhc4
tracking: latest/stable
refresh-date: 2 days ago, at 07:05 UTC
channels:
latest/stable: 2.51.7 2021-09-23 (13170) 33MB -
latest/candidate: 2.52.1 2021-10-11 (13640) 34MB -
latest/beta: 2.52.1 2021-10-05 (13640) 34MB -
latest/edge: 2.52.1+git1353.g6d1a7e7 2021-10-15 (13739) 43MB -
installed: 2.52 (13270) 33MB snapd

test02:

type: snapd
snap-id: PMrrV4ml8uWuEUDBT8dSGnKUYbevVhc4
tracking: latest/stable
refresh-date: 2 days ago, at 07:01 UTC
channels:
latest/stable: 2.51.7 2021-09-23 (13170) 33MB -
latest/candidate: 2.52.1 2021-10-11 (13640) 34MB -
latest/beta: 2.52.1 2021-10-05 (13640) 34MB -
latest/edge: 2.52.1+git1353.g6d1a7e7 2021-10-15 (13739) 43MB -
installed: 2.52 (13270) 33MB snapd

test03:

type: snapd
snap-id: PMrrV4ml8uWuEUDBT8dSGnKUYbevVhc4
tracking: latest/stable
refresh-date: 2 days ago, at 07:11 UTC
channels:
latest/stable: 2.51.7 2021-09-23 (13170) 33MB -
latest/candidate: 2.52.1 2021-10-11 (13640) 34MB -
latest/beta: 2.52.1 2021-10-05 (13640) 34MB -
latest/edge: 2.52.1+git1353.g6d1a7e7 2021-10-15 (13739) 43MB -
installed: 2.52 (13270) 33MB snapd

Try running snap switch lxd --cohort=+ on all systems, followed by snap refresh lxd.
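
That is, on each of test01, test02 and test03:

snap switch lxd --cohort=+
snap refresh lxd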

test01:

error: snap “lxd” has “auto-refresh” change in progress

test02:

“lxd” switched to the “+” cohort

test03:

error: snap “lxd” has “refresh-snap” change in progress
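
The “change in progress” errors presumably just mean snapd was already in the middle of a refresh on test01 and test03. For reference, a pending change can be listed and followed with plain snapd commands (nothing LXD-specific):

snap changes lxd
snap watch <change-id>

where <change-id> is the Id column printed by snap changes.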

After updating lxd on test02, I was able to execute the commands above successfully.

So, how can we prevent one faulty server from breaking the whole cluster?

And what does “lxd” switched to the “+” cohort actually do?

Now that you’ve manually done this, you should be good moving forward, though I’m confused that the cohort key wasn’t properly set already.

Basically, we roll out fixed versions of LXD progressively to avoid overloading the store. Because all clustered servers should be on the same version, there is a special key (+) which can be passed to the store to tell it that, no matter the state of the phased rollout, we want all those systems to get the revision being pushed out.
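
Once every member has joined the + cohort and refreshed, they should all end up on the same revision. A quick way to confirm that (again plain snapd, nothing LXD-specific) is to compare the Rev column of:

snap list lxd

on each member; test01, test02 and test03 should all report the same revision.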

Got it. Thank you.

How do the other members get added to the cohort? Is this something we could apply via a cluster notification when the first member upgrades?

We’re currently doing it in two places. On daemon startup, we check for cluster.crt and if found, attempt to set the cohort key.

Then we again attempt to set the cohort key just before calling snap refresh in the cluster notification hook.
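
Roughly, and only as an illustration of the behaviour described above (this is not the actual packaging code, and the cluster.crt path is an assumption based on the snap’s data directory), the logic looks like:

# On daemon startup: if this member is clustered, try to join the "+" cohort.
if [ -e /var/snap/lxd/common/lxd/cluster.crt ]; then
    snap switch lxd --cohort=+ || true
fi

# In the cluster notification hook, just before refreshing:
snap switch lxd --cohort=+ || true
snap refresh lxd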

Hrm, strange. So either setting the cohort key failed, the notification never came, or something unset it again.