LXD cluster failed

Hello everyone,

Yesterday I created a cluster from 3 machines. Today I logged in and the cluster is not responsive.

Only test02 is somewhat responsive; I managed to run lxc cluster list on it.

Looks like the test02 server has not been upgraded to the same version as the rest of them.
The other members will be waiting for it to reach the same version.

Can you show snap info lxd on all 3 servers?

One server failed and the whole cluster is down. How can we prevent this?

test01:

snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: today at 06:21 UTC
channels:
latest/stable: 4.19 2021-10-06 (21624) 76MB -
latest/candidate: 4.19 2021-10-14 (21723) 76MB -
latest/beta: ↑
latest/edge: git-1fe2751 2021-10-15 (21730) 76MB -
4.19/stable: 4.19 2021-10-06 (21624) 76MB -
4.19/candidate: 4.19 2021-10-14 (21723) 76MB -
4.19/beta: ↑
4.19/edge: ↑
4.18/stable: 4.18 2021-09-13 (21497) 75MB -
4.18/candidate: 4.18 2021-09-15 (21554) 75MB -
4.18/beta: ↑
4.18/edge: ↑
4.0/stable: 4.0.7 2021-10-04 (21545) 70MB -
4.0/candidate: 4.0.7 2021-10-04 (21545) 70MB -
4.0/beta: ↑
4.0/edge: git-298853f 2021-10-12 (21700) 70MB -
3.0/stable: 3.0.4 2019-10-10 (11348) 55MB -
3.0/candidate: 3.0.4 2019-10-10 (11348) 55MB -
3.0/beta: ↑
3.0/edge: git-81b81b9 2019-10-10 (11362) 55MB -
2.0/stable: 2.0.12 2020-08-18 (16879) 38MB -
2.0/candidate: 2.0.12 2021-03-22 (19859) 39MB -
2.0/beta: ↑
2.0/edge: git-82c7d62 2021-03-22 (19857) 39MB -
installed: 4.19 (21723) 76MB -

test02:

snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: yesterday at 08:40 UTC
channels:
latest/stable: 4.19 2021-10-06 (21624) 76MB -
latest/candidate: 4.19 2021-10-14 (21723) 76MB -
latest/beta: ↑
latest/edge: git-1fe2751 2021-10-15 (21730) 76MB -
4.19/stable: 4.19 2021-10-06 (21624) 76MB -
4.19/candidate: 4.19 2021-10-14 (21723) 76MB -
4.19/beta: ↑
4.19/edge: ↑
4.18/stable: 4.18 2021-09-13 (21497) 75MB -
4.18/candidate: 4.18 2021-09-15 (21554) 75MB -
4.18/beta: ↑
4.18/edge: ↑
4.0/stable: 4.0.7 2021-10-04 (21545) 70MB -
4.0/candidate: 4.0.7 2021-10-04 (21545) 70MB -
4.0/beta: ↑
4.0/edge: git-298853f 2021-10-12 (21700) 70MB -
3.0/stable: 3.0.4 2019-10-10 (11348) 55MB -
3.0/candidate: 3.0.4 2019-10-10 (11348) 55MB -
3.0/beta: ↑
3.0/edge: git-81b81b9 2019-10-10 (11362) 55MB -
2.0/stable: 2.0.12 2020-08-18 (16879) 38MB -
2.0/candidate: 2.0.12 2021-03-22 (19859) 39MB -
2.0/beta: ↑
2.0/edge: git-82c7d62 2021-03-22 (19857) 39MB -
installed: 4.19 (21624) 76MB -

test03:

snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: today at 06:21 UTC
channels:
latest/stable: 4.19 2021-10-06 (21624) 76MB -
latest/candidate: 4.19 2021-10-14 (21723) 76MB -
latest/beta: ↑
latest/edge: git-1fe2751 2021-10-15 (21730) 76MB -
4.19/stable: 4.19 2021-10-06 (21624) 76MB -
4.19/candidate: 4.19 2021-10-14 (21723) 76MB -
4.19/beta: ↑
4.19/edge: ↑
4.18/stable: 4.18 2021-09-13 (21497) 75MB -
4.18/candidate: 4.18 2021-09-15 (21554) 75MB -
4.18/beta: ↑
4.18/edge: ↑
4.0/stable: 4.0.7 2021-10-04 (21545) 70MB -
4.0/candidate: 4.0.7 2021-10-04 (21545) 70MB -
4.0/beta: ↑
4.0/edge: git-298853f 2021-10-12 (21700) 70MB -
3.0/stable: 3.0.4 2019-10-10 (11348) 55MB -
3.0/candidate: 3.0.4 2019-10-10 (11348) 55MB -
3.0/beta: ↑
3.0/edge: git-81b81b9 2019-10-10 (11362) 55MB -
2.0/stable: 2.0.12 2020-08-18 (16879) 38MB -
2.0/candidate: 2.0.12 2021-03-22 (19859) 39MB -
2.0/beta: ↑
2.0/edge: git-82c7d62 2021-03-22 (19857) 39MB -
installed: 4.19 (21723) 76MB -

Looks like test02 isn’t running the latest revision of the snap (installed 4.19 (21624), while test01 and test03 are on (21723)), and I understand that all members of the cluster should be running the same revision.

@stgraber is this correct, and if so, shouldn’t the snap service now deliver the same revision to all cluster members when they refresh?

test01:

type: snapd
snap-id: PMrrV4ml8uWuEUDBT8dSGnKUYbevVhc4
tracking: latest/stable
refresh-date: 2 days ago, at 07:05 UTC
channels:
latest/stable: 2.51.7 2021-09-23 (13170) 33MB -
latest/candidate: 2.52.1 2021-10-11 (13640) 34MB -
latest/beta: 2.52.1 2021-10-05 (13640) 34MB -
latest/edge: 2.52.1+git1353.g6d1a7e7 2021-10-15 (13739) 43MB -
installed: 2.52 (13270) 33MB snapd

test02:

type: snapd
snap-id: PMrrV4ml8uWuEUDBT8dSGnKUYbevVhc4
tracking: latest/stable
refresh-date: 2 days ago, at 07:01 UTC
channels:
latest/stable: 2.51.7 2021-09-23 (13170) 33MB -
latest/candidate: 2.52.1 2021-10-11 (13640) 34MB -
latest/beta: 2.52.1 2021-10-05 (13640) 34MB -
latest/edge: 2.52.1+git1353.g6d1a7e7 2021-10-15 (13739) 43MB -
installed: 2.52 (13270) 33MB snapd

test03:

type: snapd
snap-id: PMrrV4ml8uWuEUDBT8dSGnKUYbevVhc4
tracking: latest/stable
refresh-date: 2 days ago, at 07:11 UTC
channels:
latest/stable: 2.51.7 2021-09-23 (13170) 33MB -
latest/candidate: 2.52.1 2021-10-11 (13640) 34MB -
latest/beta: 2.52.1 2021-10-05 (13640) 34MB -
latest/edge: 2.52.1+git1353.g6d1a7e7 2021-10-15 (13739) 43MB -
installed: 2.52 (13270) 33MB snapd

Try running snap switch lxd --cohort=+ on all systems, followed by snap refresh lxd.
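
That is, on each of test01, test02 and test03:

snap switch lxd --cohort=+
snap refresh lxd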

test01:

error: snap “lxd” has “auto-refresh” change in progress

test02:

“lxd” switched to the “+” cohort

test03:

error: snap “lxd” has “refresh-snap” change in progress
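
The “change in progress” errors presumably just mean snapd was already in the middle of a refresh on test01 and test03. For reference, a pending change can be listed and followed with plain snapd commands (nothing LXD-specific):

snap changes lxd
snap watch <change-id>

where <change-id> is the Id column printed by snap changes.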

After updating lxd on test02, I was able to execute the commands above successfully.

So, how can we prevent one faulty server from breaking the whole cluster?

And what does “lxd” switched to the “+” cohort actually do?

Now that you’ve manually done this, you should be good moving forward, though I’m confused that the cohort key wasn’t properly set already.

Basically, we roll out fixed versions of LXD progressively to avoid overloading the store. Because all clustered servers should be on the same version, there is a special key (+) which can be passed to the store to tell it that, no matter the state of the phased rollout, we want all those systems to get the revision being pushed out.
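
Once every member has joined the + cohort and refreshed, they should all end up on the same revision. A quick way to confirm that (again plain snapd, nothing LXD-specific) is to compare the Rev column of:

snap list lxd

on each member; test01, test02 and test03 should all report the same revision.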

Got it. Thank you.

How do the other members get added to the cohort? Is this something we could apply via a cluster notification when the first member upgrades?

We’re currently doing it in two places. On daemon startup, we check for cluster.crt and if found, attempt to set the cohort key.

Then we again attempt to set the cohort key just before calling snap refresh in the cluster notification hook.
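
Roughly, and only as an illustration of the behaviour described above (this is not the actual packaging code, and the cluster.crt path is an assumption based on the snap’s data directory), the logic looks like:

# On daemon startup: if this member is clustered, try to join the "+" cohort.
if [ -e /var/snap/lxd/common/lxd/cluster.crt ]; then
    snap switch lxd --cohort=+ || true
fi

# In the cluster notification hook, just before refreshing:
snap switch lxd --cohort=+ || true
snap refresh lxd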

Hrm, strange. So either setting the cohort key failed, the notification never came, or something unset it again.