Heartbeat timeouts after upgrade to 4.18

So that does appear to be the difference. Here is the output from a member that is reachable:

t=2021-09-14T15:16:08+0000 lvl=dbug msg="Matched trusted cert" fingerprint=65bf8a0cdcf8118bb62b09e541dd6f1cd36269804294602ce95166aa687eb1a6 subject="CN=root@hetzner-04,O=linuxcontainers.org"
t=2021-09-14T15:16:08+0000 lvl=dbug msg="Replace current raft nodes with

There is not a long time difference between those 2 log lines, suggesting that the heartbeat is being responded to in <1s.
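For anyone wanting to repeat this check on their own logs, here is a minimal sketch that measures the gap between two log lines. It assumes the lines start with a `t=` timestamp in the format shown above; the helper names `log_time` and `gap_seconds` are made up for illustration and are not part of LXD.

```python
import re
from datetime import datetime

# Matches the leading "t=<timestamp>" field of a log line (format assumed
# from the two lines quoted above).
TS_RE = re.compile(r"^t=(\S+)")


def log_time(line: str) -> datetime:
    """Parse the leading t= timestamp of a log line."""
    match = TS_RE.match(line)
    if match is None:
        raise ValueError(f"no t= timestamp at start of line: {line!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S%z")


def gap_seconds(first: str, second: str) -> float:
    """Seconds elapsed between two log lines (second minus first)."""
    return (log_time(second) - log_time(first)).total_seconds()


# The two lines from the reachable member, shortened to the relevant fields.
matched = 't=2021-09-14T15:16:08+0000 lvl=dbug msg="Matched trusted cert"'
replaced = 't=2021-09-14T15:16:08+0000 lvl=dbug msg="Replace current raft nodes with'

print(gap_seconds(matched, replaced))  # prints 0.0
```

The timestamps only have one-second resolution, so a gap of 0 only tells us the handler finished in under a second, not exactly how fast it was.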

Is it possible to manually promote e.g. app-hel-phys-1 to leader? The hetzner-* servers are due to be shut down soon.

I suspect this is due to latency (because it’s only affecting the members with 25ms latency), however it is possibly being exacerbated by the code used in the heartbeat handler.

This line is where the function that logs “Matched trusted cert” is called from.

And this is the line that logs: “Replace current raft nodes with…”

So the intervening lines are where the latency is being introduced:

And this part has caught my eye as being potentially slow when being done across a WAN link with a cluster that has lots of members.

For each cluster member, a remote transaction to the leader is then started in order to get the node info.

It feels like this could be inefficient when being run from a remote location with lots of members to go through.
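As a rough back-of-envelope sketch of why this adds up over a WAN link: if every member costs at least one round trip to the leader, the time spent grows linearly with cluster size, whereas a single batched query would cost roughly one round trip regardless. The 25ms figure is the latency mentioned above; the member count of 20 and the one-round-trip-per-member figure are assumptions for illustration only, not measurements from this cluster, and the function names are hypothetical.

```python
def per_member_overhead_ms(members: int, rtt_ms: int, round_trips_per_member: int = 1) -> int:
    """Network time spent when each member triggers its own remote transaction."""
    return members * round_trips_per_member * rtt_ms


def batched_overhead_ms(rtt_ms: int, round_trips: int = 1) -> int:
    """Network time spent when all node info is fetched in one remote transaction."""
    return round_trips * rtt_ms


RTT_MS = 25    # latency to the affected members, as reported in this thread
MEMBERS = 20   # hypothetical cluster size, for illustration only

print(per_member_overhead_ms(MEMBERS, RTT_MS))  # prints 500
print(batched_overhead_ms(RTT_MS))              # prints 25
```

A real transaction likely needs more than one round trip (begin, query, commit), so the per-member figure is probably an underestimate.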

@mbordere @stgraber what do you think?

I’m not sure how to do that, @stgraber @mbordere do you know?

Although I’m not sure that would help, as it would likely just move the problem so that the “hetzner-*” members would be failing heartbeats instead.

We should make the heartbeat process more efficient in high latency scenarios I think.

Thanks for that!

Yep, I’m okay with that for now as we’re not going to be using the hetzner-* hosts for the time being.

Do you feel as if removing most of the hetzner-* hosts would help get the *-hel-* ones back fully online for now?

I found the recent commit that introduced this change:

Which is why it’s started happening in LXD 4.18.

CC @masnax

I’m working on a PR to fix this now.

Thanks! I’m unsure how the timelines generally work: roughly how long before it would be available as a snap on edge or beta?

Or do you have any thoughts on getting just the *-hel-* hosts working until release? Would removing the hetzner-* nodes help in your opinion?

I take it you no longer need the ssh connection?

It’s not supported right now to make a specific node the raft leader. What you can do is shut down nodes one by one until you are left only with voter servers that you would be happy to have as raft leader, but it’s not really ideal. You could also try to reconfigure the cluster. That’s a manual operation too, but I think @masnax has done some work around this.

The PR that should fix this is here:

As this is a regression I would imagine @stgraber would cherry-pick it into the current release snap.

@stgraber Any idea of an ETA?

Did that fix it? Did you switch to edge snap?

If you do consider switching to the edge snap channel, be aware that if there are any DB changes you won’t be able to downgrade back to the latest stable release.

> snap refresh --channel=latest/edge lxd
error: cannot refresh "lxd": unexpectedly empty response from the server (try again later)

Am I doing this right?

You are yes.

You are likely being caught out by the rate limiting that the snap store applies to the LXD package whenever we do an LTS release or change the LTS package (unrelated to what you’re installing). It is applied because of capacity issues in the snap store and the large number of updates this triggers.

This has been happening over the last couple of days due to the change of the LTS package to core20, see Weekly status #215.

You might need to retry a few times.

@stgraber was discussing with the snap store team whether they can restrict the rate limiting to periodic refreshes so that it doesn’t affect manually started commands, but I don’t know if anything came of that.

Doesn’t seem to have helped. Just confirming that the change is in latest/edge: git-5530217 2021-09-14 (21550) 75MB?

That looks too old; that commit is from the 11th.

You need

or later.

@stgraber any idea why latest/edge snap is out of date?

The reason for the delay is that our automated tests have been detecting an intermittent issue with LVM since the 11th, and we are still trying to figure out what is causing it. This is holding up edge builds.

Once this is merged it should unblock the edge snap builds

Can confirm latest version works. Thanks!

Excellent. I would suggest not staying on latest/edge too long though, so as soon as that rev is available in latest/stable, switch back so you don’t get other breakages.

Yep, I’ve disabled auto refresh for the time being, and will switch back to stable once available