I suspect this is due to latency (because it’s only affecting the members with 25ms latency); however, it is possibly being exacerbated by the code used in the heartbeat handler.
This line is where the function that logs “Matched trusted cert” is called from.
And this is the line that logs: “Replace current raft nodes with…”
So the intervening lines are where the latency is being introduced:
And this part has caught my eye as being potentially slow when being done across a WAN link with a cluster that has lots of members.
Each one of the cluster members then causes a remote transaction to the leader to be started in order to get the node info.
It feels like this could be inefficient when being run from a remote location with lots of members to go through.
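To put rough numbers on that concern, here is a back-of-the-envelope sketch in Python. It is not LXD code: the function names, the member counts, and the assumption of one round trip per remote transaction are all hypothetical, and the only figure taken from this thread is the 25ms latency.

```python
# Rough cost model, NOT LXD's implementation. All names are hypothetical.

def sequential_cost_ms(members: int, rtt_ms: float, round_trips_per_tx: int = 1) -> float:
    """Time spent if each member triggers its own remote transaction to the leader."""
    return members * round_trips_per_tx * rtt_ms


def batched_cost_ms(rtt_ms: float, round_trips_per_tx: int = 1) -> float:
    """Time spent if all node info were fetched in a single remote transaction."""
    return round_trips_per_tx * rtt_ms


if __name__ == "__main__":
    rtt = 25.0  # ms, the latency of the affected members
    for members in (3, 10, 40):  # assumed cluster sizes, for illustration only
        print(members, sequential_cost_ms(members, rtt), batched_cost_ms(rtt))
```

Under this model the sequential cost grows linearly with the number of members, while a single batched lookup stays at roughly one round trip, which is why the per-member pattern looks suspicious over a WAN link.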
It’s not supported right now to make a specific node the raft leader. What you can do is shut down nodes one by one until the only voters left are the servers you want your raft leader to be chosen from, but that’s not really ideal. You could also try to reconfigure the cluster, although that’s a manual operation too. I think @masnax has done some work around this.
If you do consider switching to the edge snap channel, be aware that if there are any DB changes, you won’t be able to downgrade back to the latest stable release.
You are likely being caught out by the rate limiting that the snap store applies to the LXD package whenever we do an LTS release or change the LTS package (unrelated to what you’re installing). This is due to capacity issues in the snap store and the large number of updates such a change triggers.
This has been happening over the last couple of days because the LTS package was changed to core20; see Weekly status #215.
You might need to retry a few times.
@stgraber was discussing with the snap store team whether they can apply the rate limiting only to periodic refreshes and not to manually started commands, but I don’t know if anything came of that.
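For the “retry a few times” suggestion, a small wrapper with exponential backoff saves hammering the store by hand. This is only a sketch: `retry` and `run_snap` are hypothetical helper names, the delays are arbitrary, and the `snap refresh lxd --channel=latest/edge` command in the usage comment is an assumed example, so substitute whatever command was being rate limited for you.

```python
import subprocess
import time


def retry(action, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call action() until it returns True, doubling the wait between tries.

    Returns the number of the attempt that succeeded, or None if all failed.
    """
    for attempt in range(attempts):
        if action():
            return attempt + 1
        if attempt < attempts - 1:  # no point sleeping after the final failure
            sleep(base_delay * 2 ** attempt)
    return None


def run_snap(args):
    """Run a snap command; True when it exits with status 0."""
    return subprocess.run(["snap", *args]).returncode == 0


# Example usage (assumed command, adjust to your case):
# retry(lambda: run_snap(["refresh", "lxd", "--channel=latest/edge"]))
```

The backoff is there so repeated attempts don’t add to the load that is causing the rate limiting in the first place.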
The reason for the delay is that our automated tests have been detecting an intermittent issue with LVM since the 11th, and we are still trying to figure out what is causing it. This is holding up edge builds.
Excellent. I would suggest not staying on latest/edge for too long though, so as soon as that rev is available in latest/stable, switch back so you don’t get other breakages.