tomp (Thomas Parrott) · September 14, 2021, 3:17pm · #22
So that does appear to be the difference; here is the output from a member that is reachable:
t=2021-09-14T15:16:08+0000 lvl=dbug msg="Matched trusted cert" fingerprint=65bf8a0cdcf8118bb62b09e541dd6f1cd36269804294602ce95166aa687eb1a6 subject="CN=root@hetzner-04,O=linuxcontainers.org"
t=2021-09-14T15:16:08+0000 lvl=dbug msg="Replace current raft nodes with
There is not a long time difference between those 2 log lines, suggesting that the heartbeat is being responded to in <1s.
Is it possible to manually promote e.g. app-hel-phys-1 as a leader? The hetzner-* servers are due to be shut down soon.
tomp (Thomas Parrott) · September 14, 2021, 3:27pm · #24
I suspect this is due to latency (because it’s only affecting the members with 25ms latency); however it is possibly being exacerbated by the code used in the heartbeat handler.
This line is where the function that logs “Matched trusted cert” is called from.
// for /internal/db, which handle respectively raft and gRPC-SQL requests.
//
// These handlers might return 404, either because this LXD node is a
// non-clustered node not available over the network or because it is not a
// database node part of the dqlite cluster.
func (g *Gateway) HandlerFuncs(nodeRefreshTask func(*APIHeartbeat), trustedCerts func() map[db.CertificateType]map[string]x509.Certificate) map[string]http.HandlerFunc {
    database := func(w http.ResponseWriter, r *http.Request) {
        g.lock.RLock()
        defer g.lock.RUnlock()
        if !tlsCheckCert(r, g.networkCert, g.serverCert(), trustedCerts()) {
            http.Error(w, "403 invalid client certificate", http.StatusForbidden)
            return
        }
        // Compare the dqlite version of the connecting client
        // with our own one.
        versionHeader := r.Header.Get("X-Dqlite-Version")
        if versionHeader == "" {
            // No version header means an old pre dqlite 1.0 client.
            versionHeader = "0"
And this is the line that logs: “Replace current raft nodes with…”
                Role: db.RaftRole(node.RaftRole),
            },
            Name: nodeInfo.Name,
        })
    }
}
// Check we have been sent at least 1 raft node before wiping our set.
if len(raftNodes) > 0 {
    // Accept Raft node updates from any node (joining nodes just send raft nodes heartbeat data).
    logger.Debugf("Replace current raft nodes with %+v", raftNodes)
    err = g.db.Transaction(func(tx *db.NodeTx) error {
        return tx.ReplaceRaftNodes(raftNodes)
    })
    if err != nil {
        logger.Error("Error updating raft members", log.Ctx{"err": err})
        http.Error(w, "500 failed to update raft nodes", http.StatusInternalServerError)
        return
    }
    // If there is an ongoing heartbeat round (and by implication this is the leader),
So the intervening lines are where the latency is being introduced:
// Compare the dqlite version of the connecting client
// with our own one.
versionHeader := r.Header.Get("X-Dqlite-Version")
if versionHeader == "" {
    // No version header means an old pre dqlite 1.0 client.
    versionHeader = "0"
}
version, err := strconv.Atoi(versionHeader)
if err != nil {
    http.Error(w, "400 invalid dqlite version", http.StatusBadRequest)
    return
}
if version != dqliteVersion {
    if version > dqliteVersion {
        if !g.upgradeTriggered {
            err = triggerUpdate()
            if err == nil {
                g.upgradeTriggered = true
            }
        }
And this part has caught my eye as potentially slow when run across a WAN link with a cluster that has lots of members.
for _, node := range heartbeatData.Members {
    if node.RaftID > 0 {
        nodeInfo := db.NodeInfo{}
        if g.Cluster != nil {
            err = g.Cluster.Transaction(func(tx *db.ClusterTx) error {
                var err error
                nodeInfo, err = tx.GetNodeByAddress(node.Address)
                return err
            })
            if err != nil {
                logger.Warn("Failed to retrieve cluster member", log.Ctx{"err": err})
            }
        }
        raftNodes = append(raftNodes, db.RaftNode{
            NodeInfo: client.NodeInfo{
                ID: node.RaftID,
                Address: node.Address,
                Role: db.RaftRole(node.RaftRole),
            },
Each cluster member in the heartbeat payload then causes a remote transaction back to the leader to be started in order to get the node info.
It feels like this could be inefficient when run from a remote location with lots of members to go through.
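For illustration, here is a minimal sketch of the kind of batching I have in mind, not the actual fix: fetch all members once (assuming a helper along the lines of tx.GetNodes() that returns every row), index them by address, and then build the raft node list without any further round trips to the leader.

// Minimal sketch, not the real code: the helper name GetNodes() and the exact
// error handling are assumptions for illustration only.
var members []db.NodeInfo
err := g.Cluster.Transaction(func(tx *db.ClusterTx) error {
    var err error
    members, err = tx.GetNodes() // one query for all cluster members
    return err
})
if err != nil {
    logger.Warn("Failed to retrieve cluster members", log.Ctx{"err": err})
}
// Index the member names by address so the loop below stays purely local.
namesByAddress := make(map[string]string, len(members))
for _, member := range members {
    namesByAddress[member.Address] = member.Name
}
raftNodes := make([]db.RaftNode, 0, len(heartbeatData.Members))
for _, node := range heartbeatData.Members {
    if node.RaftID > 0 {
        raftNodes = append(raftNodes, db.RaftNode{
            NodeInfo: client.NodeInfo{
                ID:      node.RaftID,
                Address: node.Address,
                Role:    db.RaftRole(node.RaftRole),
            },
            Name: namesByAddress[node.Address], // no per-member remote transaction
        })
    }
}

The key point is that the number of queries back to the leader no longer grows with the number of cluster members.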
@mbordere @stgraber what do you think?
tomp (Thomas Parrott) · September 14, 2021, 3:30pm · #25
I’m not sure how to do that; @stgraber @mbordere do you know?
Although I’m not sure that would help, as it would likely just move the problem so that the hetzner-* members would be the ones failing heartbeats.
I think we should make the heartbeat process more efficient in high-latency scenarios.
Thanks for that!
Yep, I’m okay with that for now as we’re not going to be using the hetzner-* hosts for the time being.
Do you feel as if removing most of the hetzner-* hosts would help get the *-hel-* ones back fully online for now?
tomp (Thomas Parrott) · September 14, 2021, 3:47pm · #27
I found the recent commit that introduced this change:
lxc:master ← masnax:recovery/update-db · opened 06:43PM - 08 Sep 21 UTC
This PR adds the cluster member name field as a column to the `raft_nodes` local… table. Since the dqlite representation of a raft node is still limited to `ID,` `Address`, and `Role`, the `Name` is extracted from the global `nodes` database, if the global database is accessible. If not, the field will not be populated.
Also addressed a few small bugs:
* Typo in `RemoteRaftNode`, should be `RemoveRaftNode`
* If `patch.global.sql` exists already, append to it in `lxd cluster edit`
* Make `latest_segment` a comment instead of a yaml field to prevent any confusion with its value being different node-to-node.
Which is why it started happening in LXD 4.18.
CC @masnax
I’m working on a PR to fix this now.
Thanks! I’m unsure how the timelines generally work; roughly how long before it would be available in the snap as edge or beta?
Or do you have any thoughts on getting just the *-hel-* hosts working until release? Would removing the hetzner-* nodes help, in your opinion?
I take it you no longer need the SSH connection?
mbordere (Mathieu Bordere) · September 14, 2021, 4:00pm · #29
It’s not supported right now to make a specific node the raft leader. What you can do is shut down nodes one by one until the remaining voter servers are the ones you want the raft leader chosen from; it’s not really ideal. You could also try to reconfigure the cluster, but that’s a manual operation as well, though I think @masnax has done some work around this.
tomp (Thomas Parrott) · September 14, 2021, 4:34pm · #30
The PR that should fix this is here:
lxc:master ← tomponline:tp-cluster-heartbeat-handler · opened 04:31PM - 14 Sep 21 UTC
Fixes issue introduced by https://github.com/lxc/lxd/pull/9209/commits/1db065b06b2e1b16ea997927d158a4cae5162714 that was reported in https://discuss.linuxcontainers.org/t/heartbeat-timeouts-after-upgrade-to-4-18/12164
Performing one transaction and query per raft node in the heartbeat payload (in order to enrich the member name) was causing the heartbeat handler to take >1s in scenarios where the recipient cluster member was in a remote DC (so the queries back to the leader were slower) and in clusters with a larger number of raft members.
When the heartbeat handler takes more than 1s, the leader considers the member offline.
This PR instead adds the member name to the heartbeat payload and performs the enrichment in a single transaction and query on the leader. The name is then available to the recipient cluster members to populate their local raft nodes table as needed.
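To make the shape of that change concrete, here is a rough sketch (the type and field names are illustrative assumptions, not the exact code from the PR): the member name travels inside the heartbeat payload, so a recipient can populate its local raft_nodes table without querying back to the leader.

// Illustrative only: the real heartbeat payload type in LXD has more fields,
// and the field names here are assumptions rather than the exact ones used.
type heartbeatMember struct {
    Address  string // network address of the member
    RaftID   uint64 // dqlite raft ID, 0 if the member is not a raft node
    RaftRole int    // voter, stand-by or spare
    Name     string // cluster member name, filled in once by the leader per heartbeat round
}

The leader fills in Name with a single query per heartbeat round, and each recipient copies it straight into its local raft_nodes table, so the handler no longer issues per-member queries back over the WAN link.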
As this is a regression I would imagine @stgraber would cherry-pick it into the current release snap.
@stgraber Any idea of an ETA?
tomp (Thomas Parrott) · September 16, 2021, 12:08pm · #32
Did that fix it? Did you switch to the edge snap?
If you do consider switching to the edge snap channel, be aware that if there are any DB changes you won’t be able to downgrade back to the latest stable release.
> snap refresh --channel=latest/edge lxd
error: cannot refresh "lxd": unexpectedly empty response from the server (try again later)
Am I doing this right?
tomp (Thomas Parrott) · September 16, 2021, 1:29pm · #34
You are, yes.
You are likely being caught out by the rate limiting that the snap store applies to the LXD package whenever we do an LTS release or change the LTS package (unrelated to what you’re installing), due to capacity issues in the snap store and the large number of updates this triggers.
This has been happening over the last couple of days because the LTS package changed to core20; see Weekly status #215.
You might need to retry a few times.
@stgraber was discussing with the snap store team whether they can stop the rate limiting from affecting manually started commands (as opposed to periodic refreshes), but I don’t know if anything came of that.
Doesn’t seem to have helped. Just confirming that the change is in latest/edge: git-5530217 2021-09-14 (21550) 75MB?
tomp (Thomas Parrott) · September 17, 2021, 10:49am · #36
That looks too old; that commit is from the 11th.
committed 02:51AM - 11 Sep 21 UTC · Don't allow instance ipv{n}.address to be same as managed parent network
You need
committed 07:46PM - 14 Sep 21 UTC · Cluster: Fix slow heartbeat response due to multiple remote queries when populating raft node names
or later.
@stgraber any idea why the latest/edge snap is out of date?
tomp (Thomas Parrott) · September 17, 2021, 2:41pm · #37
The reason for the delay is that since the 11th our automated tests have been detecting an intermittent issue with LVM, which we are still trying to track down. This is holding up the edge builds.
tomp (Thomas Parrott) · September 17, 2021, 6:42pm · #38
Once this is merged it should unblock the edge snap builds:
lxc:master ← tomponline:tp-lvm-fix · opened 05:33PM - 17 Sep 21 UTC
Triggered by https://github.com/lxc/lxd/pull/9217
Can confirm the latest version works. Thanks!
tomp (Thomas Parrott) · September 20, 2021, 2:27pm · #40
Excellent. I would suggest not staying on latest/edge too long though, so as soon as that rev is available in latest/stable, switch back so you don’t get other breakages.
Yep, I’ve disabled auto-refresh for the time being, and will switch back to stable once it’s available.