tomp (Thomas Parrott) · September 14, 2021, 3:17pm · #22
So that does appear to be the difference; here is the output from a member that is reachable:
t=2021-09-14T15:16:08+0000 lvl=dbug msg="Matched trusted cert" fingerprint=65bf8a0cdcf8118bb62b09e541dd6f1cd36269804294602ce95166aa687eb1a6 subject="CN=root@hetzner-04,O=linuxcontainers.org"
t=2021-09-14T15:16:08+0000 lvl=dbug msg="Replace current raft nodes with
There is not a long time difference between those 2 log lines, suggesting that the heartbeat is being responded to in <1s.
Is it possible to manually promote e.g. app-hel-phys-1 as a leader? The hetzner-* servers are due to be shut down soon.
tomp (Thomas Parrott) · September 14, 2021, 3:27pm · #24
I suspect this is due to latency (because it’s only affecting the members with 25ms latency); however it is possibly being exacerbated by the code used in the heartbeat handler.
This line is where the function that logs “Matched trusted cert” is called from.
// for /internal/db, which handle respectively raft and gRPC-SQL requests.
//
// These handlers might return 404, either because this LXD node is a
// non-clustered node not available over the network or because it is not a
// database node part of the dqlite cluster.
func (g *Gateway) HandlerFuncs(nodeRefreshTask func(*APIHeartbeat), trustedCerts func() map[db.CertificateType]map[string]x509.Certificate) map[string]http.HandlerFunc {
    database := func(w http.ResponseWriter, r *http.Request) {
        g.lock.RLock()
        defer g.lock.RUnlock()
        if !tlsCheckCert(r, g.networkCert, g.serverCert(), trustedCerts()) {
            http.Error(w, "403 invalid client certificate", http.StatusForbidden)
            return
        }
        // Compare the dqlite version of the connecting client
        // with our own one.
        versionHeader := r.Header.Get("X-Dqlite-Version")
        if versionHeader == "" {
            // No version header means an old pre dqlite 1.0 client.
            versionHeader = "0"
And this is the line that logs: “Replace current raft nodes with…”
                Role: db.RaftRole(node.RaftRole),
            },
            Name: nodeInfo.Name,
        })
    }
}
// Check we have been sent at least 1 raft node before wiping our set.
if len(raftNodes) > 0 {
    // Accept Raft node updates from any node (joining nodes just send raft nodes heartbeat data).
    logger.Debugf("Replace current raft nodes with %+v", raftNodes)
    err = g.db.Transaction(func(tx *db.NodeTx) error {
        return tx.ReplaceRaftNodes(raftNodes)
    })
    if err != nil {
        logger.Error("Error updating raft members", log.Ctx{"err": err})
        http.Error(w, "500 failed to update raft nodes", http.StatusInternalServerError)
        return
    }
    // If there is an ongoing heartbeat round (and by implication this is the leader),
So the intervening lines are where the latency is being introduced:
// Compare the dqlite version of the connecting client
// with our own one.
versionHeader := r.Header.Get("X-Dqlite-Version")
if versionHeader == "" {
    // No version header means an old pre dqlite 1.0 client.
    versionHeader = "0"
}
version, err := strconv.Atoi(versionHeader)
if err != nil {
    http.Error(w, "400 invalid dqlite version", http.StatusBadRequest)
    return
}
if version != dqliteVersion {
    if version > dqliteVersion {
        if !g.upgradeTriggered {
            err = triggerUpdate()
            if err == nil {
                g.upgradeTriggered = true
            }
        }
And this part has caught my eye as potentially slow when run across a WAN link with a cluster that has lots of members.
for _, node := range heartbeatData.Members {
    if node.RaftID > 0 {
        nodeInfo := db.NodeInfo{}
        if g.Cluster != nil {
            err = g.Cluster.Transaction(func(tx *db.ClusterTx) error {
                var err error
                nodeInfo, err = tx.GetNodeByAddress(node.Address)
                return err
            })
            if err != nil {
                logger.Warn("Failed to retrieve cluster member", log.Ctx{"err": err})
            }
        }
        raftNodes = append(raftNodes, db.RaftNode{
            NodeInfo: client.NodeInfo{
                ID: node.RaftID,
                Address: node.Address,
                Role: db.RaftRole(node.RaftRole),
            },
Each cluster member in the heartbeat payload then causes a remote transaction back to the leader to be started in order to get the node info.
It feels like this could be inefficient when run from a remote location with lots of members to go through.
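For illustration, here is a minimal sketch of the kind of batching I have in mind, not the actual fix: fetch all members once (assuming a helper along the lines of tx.GetNodes() that returns every row), index them by address, and then build the raft node list without any further round trips to the leader.

// Minimal sketch, not the real code: the helper name GetNodes() and the exact
// error handling are assumptions for illustration only.
var members []db.NodeInfo
err := g.Cluster.Transaction(func(tx *db.ClusterTx) error {
    var err error
    members, err = tx.GetNodes() // one query for all cluster members
    return err
})
if err != nil {
    logger.Warn("Failed to retrieve cluster members", log.Ctx{"err": err})
}
// Index the member names by address so the loop below stays purely local.
namesByAddress := make(map[string]string, len(members))
for _, member := range members {
    namesByAddress[member.Address] = member.Name
}
raftNodes := make([]db.RaftNode, 0, len(heartbeatData.Members))
for _, node := range heartbeatData.Members {
    if node.RaftID > 0 {
        raftNodes = append(raftNodes, db.RaftNode{
            NodeInfo: client.NodeInfo{
                ID:      node.RaftID,
                Address: node.Address,
                Role:    db.RaftRole(node.RaftRole),
            },
            Name: namesByAddress[node.Address], // no per-member remote transaction
        })
    }
}

The key point is that the number of queries back to the leader no longer grows with the number of cluster members.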
@mbordere @stgraber what do you think?
tomp (Thomas Parrott) · September 14, 2021, 3:30pm · #25
I’m not sure how to do that; @stgraber @mbordere do you know?
Although I’m not sure that would help, as it would likely just move the problem so that the hetzner-* members would be the ones failing heartbeats.
I think we should make the heartbeat process more efficient in high-latency scenarios.
Thanks for that!
Yep, I’m okay with that for now as we’re not going to be using the hetzner-* hosts for the time being.
Do you feel as if removing most of the hetzner-* hosts would help get the *-hel-* ones back fully online for now?
tomp (Thomas Parrott) · September 14, 2021, 3:47pm · #27
I found the recent commit that introduced this change:
lxc:master ← masnax:recovery/update-db · opened 06:43PM - 08 Sep 21 UTC
This PR adds the cluster member name field as a column to the `raft_nodes` local… table. Since the dqlite representation of a raft node is still limited to `ID,` `Address`, and `Role`, the `Name` is extracted from the global `nodes` database, if the global database is accessible. If not, the field will not be populated.
Also addressed a few small bugs:
* Typo in `RemoteRaftNode`, should be `RemoveRaftNode`
* If `patch.global.sql` exists already, append to it in `lxd cluster edit`
* Make `latest_segment` a comment instead of a yaml field to prevent any confusion with its value being different node-to-node.
Which is why it started happening in LXD 4.18.
CC @masnax
I’m working on a PR to fix this now.
Thanks! I’m unsure how the timelines generally work; roughly how long before it would be available in the snap as edge or beta?
Or do you have any thoughts on getting just the *-hel-* hosts working until release? Would removing the hetzner-* nodes help, in your opinion?
I take it you no longer need the SSH connection?
mbordere (Mathieu Bordere) · September 14, 2021, 4:00pm · #29
It’s not supported right now to make a specific node the raft leader. What you can do is shut down nodes one by one until the remaining voter servers are the ones you want the raft leader chosen from; it’s not really ideal. You could also try to reconfigure the cluster, but that’s a manual operation as well, though I think @masnax has done some work around this.
tomp (Thomas Parrott) · September 14, 2021, 4:34pm · #30
The PR that should fix this is here:
lxc:master ← tomponline:tp-cluster-heartbeat-handler · opened 04:31PM - 14 Sep 21 UTC
Fixes issue introduced by https://github.com/lxc/lxd/pull/9209/commits/1db065b06b2e1b16ea997927d158a4cae5162714 that was reported in https://discuss.linuxcontainers.org/t/heartbeat-timeouts-after-upgrade-to-4-18/12164
Performing one transaction and query per raft node in the heartbeat payload (in order to enrich the member name) was causing the heartbeat handler to take >1s in scenarios where the recipient cluster member was in a remote DC (so the queries back to the leader were slower) and in clusters with a larger number of raft members.
When the heartbeat handler takes more than 1s, the leader considers the member offline.
This PR instead adds the member name to the heartbeat payload and performs the enrichment in a single transaction and query on the leader. The name is then available to the recipient cluster members to populate their local raft nodes table as needed.
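To make the shape of that change concrete, here is a rough sketch (the type and field names are illustrative assumptions, not the exact code from the PR): the member name travels inside the heartbeat payload, so a recipient can populate its local raft_nodes table without querying back to the leader.

// Illustrative only: the real heartbeat payload type in LXD has more fields,
// and the field names here are assumptions rather than the exact ones used.
type heartbeatMember struct {
    Address  string // network address of the member
    RaftID   uint64 // dqlite raft ID, 0 if the member is not a raft node
    RaftRole int    // voter, stand-by or spare
    Name     string // cluster member name, filled in once by the leader per heartbeat round
}

The leader fills in Name with a single query per heartbeat round, and each recipient copies it straight into its local raft_nodes table, so the handler no longer issues per-member queries back over the WAN link.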
As this is a regression I would imagine @stgraber would cherry-pick it into the current release snap.
@stgraber Any idea of an ETA?
tomp (Thomas Parrott) · September 16, 2021, 12:08pm · #32
Did that fix it? Did you switch to the edge snap?
If you do consider switching to the edge snap channel, be aware that if there are any DB changes you won’t be able to downgrade back to the latest stable release.
> snap refresh --channel=latest/edge lxd
error: cannot refresh "lxd": unexpectedly empty response from the server (try again later)
Am I doing this right?
tomp (Thomas Parrott) · September 16, 2021, 1:29pm · #34
You are, yes.
You are likely being caught out by the rate limiting that the snap store applies to the LXD package whenever we do an LTS release or change the LTS package (unrelated to what you’re installing), due to capacity issues in the snap store and the large number of updates this triggers.
This has been happening over the last couple of days because the LTS package changed to core20; see Weekly status #215.
You might need to retry a few times.
@stgraber was discussing with the snap store team whether they can stop the rate limiting from affecting manually started commands (as opposed to periodic refreshes), but I don’t know if anything came of that.
Doesn’t seem to have helped. Just confirming that the change is in latest/edge: git-5530217 2021-09-14 (21550) 75MB?
tomp (Thomas Parrott) · September 17, 2021, 10:49am · #36
That looks too old; that commit is from the 11th.
committed 02:51AM - 11 Sep 21 UTC · Don't allow instance ipv{n}.address to be same as managed parent network
You need
committed 07:46PM - 14 Sep 21 UTC · Cluster: Fix slow heartbeat response due to multiple remote queries when populating raft node names
or later.
@stgraber any idea why the latest/edge snap is out of date?
tomp (Thomas Parrott) · September 17, 2021, 2:41pm · #37
The reason for the delay is that since the 11th our automated tests have been detecting an intermittent issue with LVM, which we are still trying to track down. This is holding up the edge builds.
tomp (Thomas Parrott) · September 17, 2021, 6:42pm · #38
Once this is merged it should unblock the edge snap builds:
lxc:master ← tomponline:tp-lvm-fix · opened 05:33PM - 17 Sep 21 UTC
Triggered by https://github.com/lxc/lxd/pull/9217
Can confirm the latest version works. Thanks!
tomp (Thomas Parrott) · September 20, 2021, 2:27pm · #40
Excellent. I would suggest not staying on latest/edge too long though, so as soon as that rev is available in latest/stable, switch back so you don’t get other breakages.
Yep, I’ve disabled auto-refresh for the time being, and will switch back to stable once it’s available.