Cluster node appears offline after upgrade to 3.15

That’s “expected” given the issue. We’re effectively not finding a match for the node due to the ID mismatch, so you’re getting the 0 date timestamp rather than a valid value.
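
To make the mechanism concrete, here is a minimal Go sketch of how a failed ID lookup can surface as a zero date and then as an "offline" node. The names `nodeInfo`, `lastHeartbeat`, `isOffline` and `demo`, and the 20-second threshold, are assumptions for illustration only; this is not LXD's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// nodeInfo is a hypothetical stand-in for a row in the nodes table.
type nodeInfo struct {
	ID        int64
	Heartbeat time.Time
}

// lastHeartbeat looks a node up by ID. When nothing matches (for example
// because a raft_nodes id was used where a nodes id was expected), it
// returns the zero time.Time value.
func lastHeartbeat(nodes []nodeInfo, id int64) time.Time {
	for _, n := range nodes {
		if n.ID == id {
			return n.Heartbeat
		}
	}
	return time.Time{}
}

// isOffline reports whether the last heartbeat is older than threshold.
func isOffline(heartbeat, now time.Time, threshold time.Duration) bool {
	return now.Sub(heartbeat) > threshold
}

// demo runs one matching and one mismatching lookup and returns the
// offline status of each, plus the printed form of the missed lookup.
func demo() (hitOffline bool, missPrinted string, missOffline bool) {
	now := time.Date(2019, 7, 18, 3, 11, 50, 0, time.UTC)
	nodes := []nodeInfo{{ID: 1, Heartbeat: now.Add(-5 * time.Second)}}
	threshold := 20 * time.Second // assumed value for this sketch

	hit := lastHeartbeat(nodes, 1)  // ids agree
	miss := lastHeartbeat(nodes, 3) // ids disagree, no row found
	return isOffline(hit, now, threshold), miss.String(), isOffline(miss, now, threshold)
}

func main() {
	hitOffline, missPrinted, missOffline := demo()
	fmt.Println("matching id offline:", hitOffline)
	fmt.Println("mismatched id heartbeat:", missPrinted)
	fmt.Println("mismatched id offline:", missOffline)
}
```

In this sketch the mismatched lookup yields Go's zero time, which prints as the same 0001-01-01 date, and any threshold comparison against it reports the node as offline.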

This might be interesting: the logs say that it thinks the offline node is the leader, and that node is the one with the 0 date: t=2019-07-18T03:11:50-0500 lvl=warn msg="Leader has not initiated heartbeat since '0001-01-01 00:00:00 +0000 UTC', doing initial heartbeat rounds"

Does it show that repeatedly? It may well be that this node became the leader and so did an initial heartbeat round.

Negative. It goes on with the "excluding offline node" messages, and the node being excluded is the offline leader.

Perhaps around here is the culprit?

Line 206 in lxd/cluster/heartbeat.go
// Replace the local raft_nodes table immediately because it
// might miss a row containing ourselves, since we might have
// been elected leader before the former leader had chance to
// send us a fresh update through the heartbeat pool.
logger.Debugf("Heartbeat updating local raft nodes to %+v", raftNodes)
err = g.db.Transaction(func(tx *db.NodeTx) error {
	return tx.RaftNodesReplace(raftNodes)
})
if err != nil {
	logger.Warnf("Failed to replace local raft nodes: %v", err)
	return
}

We have a fix which will be rolling out in the next couple of hours.

In the stable channel now.

Those affected should:

  • snap refresh lxd
  • lxc config unset cluster.offline_threshold

And check that everything is behaving again.

Thank you! I will test it tomorrow and let you know.

Heartbeats are now in sync; however, the raft_nodes ids still don’t match the node ids.

Should I try to patch again or just leave it?

Everything is operational now, and I did the following:

  1. “lxc config unset cluster.offline_threshold” to undo the setting from yesterday.
  2. The problem reported earlier in this thread (Cluster node appears offline after upgrade to 3.15) still occurs: the raft ids differ from the node ids.

Okay, so lxc cluster list says everything is offline after unsetting offline_threshold?

The ids will differ between nodes and raft_nodes. After spending a bunch of time tracking down how all that stuff works, it turns out that’s normal. The fix we’ve pushed out today makes it so that both the nodes and raft_nodes ids are sent in the heartbeats to accommodate that scenario.
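
As a rough illustration of the idea (not the actual LXD heartbeat structure), a heartbeat entry that carries both ids lets the receiver translate a raft id into the matching node id instead of assuming the two are equal. The type `heartbeatMember`, its fields, the helper functions and the sample addresses below are all hypothetical.

```go
package main

import "fmt"

// heartbeatMember is a hypothetical heartbeat entry carrying both ids.
type heartbeatMember struct {
	ID      int64  // id of the row in the nodes table
	RaftID  int64  // id of the row in the raft_nodes table
	Address string // cluster address of the member
}

// nodeIDForRaftID translates a raft_nodes id into the matching nodes id.
// The boolean is false when no member carries that raft id.
func nodeIDForRaftID(members []heartbeatMember, raftID int64) (int64, bool) {
	for _, m := range members {
		if m.RaftID == raftID {
			return m.ID, true
		}
	}
	return 0, false
}

// sampleMembers returns made-up data in which the two id spaces disagree.
func sampleMembers() []heartbeatMember {
	return []heartbeatMember{
		{ID: 1, RaftID: 3, Address: "10.0.0.1:8443"},
		{ID: 2, RaftID: 1, Address: "10.0.0.2:8443"},
		{ID: 3, RaftID: 2, Address: "10.0.0.3:8443"},
	}
}

func main() {
	members := sampleMembers()
	for _, raftID := range []int64{1, 2, 3, 4} {
		if id, ok := nodeIDForRaftID(members, raftID); ok {
			fmt.Printf("raft id %d -> node id %d\n", raftID, id)
		} else {
			fmt.Printf("raft id %d -> no matching node\n", raftID)
		}
	}
}
```

With both ids present in each entry, a lookup by raft id no longer depends on the two tables sharing the same numbering.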

Everything is online after unsetting the offline_threshold.

Thanks for the raft-nodes fix. That’s the only thing right now. :grin: