Excluding offline node from refresh, but everything seems ok

We have an LXD cluster with 6 nodes, all currently running snap lxd 4.10 on Ubuntu 20.04. On all nodes we see this message in syslog repeatedly, roughly five occurrences every 5 minutes, even on the node that is mentioned as “offline”.

Jan 27 10:46:37 angstel lxd.daemon[1275737]: t=2021-01-27T10:46:37+0100 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:172.16.16.45:8443 RaftID:2 RaftRole:0 Raft:true LastHeartbeat:2021-01-27 10:45:40.16286586 +0100 CET Online:false updated:false}"
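
For reference, the frequency can be checked on a node with something like this (assuming the standard snap.lxd.daemon unit name):

journalctl -u snap.lxd.daemon --since "1 hour ago" | grep -c "Excluding offline node"   # count over the last hour
journalctl -u snap.lxd.daemon -f | grep "Excluding offline node"                        # follow live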

As I said, everything appears to be OK, but I’m not sure if that is completely true.

$ sudo lxc cluster list
+---------+---------------------------+----------+--------+-------------------+--------------+----------------+
|  NAME   |            URL            | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE | FAILURE DOMAIN |
+---------+---------------------------+----------+--------+-------------------+--------------+----------------+
| angstel | https://172.16.16.76:8443 | NO       | ONLINE | fully operational | x86_64       | default        |
+---------+---------------------------+----------+--------+-------------------+--------------+----------------+
| ijssel  | https://172.16.16.54:8443 | YES      | ONLINE | fully operational | x86_64       | default        |
+---------+---------------------------+----------+--------+-------------------+--------------+----------------+
| luts    | https://172.16.16.45:8443 | YES      | ONLINE | fully operational | x86_64       | default        |
+---------+---------------------------+----------+--------+-------------------+--------------+----------------+
| maas    | https://172.16.16.20:8443 | YES      | ONLINE | fully operational | x86_64       | default        |
+---------+---------------------------+----------+--------+-------------------+--------------+----------------+
| rijn    | https://172.16.16.59:8443 | NO       | ONLINE | fully operational | x86_64       | default        |
+---------+---------------------------+----------+--------+-------------------+--------------+----------------+
| roer    | https://172.16.16.33:8443 | NO       | ONLINE | fully operational | x86_64       | default        |
+---------+---------------------------+----------+--------+-------------------+--------------+----------------+

My question: where should I look to identify why we see this warning?

I’d start by making sure that all systems have their clocks in sync.

Then if that’s the case, it could be some kind of network weirdness preventing the current leader from reaching that other system or some odd LXD bug.
I’d probably run systemctl reload snap.lxd.daemon on all systems just to make sure it’s not an LXD issue; if it persists, then look closer at the network side of things.
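
As a minimal sketch, assuming SSH access to all members (hostnames taken from the cluster list above), checking the clocks and reloading could look like:

# Compare clock state across all cluster members (hostnames assumed)
for h in angstel ijssel luts maas rijn roer; do
    echo "== $h =="
    ssh "$h" 'timedatectl | grep -E "Local time|synchronized"; date +%s'
done

# Reload the LXD daemon on each member; only the API restarts, containers keep running
for h in angstel ijssel luts maas rijn roer; do
    ssh "$h" 'sudo systemctl reload snap.lxd.daemon'
done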

The clocks are in sync. These systems are clients in a FreeIPA network, and they require synced clocks.

Before I do systemctl reload snap.lxd.daemon, will the containers continue to run when I do that?

Yep, only the API goes down during a reload; the containers keep running.

Let me add that the node reported in the “Excluding offline node” warning is always the same one: luts at 172.16.16.45. And note that it is one of the nodes with DATABASE=YES.

On all 6 nodes I see the very same warning, with the same details.

Anyway, I did a reload of that luts system. After that, the behavior is still the same, and it is still a database node.

Then I did a reload on the next system, ijssel, also a database node. Now this one is not a database node anymore; the role was taken over by another node.

Which makes me wonder: if a database node gets a reload, is it expected that some other node takes over the database role? If so, that did not happen on luts, the suspicious node.
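
One way to watch that handover would be to snapshot the DATABASE column before and after a reload, e.g. (csv column positions assumed):

lxc cluster list --format csv | awk -F, '{print $1, $3}'   # member name and DATABASE flag
sudo systemctl reload snap.lxd.daemon                      # reload the member under test
lxc cluster list --format csv | awk -F, '{print $1, $3}'   # re-check which members hold the role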

Hmm. I did a reload on a third system. It was a raft node, which I believe is what a database node is called. Now that node keeps logging “Replace current raft nodes” messages.

Jan 28 17:22:04 maas lxd.daemon[415728]: t=2021-01-28T17:22:04+0100 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter} {ID:1 Address:172.16.16.54:8443 Role:stand-by}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter} {ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:6 Address:172.16.16.76:8443 Role:voter} {ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter} {ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter}]"
Jan 28 17:22:14 maas lxd.daemon[415728]: t=2021-01-28T17:22:14+0100 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter} {ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by}]"

Normally during a clean shutdown the voter role will be transitioned to another node if there’s one available. So it’s indeed a bit suspicious that it didn’t happen when you reloaded the problematic server.

Did you confirm that it actually restarted LXD? (Look at the LXD process run time)
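
For example, something along these lines shows the elapsed run time of the daemon process:

pgrep -a lxd                              # list LXD-related processes
ps -o pid,etime,cmd -p "$(pidof lxd)"     # elapsed run time of the lxd daemon process(es)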

Yes, it restarted.

Jan 28 17:11:56 maas lxd.daemon[2135069]: => LXD is reloading
Jan 28 17:11:56 maas systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jan 28 17:11:56 maas systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Jan 28 17:11:56 maas systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 1.
Jan 28 17:11:56 maas systemd[1]: Stopped Service for snap application lxd.daemon.
Jan 28 17:11:56 maas systemd[1]: Started Service for snap application lxd.daemon.
Jan 28 17:11:56 maas lxd.daemon[415216]: => Preparing the system (19009)
Jan 28 17:11:56 maas lxd.daemon[415216]: ==> Loading snap configuration
...

Now that I look at the logs of this node, I think it has debug logging enabled. Did I do that myself? How do I switch it off?

Jan 28 17:12:00 maas lxd.daemon[415216]: => Starting LXD
Jan 28 17:12:00 maas lxd.daemon[415728]: t=2021-01-28T17:12:00+0100 lvl=info msg="LXD 4.10 is starting in normal mode" >
...
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=info msg="Initializing global database"
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Found cert" name=0
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=warn msg="Dqlite: attempt 0: server 172.16.16.2>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=warn msg="Dqlite: attempt 0: server 172.16.16.3>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Dqlite: attempt 0: server 172.16.16.4>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Firewall xtables detected iptables is>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=info msg="Firewall loaded driver \"xtables\""
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Notify node 172.16.16.54:8443 of stat>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Notify node 172.16.16.45:8443 of stat>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Notify node 172.16.16.59:8443 of stat>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Notify node 172.16.16.33:8443 of stat>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Notify node 172.16.16.76:8443 of stat>

Yes, I did it myself, and totally forgot about it. This is in my history:

$ sudo snap set lxd daemon.debug=true
$ systemctl reload snap.lxd.daemon
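
So switching it off again should just be the reverse setting plus a reload:

$ sudo snap set lxd daemon.debug=false
$ systemctl reload snap.lxd.daemon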

It looks like the nodes are happy now. At least, they are all silent, not reporting warnings or errors.

@stgraber
Alas, after being quiet for several hours, the “Excluding” messages for 172.16.16.45 (luts) are back. I’m going to unmark the Solution, sorry.

Somewhere during the night, on 172.16.16.20 (maas), I see these messages:

Jan 29 04:19:10 maas lxd.daemon[472609]: t=2021-01-29T04:19:10+0100 lvl=warn msg="Dqlite: attempt 0: server 172.16.16.20:8443: no known leader"
Jan 29 04:19:10 maas lxd.daemon[472609]: t=2021-01-29T04:19:10+0100 lvl=warn msg="Dqlite: attempt 0: server 172.16.16.33:8443: no known leader"
Jan 29 04:19:13 maas lxd.daemon[472609]: t=2021-01-29T04:19:13+0100 lvl=warn msg="Failed to get events from node 172.16.16.45:8443: Unable to connect to: 172.16.16.45:8443"
Jan 29 04:19:14 maas lxd.daemon[472609]: t=2021-01-29T04:19:14+0100 lvl=warn msg="Failed to get events from node 172.16.16.45:8443: Unable to connect to: 172.16.16.45:8443"
...
Jan 29 04:19:24 maas lxd.daemon[472609]: t=2021-01-29T04:19:24+0100 lvl=warn msg="Failed to get events from node 172.16.16.45:8443: Unable to connect to: 172.16.16.45:8443"
Jan 29 04:19:25 maas lxd.daemon[472609]: t=2021-01-29T04:19:25+0100 lvl=warn msg="Failed to get events from node 172.16.16.45:8443: Unable to connect to: 172.16.16.45:8443"
Jan 29 04:21:01 maas lxd.daemon[472609]: t=2021-01-29T04:21:01+0100 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:172.16.16.45:8443 RaftID:2 RaftRole:2 Raft:true LastHeartbeat:2021-01-29 04:20:35.837700425 +0100 CET Online:false updated:false}"
Jan 29 04:21:42 maas lxd.daemon[472609]: t=2021-01-29T04:21:42+0100 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:172.16.16.45:8443 RaftID:2 RaftRole:2 Raft:true LastHeartbeat:2021-01-29 04:21:14.140526142 +0100 CET Online:false updated:false}"
Jan 29 04:26:35 maas lxd.daemon[472609]: t=2021-01-29T04:26:35+0100 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:172.16.16.45:8443 RaftID:2 RaftRole:2 Raft:true LastHeartbeat:2021-01-29 04:26:05.525810415 +0100 CET Online:false updated:false}"

From that moment on it spits out the Excluding messages. And as you may already have guessed, at 4:19 there was a “snap refresh” on 172.16.16.45 (luts). As far as I can see, there are just normal messages for that reload.
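
The refresh timing can be cross-checked against snapd’s own records, for example:

snap changes lxd    # recent snap operations for lxd, with their timestamps
snap list lxd       # currently installed revision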

Some more analysis.

BTW, do you remember that I accidentally uninstalled the lxd snap on one system on Jan 6? That was an unfortunate keyboard mixup. With your help I managed to get it back online, apparently without much damage. All systems (6 cluster nodes, and their containers) were behaving normally.

Then on Jan 11 there was a snap refresh, lxd 18772 => 18884. Since then, the continuous flow of Excluding 172.16.16.45 messages has been issued, in bursts roughly every 5 minutes. So my cock-up on Jan 6 caused something; the question is: what? Strangely, it stopped for a few hours yesterday after I manually reloaded 3 of the 6 nodes, but the problem came back after the snap refresh on that suspicious system, 172.16.16.45 (luts).

This morning, I did a reload on 172.16.16.45 (luts), but that didn’t help. Maybe I will do a reload on the other nodes as well, but it feels like voodoo magic.

@stgraber Sorry to “ping”, but did you see that the problem is still there?

This is odd. Maybe the output of lxd sql global "SELECT * FROM nodes;" on any of the systems and lxd sql local "SELECT * FROM raft_nodes;" on each of the systems would help understand what’s going on?

lxd sql global "SELECT * FROM nodes;"
+----+---------+-------------+-------------------+--------+----------------+-------------------------------------+---------+------+-------------------+
| id |  name   | description |      address      | schema | api_extensions |              heartbeat              | pending | arch | failure_domain_id |
+----+---------+-------------+-------------------+--------+----------------+-------------------------------------+---------+------+-------------------+
| 1  | ijssel  |             | 172.16.16.54:8443 | 44     | 225            | 2021-02-01T17:21:14.548302897+01:00 | 0       | 2    | <nil>             |
| 2  | luts    |             | 172.16.16.45:8443 | 44     | 225            | 2021-02-01T17:20:33.803567612+01:00 | 0       | 2    | <nil>             |
| 3  | rijn    |             | 172.16.16.59:8443 | 44     | 225            | 2021-02-01T17:21:14.548483184+01:00 | 0       | 2    | <nil>             |
| 4  | roer    |             | 172.16.16.33:8443 | 44     | 225            | 2021-02-01T17:21:14.548636835+01:00 | 0       | 2    | <nil>             |
| 5  | maas    |             | 172.16.16.20:8443 | 44     | 225            | 2021-02-01T17:21:14.548779208+01:00 | 0       | 2    | <nil>             |
| 6  | angstel |             | 172.16.16.76:8443 | 44     | 225            | 2021-02-01T17:21:14.54891832+01:00  | 0       | 2    | <nil>             |
+----+---------+-------------+-------------------+--------+----------------+-------------------------------------+---------+------+-------------------+
root@maas:~# lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------+------+
| id |      address      | role |
+----+-------------------+------+
| 1  | 172.16.16.54:8443 | 1    |
| 2  | 172.16.16.45:8443 | 0    |
| 3  | 172.16.16.59:8443 | 2    |
| 4  | 172.16.16.33:8443 | 0    |
| 5  | 172.16.16.20:8443 | 1    |
| 6  | 172.16.16.76:8443 | 0    |
+----+-------------------+------+
root@ijssel:~# lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------+------+
| id |      address      | role |
+----+-------------------+------+
| 1  | 172.16.16.54:8443 | 1    |
| 2  | 172.16.16.45:8443 | 0    |
| 3  | 172.16.16.59:8443 | 2    |
| 4  | 172.16.16.33:8443 | 0    |
| 5  | 172.16.16.20:8443 | 1    |
| 6  | 172.16.16.76:8443 | 0    |
+----+-------------------+------+
root@rijn:~# lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------+------+
| id |      address      | role |
+----+-------------------+------+
| 1  | 172.16.16.54:8443 | 1    |
| 2  | 172.16.16.45:8443 | 0    |
| 3  | 172.16.16.59:8443 | 2    |
| 4  | 172.16.16.33:8443 | 0    |
| 5  | 172.16.16.20:8443 | 1    |
| 6  | 172.16.16.76:8443 | 0    |
+----+-------------------+------+
root@luts:~# lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------+------+
| id |      address      | role |
+----+-------------------+------+
| 1  | 172.16.16.54:8443 | 1    |
| 2  | 172.16.16.45:8443 | 0    |
| 3  | 172.16.16.59:8443 | 2    |
| 4  | 172.16.16.33:8443 | 0    |
| 5  | 172.16.16.20:8443 | 1    |
| 6  | 172.16.16.76:8443 | 0    |
+----+-------------------+------+
root@roer:~# lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------+------+
| id |      address      | role |
+----+-------------------+------+
| 1  | 172.16.16.54:8443 | 1    |
| 2  | 172.16.16.45:8443 | 0    |
| 3  | 172.16.16.59:8443 | 2    |
| 4  | 172.16.16.33:8443 | 0    |
| 5  | 172.16.16.20:8443 | 1    |
| 6  | 172.16.16.76:8443 | 0    |
+----+-------------------+------+
root@angstel:~# lxd sql local "SELECT * FROM raft_nodes;"
+----+-------------------+------+
| id |      address      | role |
+----+-------------------+------+
| 1  | 172.16.16.54:8443 | 1    |
| 2  | 172.16.16.45:8443 | 0    |
| 3  | 172.16.16.59:8443 | 2    |
| 4  | 172.16.16.33:8443 | 0    |
| 5  | 172.16.16.20:8443 | 1    |
| 6  | 172.16.16.76:8443 | 0    |
+----+-------------------+------+

That’s looking good. Are you still seeing those Unable to connect type messages at regular intervals on one of the nodes?

They should all be logging the Excluding ... message, but only the leader should be logging the Unable to connect one. Once you identify which one is the leader, you can try connecting to port 8443 of the excluded machine from there to see what may be causing issues.
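
A couple of quick probes from the leader would be something like this; the timing matters more than the HTTP response itself:

nc -zv -w 5 172.16.16.45 8443                            # plain TCP reachability of the excluded member
time curl -k -m 5 https://172.16.16.45:8443/ >/dev/null  # TLS handshake + API response time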

The only Unable to connect messages that I have seen since last Friday appeared when I did a reload this morning. Other than that, there were no Unable messages.

Oh, and BTW, what I mentioned four days ago about those Unable to connect messages: that was on a system where I had debug enabled (left over from a previous investigation). I have disabled debug since then.

Would it maybe help if I enable debug on 172.16.16.45?

Run lxc monitor --type=logging --pretty on each of them, one of them should show you those Unable to connect messages after a while.

The monitor API allows for access to local debug messages even when the daemon isn’t running in debug mode.
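
Since that output is quite verbose, filtering it down may help, along the lines of (grep patterns taken from the messages above):

lxc monitor --type=logging --pretty 2>&1 | grep -E "Unable to connect|Failed heartbeat|Excluding offline"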

Wow, that’s massive logging.

On one of them I see that it is sending heartbeats; that’s the leader, I guess.

DBUG[02-01|22:48:09] Heartbeat updating local raft nodes to [{ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:spare} {ID:4 Address:172.16.16.33:8443 Role:voter} {ID:5 Address:172.16.16.20:8443 Role:stand-by} {ID:6 Address:172.16.16.76:8443 Role:voter}] 
DBUG[02-01|22:48:09] Starting heartbeat round 
DBUG[02-01|22:48:09] Sending heartbeat to 172.16.16.76:8443 
DBUG[02-01|22:48:09] Sending heartbeat request to 172.16.16.76:8443 
DBUG[02-01|22:48:09] Successful heartbeat for 172.16.16.76:8443 
DBUG[02-01|22:48:10] Sending heartbeat to 172.16.16.54:8443 
DBUG[02-01|22:48:10] Sending heartbeat request to 172.16.16.54:8443 
DBUG[02-01|22:48:10] Successful heartbeat for 172.16.16.54:8443 
DBUG[02-01|22:48:11] Found cert                               name=0
DBUG[02-01|22:48:12] Sending heartbeat to 172.16.16.59:8443 
DBUG[02-01|22:48:12] Sending heartbeat request to 172.16.16.59:8443 
DBUG[02-01|22:48:12] Successful heartbeat for 172.16.16.59:8443 
DBUG[02-01|22:48:14] Sending heartbeat to 172.16.16.20:8443 
DBUG[02-01|22:48:14] Sending heartbeat request to 172.16.16.20:8443 
DBUG[02-01|22:48:14] Successful heartbeat for 172.16.16.20:8443 
DBUG[02-01|22:48:16] Sending heartbeat to 172.16.16.45:8443 
DBUG[02-01|22:48:16] Sending heartbeat request to 172.16.16.45:8443 
DBUG[02-01|22:48:16] Successful heartbeat for 172.16.16.45:8443 
DBUG[02-01|22:48:16] Completed heartbeat round 

You can see that all 5 respond normally to the heartbeat.

But roughly every 5 minutes things go crazy:

DBUG[02-01|22:54:55] Starting heartbeat round 
DBUG[02-01|22:54:55] Found cert                               name=0
DBUG[02-01|22:54:55] Heartbeat updating local raft nodes to [{ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:spare} {ID:4 Address:172.16.16.33:8443 Role:voter} {ID:5 Address:172.16.16.20:8443 Role:stand-by} {ID:6 Address:172.16.16.76:8443 Role:voter}] 
DBUG[02-01|22:54:55] Starting heartbeat round 
DBUG[02-01|22:54:55] Sending heartbeat to 172.16.16.76:8443 
DBUG[02-01|22:54:55] Sending heartbeat to 172.16.16.45:8443 
DBUG[02-01|22:54:55] Sending heartbeat request to 172.16.16.45:8443 
DBUG[02-01|22:54:55] Sending heartbeat request to 172.16.16.76:8443 
DBUG[02-01|22:54:55] Sending heartbeat to 172.16.16.20:8443 
DBUG[02-01|22:54:55] Sending heartbeat to 172.16.16.59:8443 
DBUG[02-01|22:54:55] Sending heartbeat request to 172.16.16.59:8443 
DBUG[02-01|22:54:55] Sending heartbeat to 172.16.16.54:8443 
DBUG[02-01|22:54:55] Sending heartbeat request to 172.16.16.54:8443 
DBUG[02-01|22:54:55] Sending heartbeat request to 172.16.16.20:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.59:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.76:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.20:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.76:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.45:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.59:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.59:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.20:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.54:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.45:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.54:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.59:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.54:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.76:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.20:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.45:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.59:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.76:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.59:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.45:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.45:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.76:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.54:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.20:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.54:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.20:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.54:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.76:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.20:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.45:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.45:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.54:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.45:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.76:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.59:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.59:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.54:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.76:8443 
DBUG[02-01|22:54:56] Sending heartbeat to 172.16.16.20:8443 
DBUG[02-01|22:54:56] Sending heartbeat request to 172.16.16.20:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.59:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.59:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.76:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.54:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.45:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.20:8443 
DBUG[02-01|22:54:56] Completed heartbeat round 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.76:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.45:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.54:8443 
DBUG[02-01|22:54:56] Successful heartbeat for 172.16.16.20:8443 
DBUG[02-01|22:54:56] Completed heartbeat round 

Next, a few normal heartbeat rounds, and then:

DBUG[02-01|22:55:39] Starting heartbeat round 
DBUG[02-01|22:55:39] Heartbeat updating local raft nodes to [{ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:spare} {ID:4 Address:172.16.16.33:8443 Role:voter} {ID:5 Address:172.16.16.20:8443 Role:stand-by} {ID:6 Address:172.16.16.76:8443 Role:voter}] 
DBUG[02-01|22:55:40] Sending heartbeat to 172.16.16.20:8443 
DBUG[02-01|22:55:40] Sending heartbeat request to 172.16.16.20:8443 
DBUG[02-01|22:55:40] Successful heartbeat for 172.16.16.20:8443 
DBUG[02-01|22:55:41] Sending heartbeat request to 172.16.16.45:8443 
DBUG[02-01|22:55:41] Sending heartbeat to 172.16.16.45:8443 
DBUG[02-01|22:55:42] Found cert                               name=0
DBUG[02-01|22:55:42] Found cert                               name=0
DBUG[02-01|22:55:42] Sending heartbeat to 172.16.16.54:8443 
DBUG[02-01|22:55:42] Sending heartbeat request to 172.16.16.54:8443 
DBUG[02-01|22:55:43] Successful heartbeat for 172.16.16.54:8443 
DBUG[02-01|22:55:43] Failed heartbeat for 172.16.16.45:8443: failed to send HTTP request: Put "https://172.16.16.45:8443/internal/database": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 
DBUG[02-01|22:55:43] Sending heartbeat to 172.16.16.59:8443 
DBUG[02-01|22:55:43] Sending heartbeat request to 172.16.16.59:8443 
DBUG[02-01|22:55:43] Successful heartbeat for 172.16.16.59:8443 
DBUG[02-01|22:55:45] Sending heartbeat request to 172.16.16.76:8443 
DBUG[02-01|22:55:45] Sending heartbeat to 172.16.16.76:8443 
DBUG[02-01|22:55:45] Successful heartbeat for 172.16.16.76:8443 
DBUG[02-01|22:55:45] Completed heartbeat round 

That’s very odd. I can’t really explain the odd batches of heartbeats on the leader, other than it maybe getting overloaded and building up a backlog?

The actual error causing that one system to be considered offline is a timeout, so it suggests that the leader failed to reach the internal database endpoint for a few seconds and so marked the server as bad.

Any issue with high load on the leader or on that node that keeps being marked offline?
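
A quick sketch for comparing the two would be to sample load and I/O on both the leader and 172.16.16.45 (luts) around a heartbeat round:

uptime              # load averages
vmstat 5 3          # a few samples of CPU, memory and swap pressure
iostat -x 5 3       # per-device I/O utilisation (from the sysstat package)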