We have an LXD cluster with 6 nodes, all currently running snap LXD 4.10 on Ubuntu 20. On all nodes we see this message in syslog repeatedly, roughly five times every five minutes, even on the node that is mentioned as “offline”.
Jan 27 10:46:37 angstel lxd.daemon[1275737]: t=2021-01-27T10:46:37+0100 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:172.16.16.45:8443 RaftID:2 RaftRole:0 Raft:true LastHeartbeat:2021-01-27 10:45:40.16286586 +0100 CET Online:false updated:false}"
As I said, everything appears to be OK, but I’m not sure if that is completely true.
I’d start by making sure that all systems have their clocks in sync.
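For example, a rough way to compare clocks across the cluster (the hostnames are placeholders for your six members, and this assumes passwordless SSH between them):

```shell
# Print each node's UTC epoch seconds side by side; differences of more
# than a second or two between nodes would point at clock skew.
for h in node1 node2 node3 node4 node5 node6; do
    printf '%-8s %s\n' "$h" "$(ssh "$h" date -u +%s)"
done
# `timedatectl status` on each node also shows whether NTP sync is active.
```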
If the clocks are in sync, then it could be some kind of network weirdness preventing the current leader from reaching that other system, or some odd LXD bug.
I’d probably run systemctl reload snap.lxd.daemon on all systems just to make sure it’s not an LXD issue; if it persists, then look closer at the network side of things.
Let me add that the node reported as “Excluding offline node” is always the same one: luts, at 172.16.16.45. And note that it is one of the nodes with DATABASE=YES.
On all 6 nodes I see the very same warning, with the same details.
Anyway, I did a reload of that luts system. After that, the behavior is still the same, and it is still a database node.
Then I did a reload on the next system, ijssel, also a database node. Now this one is no longer a database node; the role was taken over by another node.
Which makes me wonder: if a database node gets a reload, is it expected that some other node takes over the database role? If so, then this did not happen on luts, the suspicious node.
Hmm. I did a reload on a third system. It was a raft node, which is what I think a database node is called. Now that node keeps emitting “Replace current raft nodes” messages.
Jan 28 17:22:04 maas lxd.daemon[415728]: t=2021-01-28T17:22:04+0100 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter} {ID:1 Address:172.16.16.54:8443 Role:stand-by}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter} {ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:6 Address:172.16.16.76:8443 Role:voter} {ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter} {ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter}]"
Jan 28 17:22:12 maas lxd.daemon[415728]: t=2021-01-28T17:22:12+0100 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by} {ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter}]"
Jan 28 17:22:14 maas lxd.daemon[415728]: t=2021-01-28T17:22:14+0100 lvl=dbug msg="Replace current raft nodes with [{ID:5 Address:172.16.16.20:8443 Role:spare} {ID:6 Address:172.16.16.76:8443 Role:voter} {ID:1 Address:172.16.16.54:8443 Role:stand-by} {ID:2 Address:172.16.16.45:8443 Role:voter} {ID:3 Address:172.16.16.59:8443 Role:voter} {ID:4 Address:172.16.16.33:8443 Role:stand-by}]"
Normally during a clean shutdown the voter role will be transitioned to another node if there’s one available. So it’s indeed a bit suspicious that it didn’t happen when you reloaded the problematic server.
Did you confirm that it actually restarted LXD? (Look at the LXD process run time)
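For example, something along these lines (generic procps tools, nothing LXD-specific):

```shell
# Show when the lxd daemon process started and how long it has been running;
# if the start time predates the reload, the daemon never actually restarted.
# pgrep -o picks the oldest matching process (the daemon itself).
pid=$(pgrep -o -x lxd)
ps -o pid,lstart,etime -p "$pid"
```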
Jan 28 17:11:56 maas lxd.daemon[2135069]: => LXD is reloading
Jan 28 17:11:56 maas systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jan 28 17:11:56 maas systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Jan 28 17:11:56 maas systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 1.
Jan 28 17:11:56 maas systemd[1]: Stopped Service for snap application lxd.daemon.
Jan 28 17:11:56 maas systemd[1]: Started Service for snap application lxd.daemon.
Jan 28 17:11:56 maas lxd.daemon[415216]: => Preparing the system (19009)
Jan 28 17:11:56 maas lxd.daemon[415216]: ==> Loading snap configuration
...
Now that I look at the logs of this node, I think it has debug logging enabled. Did I do that myself? How do I switch it off?
Jan 28 17:12:00 maas lxd.daemon[415216]: => Starting LXD
Jan 28 17:12:00 maas lxd.daemon[415728]: t=2021-01-28T17:12:00+0100 lvl=info msg="LXD 4.10 is starting in normal mode" >
...
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=info msg="Initializing global database"
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Found cert" name=0
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=warn msg="Dqlite: attempt 0: server 172.16.16.2>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=warn msg="Dqlite: attempt 0: server 172.16.16.3>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Dqlite: attempt 0: server 172.16.16.4>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Firewall xtables detected iptables is>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=info msg="Firewall loaded driver \"xtables\""
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Notify node 172.16.16.54:8443 of stat>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Notify node 172.16.16.45:8443 of stat>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Notify node 172.16.16.59:8443 of stat>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Notify node 172.16.16.33:8443 of stat>
Jan 28 17:12:01 maas lxd.daemon[415728]: t=2021-01-28T17:12:01+0100 lvl=dbug msg="Notify node 172.16.16.76:8443 of stat>
@stgraber
Alas, after being quiet for several hours, the “Excluding” messages for 172.16.16.45 (luts) are back. I’m going to unmark the Solution, sorry.
Somewhere during the night, on 172.16.16.20 (maas), I see these messages:
Jan 29 04:19:10 maas lxd.daemon[472609]: t=2021-01-29T04:19:10+0100 lvl=warn msg="Dqlite: attempt 0: server 172.16.16.20:8443: no known leader"
Jan 29 04:19:10 maas lxd.daemon[472609]: t=2021-01-29T04:19:10+0100 lvl=warn msg="Dqlite: attempt 0: server 172.16.16.33:8443: no known leader"
Jan 29 04:19:13 maas lxd.daemon[472609]: t=2021-01-29T04:19:13+0100 lvl=warn msg="Failed to get events from node 172.16.16.45:8443: Unable to connect to: 172.16.16.45:8443"
Jan 29 04:19:14 maas lxd.daemon[472609]: t=2021-01-29T04:19:14+0100 lvl=warn msg="Failed to get events from node 172.16.16.45:8443: Unable to connect to: 172.16.16.45:8443"
...
Jan 29 04:19:24 maas lxd.daemon[472609]: t=2021-01-29T04:19:24+0100 lvl=warn msg="Failed to get events from node 172.16.16.45:8443: Unable to connect to: 172.16.16.45:8443"
Jan 29 04:19:25 maas lxd.daemon[472609]: t=2021-01-29T04:19:25+0100 lvl=warn msg="Failed to get events from node 172.16.16.45:8443: Unable to connect to: 172.16.16.45:8443"
Jan 29 04:21:01 maas lxd.daemon[472609]: t=2021-01-29T04:21:01+0100 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:172.16.16.45:8443 RaftID:2 RaftRole:2 Raft:true LastHeartbeat:2021-01-29 04:20:35.837700425 +0100 CET Online:false updated:false}"
Jan 29 04:21:42 maas lxd.daemon[472609]: t=2021-01-29T04:21:42+0100 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:172.16.16.45:8443 RaftID:2 RaftRole:2 Raft:true LastHeartbeat:2021-01-29 04:21:14.140526142 +0100 CET Online:false updated:false}"
Jan 29 04:26:35 maas lxd.daemon[472609]: t=2021-01-29T04:26:35+0100 lvl=warn msg="Excluding offline node from refresh: {ID:2 Address:172.16.16.45:8443 RaftID:2 RaftRole:2 Raft:true LastHeartbeat:2021-01-29 04:26:05.525810415 +0100 CET Online:false updated:false}"
From that moment on it spits out the Excluding messages. And as you may already have guessed, at 4:19 there was a “snap refresh” on 172.16.16.45 (luts). As far as I can see there are just normal messages for the reload.
BTW, do you remember that I accidentally uninstalled snap lxd on a system on Jan 6? That was an unfortunate keyboard mix-up. With your help I managed to get it back online, apparently without much damage. All systems (6 cluster nodes, and their containers) were behaving normally.
Then on Jan 11 there was a snap refresh, lxd 18772 => 18884. After that, the continuous flow of Excluding 172.16.16.45 messages has been issued, in bursts every five minutes. So my cock-up on Jan 6 caused something; the question is: what? Strangely, it stopped for a few hours yesterday after I manually reloaded 3 of the 6 nodes, but the problem came back after a snap refresh on that suspicious system 172.16.16.45 (luts).
This morning I did a reload on 172.16.16.45 (luts), but that didn’t help. Maybe I will do a reload on the other nodes as well, but it feels like voodoo magic.
This is odd. Maybe the output of lxd sql global "SELECT * FROM nodes;" on any of the systems, and lxd sql local "SELECT * FROM raft_nodes;" on each of the systems, would help understand what’s going on?
That’s looking good, are you still seeing those Unable to connect type messages on regular interval on one of the nodes?
They should all be logging the excluding ... message, but only the leader should be logging the unable to connect. Once you identify which one is the leader, you can try connecting to port 8443 of the excluded machine from there to see what may be causing the issue.
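For example, a plain TCP reachability check from the leader (the address is the excluded node from your logs; bash’s built-in /dev/tcp avoids needing any extra tools):

```shell
# Try to open a TCP connection to the excluded node's LXD port from the
# leader; a timeout or refusal here would explain the offline marking.
if timeout 5 bash -c 'exec 3<>/dev/tcp/172.16.16.45/8443' 2>/dev/null; then
    echo "port 8443 reachable"
else
    echo "port 8443 NOT reachable"
fi
```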
The only Unable to connect messages that I have seen since last Friday were given when I did a reload this morning. Other than that, there were no Unable messages.
Oh, and BTW, about those Unable to connect messages I mentioned four days ago: those were on a system where I had debug enabled (left over from a previous investigation). I have disabled debug since then.
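For anyone finding this thread later: if I understood the snap options correctly, debug logging is toggled roughly like this (daemon.debug is my reading of the snap’s config key, so double-check with snap get lxd first):

```shell
# Turn off LXD debug logging in the snap; a daemon reload is needed
# for the change to take effect. (daemon.debug is my understanding of
# the snap's config option, not verified against the docs.)
sudo snap set lxd daemon.debug=false
sudo systemctl reload snap.lxd.daemon
```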
Would it maybe help if I enable debug on 172.16.16.45?
That’s very odd. I can’t really explain the odd batches of heartbeats on the leader, other than it maybe getting overloaded and building up a backlog.
The actual error causing that one system to be considered offline is a timeout, so it suggests that the leader failed to reach the internal database endpoint for a few seconds and so marked the server as bad.
Any issue with high load on the leader or on that node that keeps being marked offline?
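If you want to check, something like this on both machines (the leader and the node being marked offline) would show it; these are standard procps tools:

```shell
# Quick load snapshot on a node suspected of stalling dqlite heartbeats.
uptime    # 1/5/15-minute load averages; sustained values near or above
          # the core count during the bursts would be suspicious
free -m   # memory pressure; heavy swapping can stall the daemon for seconds
```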