So I’m not sure exactly where the bug is yet, but it seems to be caused by this sequence of events:
- lxdnode1 (10.4.90.2), which is currently the leader, starts a heartbeat round:
Apr 29 00:25:40 lxdnode1 lxd.daemon[12168]: t=2020-04-29T00:25:40-0400 lvl=dbug msg="Starting heartbeat round"
- It can successfully perform the heartbeat against lxdnode3 (10.4.90.4):
Apr 29 00:25:43 lxdnode1 lxd.daemon[12168]: t=2020-04-29T00:25:43-0400 lvl=dbug msg="Successful heartbeat for 10.4.90.4:8443"
Apr 29 00:25:43 lxdnode3 lxd.daemon[9308]: t=2020-04-29T00:25:43-0400 lvl=dbug msg="Replace current raft nodes with [{ID:4 Address:10.4.90.4:8443 Role:voter} {ID:1 Address:10.4.90.5:8443 Role:stand-by} {ID:2 Address:10.4.90.2:8443 Role:voter} {ID:3 Address:10.4.90.3:8443 Role:voter}]"
and against lxdnode2 (10.4.90.3):
Apr 29 00:25:45 lxdnode1 lxd.daemon[12168]: t=2020-04-29T00:25:45-0400 lvl=dbug msg="Sending heartbeat to 10.4.90.3:8443"
Apr 29 00:25:45 lxdnode2 lxd.daemon[12565]: t=2020-04-29T00:25:45-0400 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:10.4.90.5:8443 Role:stand-by} {ID:2 Address:10.4.90.2:8443 Role:voter} {ID:3 Address:10.4.90.3:8443 Role:voter} {ID:4 Address:10.4.90.4:8443 Role:voter}]"
- It tries to perform the heartbeat against starbase6 (10.4.90.5):
Apr 29 00:25:46 lxdnode1 lxd.daemon[12168]: t=2020-04-29T00:25:46-0400 lvl=dbug msg="Sending heartbeat to 10.4.90.5:8443"
- Around that time starbase6 seems to be at least partially online, but either unresponsive (high CPU usage on the host running this VM?) or suffering some network issue, so the heartbeat fails (see the sketch after these logs):
Apr 29 00:25:31 starbase6 lxd.daemon[16877]: t=2020-04-29T00:25:31-0400 lvl=dbug msg="Rebalance cluster"
Apr 29 00:25:48 starbase6 lxd.daemon[16877]: t=2020-04-29T00:25:48-0400 lvl=dbug msg="Found cert" name=0
Apr 29 00:25:48 starbase6 lxd.daemon[16877]: t=2020-04-29T00:25:48-0400 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:10.4.90.5:8443 Role:stand-by} {ID:2 Address:10.4.90.2:8443 Role:voter} {ID:3 Address:10.4.90.3:8443 Role:voter} {ID:4 Address:10.4.90.4:8443 Role:voter}]"
Apr 29 00:25:48 starbase6 lxd.daemon[16877]: t=2020-04-29T00:25:48-0400 lvl=dbug msg="Rebalance cluster"
Apr 29 00:25:49 starbase6 dnsmasq-dhcp[17134]: RTR-ADVERT(lanbridge) fd42:13c6:b2a7:eef0::
Apr 29 00:25:48 lxdnode1 lxd.daemon[12168]: t=2020-04-29T00:25:48-0400 lvl=dbug msg="Failed heartbeat for 10.4.90.5:8443: failed to send HTTP request: Put \"https://10.4.90.5:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
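For reference, here’s a minimal standalone sketch of this failure mode (not LXD’s actual code): a server that accepts the TCP connection but never writes a response, and a client with a short timeout that then fails exactly "while awaiting headers", just like the heartbeat above:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// A "stuck" server: it accepts TCP connections but never sends an
	// HTTP response, like a node that is reachable but unresponsive.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				time.Sleep(time.Minute) // hold the connection open, say nothing
				c.Close()
			}(conn)
		}
	}()

	// A client with a short timeout, analogous to the heartbeat sender.
	client := &http.Client{Timeout: 2 * time.Second}
	_, err = client.Get("http://" + ln.Addr().String() + "/")
	fmt.Println(err)
	// Prints something like:
	// Get "http://127.0.0.1:...": context deadline exceeded
	// (Client.Timeout exceeded while awaiting headers)
}
```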
We do have integration tests that exercise a node going down and failing to respond to heartbeats, but they take the node completely offline, so connections fail in the first place. This seems to be a different situation: starbase6 apparently accepts the TCP/HTTP connection from lxdnode1 but then appears essentially stuck from lxdnode1’s point of view, which makes the heartbeat fail in a way that apparently leads to some bad state.
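Something like this httptest-based stub (illustrative, not our real test harness) could cover the missing case:

```go
package cluster_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

// Hypothetical test sketch: the server accepts the connection (so the dial
// succeeds, unlike the fully-offline case) but blocks without ever writing
// headers, so the heartbeat request times out awaiting a response.
func TestHeartbeatAgainstStuckNode(t *testing.T) {
	stuck := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-r.Context().Done() // stay silent until the client gives up
	}))
	defer stuck.Close()

	client := &http.Client{Timeout: time.Second}
	if _, err := client.Get(stuck.URL); err == nil {
		t.Fatal("expected a timeout awaiting headers, got a response")
	}
}
```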
I’ll try tweaking our time tolerance a bit and take a look at that code in general.
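To illustrate what I mean by time tolerance: the knob is essentially a deadline on the heartbeat request, so raising it trades slower failure detection for more resilience to short hiccups. A rough sketch, with a made-up sendHeartbeat helper rather than the real code:

```go
package main

import (
	"bytes"
	"context"
	"net/http"
	"time"
)

// sendHeartbeat is an illustrative stand-in for the real heartbeat sender;
// the name and signature are hypothetical. The tolerance is applied as a
// context deadline on the PUT to /internal/database.
func sendHeartbeat(client *http.Client, url string, payload []byte, tolerance time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), tolerance)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err // e.g. context deadline exceeded, as in the logs above
	}
	resp.Body.Close()
	return nil
}
```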
In the meantime, how stable do you think your ESXi VMs are in terms of CPU and network? Could there be hiccups lasting more than a second, where either the CPU or the network is unresponsive?