After the snap update to 4.18 rev 21497 yesterday, I’ve got an issue where a significant number of nodes are showing as offline with "No heartbeat".
I can log in to those nodes and see requests from the leader (hetzner-06), e.g. on host hostkey-misc-phys-1:
t=2021-09-14T13:17:42+0000 lvl=dbug msg="Matched trusted cert" fingerprint=684bc7ee0154b2e1c0710fa6064771a67eb369f077ad3d2709520d3196473581 subject="CN=root@hetzner-06,O=linuxcontainers.org"
t=2021-09-14T13:17:45+0000 lvl=dbug msg="Replace current raft nodes with [{NodeInfo:{ID:16 Address:hetzner-12:8443 Role:stand-by} Name:hetzner-12} {NodeInfo:{ID:21 Address:es-hel-phys-1:8443 Role:stand-by} Name:es-hel-phys-1} {NodeInfo:{ID:27 Address:app-hel-phys-3:8443 Role:spare} Name:app-hel-phys-3} {NodeInfo:{ID:17 Address:hostkey-inference-1:8443 Role:spare} Name:hostkey-inference-1} {NodeInfo:{ID:23 Address:es-fsn-phys-2:8443 Role:stand-by} Name:es-fsn-phys-2} {NodeInfo:{ID:24 Address:es-fsn-phys-1:8443 Role:spare} Name:es-fsn-phys-1} {NodeInfo:{ID:25 Address:app-hel-phys-2:8443 Role:spare} Name:app-hel-phys-2} {NodeInfo:{ID:12 Address:monitoring:8443 Role:spare} Name:monitoring} {NodeInfo:{ID:13 Address:hetzner-10:8443 Role:stand-by} Name:hetzner-10} {NodeInfo:{ID:18 Address:hostkey-misc-phys-1:8443 Role:spare} Name:hostkey-misc-phys-1} {NodeInfo:{ID:19 Address:es-hel-phys-3:8443 Role:spare} Name:es-hel-phys-3} {NodeInfo:{ID:20 Address:es-hel-phys-2:8443 Role:spare} Name:es-hel-phys-2} {NodeInfo:{ID:26 Address:app-hel-phys-1:8443 Role:spare} Name:app-hel-phys-1} {NodeInfo:{ID:3 Address:hetzner-03:8443 Role:stand-by} Name:hetzner-03} {NodeInfo:{ID:4 Address:hetzner-04:8443 Role:voter} Name:hetzner-04} {NodeInfo:{ID:6 Address:hetzner-06:8443 Role:voter} Name:hetzner-06} {NodeInfo:{ID:1 Address:hetzner-01:8443 Role:spare} Name:hetzner-01} {NodeInfo:{ID:14 Address:hetzner-11:8443 Role:spare} Name:hetzner-11} {NodeInfo:{ID:28 Address:app-hel-phys-4:8443 Role:spare} Name:app-hel-phys-4} {NodeInfo:{ID:7 Address:hetzner-staging:8443 Role:voter} Name:hetzner-staging}]"
However, the leader doesn’t show a successful query; it logs a failed heartbeat instead:
t=2021-09-14T13:15:03+0000 lvl=warn msg="Failed heartbeat" address=hostkey-misc-phys-1:8443 err="Failed to send heartbeat request: Put \"https://hostkey-misc-phys-1:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
To rule out an issue with a single node: the leader was originally hetzner-05, which I have shut down, so the leader is now hetzner-06, where the issue persists. I can curl the endpoints of all nodes just fine from hetzner-06.
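In case it helps narrow down which nodes are affected, here is a rough sketch of how the "Failed heartbeat" warnings on the leader could be tallied per target address. It only assumes the log line format shown above; the function name `failed_heartbeat_counts` is made up for illustration and is not part of LXD.

```python
import re
import sys
from collections import Counter

# Matches the address=... field of a "Failed heartbeat" warning,
# assuming the logfmt-style layout seen in the leader's log above.
FAILED_HEARTBEAT = re.compile(r'msg="Failed heartbeat"\s+address=(\S+)')


def failed_heartbeat_counts(lines):
    """Count "Failed heartbeat" warnings per target address (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FAILED_HEARTBEAT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Read log lines from stdin and print the most frequently failing nodes first.
    for address, count in failed_heartbeat_counts(sys.stdin).most_common():
        print(f"{count:6d}  {address}")
```

Piping the leader's LXD log into the script on stdin should give one line per node with the number of failed heartbeats, which can then be compared against the list of nodes shown as offline.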