Heartbeat timeouts after upgrade to 4.18

theblazehen · September 14, 2021, 1:19pm

After the snap update to 4.18 rev 21497 yesterday, I’ve got an issue where a significant amount of nodes are showing as offline - No heartbeat.

I can log in to those nodes and see requests from the leader (hetzner-06), eg on host hostkey-misc-phys-1

t=2021-09-14T13:17:42+0000 lvl=dbug msg="Matched trusted cert" fingerprint=684bc7ee0154b2e1c0710fa6064771a67eb369f077ad3d2709520d3196473581 subject="CN=root@hetzner-06,O=linuxcontainers.org"
t=2021-09-14T13:17:45+0000 lvl=dbug msg="Replace current raft nodes with [{NodeInfo:{ID:16 Address:hetzner-12:8443 Role:stand-by} Name:hetzner-12} {NodeInfo:{ID:21 Address:es-hel-phys-1:8443 Role:stand-by} Name:es-hel-phys-1} {NodeInfo:{ID:27 Address:app-hel-phys-3:8443 Role:spare} Name:app-hel-phys-3} {NodeInfo:{ID:17 Address:hostkey-inference-1:8443 Role:spare} Name:hostkey-inference-1} {NodeInfo:{ID:23 Address:es-fsn-phys-2:8443 Role:stand-by} Name:es-fsn-phys-2} {NodeInfo:{ID:24 Address:es-fsn-phys-1:8443 Role:spare} Name:es-fsn-phys-1} {NodeInfo:{ID:25 Address:app-hel-phys-2:8443 Role:spare} Name:app-hel-phys-2} {NodeInfo:{ID:12 Address:monitoring:8443 Role:spare} Name:monitoring} {NodeInfo:{ID:13 Address:hetzner-10:8443 Role:stand-by} Name:hetzner-10} {NodeInfo:{ID:18 Address:hostkey-misc-phys-1:8443 Role:spare} Name:hostkey-misc-phys-1} {NodeInfo:{ID:19 Address:es-hel-phys-3:8443 Role:spare} Name:es-hel-phys-3} {NodeInfo:{ID:20 Address:es-hel-phys-2:8443 Role:spare} Name:es-hel-phys-2} {NodeInfo:{ID:26 Address:app-hel-phys-1:8443 Role:spare} Name:app-hel-phys-1} {NodeInfo:{ID:3 Address:hetzner-03:8443 Role:stand-by} Name:hetzner-03} {NodeInfo:{ID:4 Address:hetzner-04:8443 Role:voter} Name:hetzner-04} {NodeInfo:{ID:6 Address:hetzner-06:8443 Role:voter} Name:hetzner-06} {NodeInfo:{ID:1 Address:hetzner-01:8443 Role:spare} Name:hetzner-01} {NodeInfo:{ID:14 Address:hetzner-11:8443 Role:spare} Name:hetzner-11} {NodeInfo:{ID:28 Address:app-hel-phys-4:8443 Role:spare} Name:app-hel-phys-4} {NodeInfo:{ID:7 Address:hetzner-staging:8443 Role:voter} Name:hetzner-staging}]"

However on the leader it doesn’t show a successful query:
t=2021-09-14T13:15:03+0000 lvl=warn msg="Failed heartbeat" address=hostkey-misc-phys-1:8443 err="Failed to send heartbeat request: Put \"https://hostkey-misc-phys-1:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

To rule out an issue with a single node, the leader was hetzner-05 which I have shut down, so now the leader is hetzner-06 where the issue persists. I can curl the endpoints of all nodes just fine from hetzner-06

tomp · September 14, 2021, 1:27pm

Please make sure all members are online and reachable and that they are all running 4.18.

theblazehen · September 14, 2021, 1:35pm

All hosts are online and on the correct version

> for host in $(lxc cluster list | grep -v -- ---- | grep -v URL | awk '{print $4}'); do echo -n $host; curl -k $host; done
https://app-hel-phys-1:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://app-hel-phys-2:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://app-hel-phys-3:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://app-hel-phys-4:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://es-fsn-phys-1:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://es-fsn-phys-2:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://es-hel-phys-1:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://es-hel-phys-2:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://es-hel-phys-3:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://hetzner-01:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://hetzner-03:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://hetzner-04:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://hetzner-06:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://hetzner-10:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://hetzner-11:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://hetzner-12:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://hetzner-staging:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://hostkey-inference-1:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://hostkey-misc-phys-1:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
https://monitoring:8443{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}

tomp · September 14, 2021, 1:37pm

I cant see the version from your output above, but I’ll take your word for it that they are all on 4.18.

What version were you running before?

Also, have you checked that that the problem member is reachable from the leader?

tomp · September 14, 2021, 1:38pm

Can you show output of lxc cluster ls please.

theblazehen · September 14, 2021, 1:41pm

4.17

Yes, all hosts are reachable from the leader using same command as above.

From the leader:

+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
|        NAME         |               URL                | DATABASE | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION |  STATE  |                                    MESSAGE                                    |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| app-hel-phys-1      | https://app-hel-phys-1:8443      | NO       | x86_64       | default        |             | OFFLINE | No heartbeat for 3h19m19.001318382s (2021-09-14 10:22:15.2361866 +0000 UTC)   |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| app-hel-phys-2      | https://app-hel-phys-2:8443      | NO       | x86_64       | default        |             | OFFLINE | No heartbeat for 4h38m2.798546576s (2021-09-14 09:03:31.437157662 +0000 UTC)  |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| app-hel-phys-3      | https://app-hel-phys-3:8443      | NO       | x86_64       | default        |             | OFFLINE | No heartbeat for 3h18m54.012882832s (2021-09-14 10:22:40.226431721 +0000 UTC) |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| app-hel-phys-4      | https://app-hel-phys-4:8443      | NO       | x86_64       | default        |             | OFFLINE | No heartbeat for 3h18m2.273058206s (2021-09-14 10:23:31.96852392 +0000 UTC)   |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| es-fsn-phys-1       | https://es-fsn-phys-1:8443       | NO       | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| es-fsn-phys-2       | https://es-fsn-phys-2:8443       | YES      | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| es-hel-phys-1       | https://es-hel-phys-1:8443       | YES      | x86_64       | default        |             | OFFLINE | No heartbeat for 3h17m26.615072531s (2021-09-14 10:24:07.615577135 +0000 UTC) |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| es-hel-phys-2       | https://es-hel-phys-2:8443       | NO       | x86_64       | default        |             | OFFLINE | No heartbeat for 3h17m16.82999001s (2021-09-14 10:24:17.398801359 +0000 UTC)  |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| es-hel-phys-3       | https://es-hel-phys-3:8443       | NO       | x86_64       | default        |             | OFFLINE | No heartbeat for 3h16m52.17459563s (2021-09-14 10:24:42.052378392 +0000 UTC)  |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| hetzner-01          | https://hetzner-01:8443          | NO       | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| hetzner-03          | https://hetzner-03:8443          | YES      | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| hetzner-04          | https://hetzner-04:8443          | YES      | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| hetzner-06          | https://hetzner-06:8443          | YES      | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| hetzner-10          | https://hetzner-10:8443          | NO       | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| hetzner-11          | https://hetzner-11:8443          | YES      | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| hetzner-12          | https://hetzner-12:8443          | YES      | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| hetzner-staging     | https://hetzner-staging:8443     | YES      | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| hostkey-inference-1 | https://hostkey-inference-1:8443 | NO       | x86_64       | default        |             | OFFLINE | No heartbeat for 3h15m39.263148786s (2021-09-14 10:25:54.960991865 +0000 UTC) |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| hostkey-misc-phys-1 | https://hostkey-misc-phys-1:8443 | NO       | x86_64       | default        |             | OFFLINE | No heartbeat for 3h16m48.037695904s (2021-09-14 10:24:46.18786016 +0000 UTC)  |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| monitoring          | https://monitoring:8443          | NO       | x86_64       | default        |             | OFFLINE | No heartbeat for 3h12m37.701125737s (2021-09-14 10:28:56.516906013 +0000 UTC) |
+---------------------+----------------------------------+----------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+

tomp · September 14, 2021, 1:48pm

Thanks.

So the first thing that stands out to me as somewhat unusual is the use of hostnames for the URL of the cluster members, this means that the cluster operation is dependent on DNS.

I also note that members that are unreachable all appear to be in external locations than (what I presume) is where the leader is running in the hetzner-nn location. This is something that is not directly supported by dqlite (the underlying clustered DB system that LXD uses) and assumes a low latency between sites.

So please can you confirm that:

DNS is all correctly resolving in all sites for all names.
What the latency is between each of the sites.

tomp · September 14, 2021, 1:50pm

Also can you provide the output of:

sudo ss -tlpn | grep lxd
lxc config show

On each of the unreachable members as well as the leader. Thanks

theblazehen · September 14, 2021, 2:04pm

These are all handled in /etc/hosts, and all hosts are reflected correctly.

We’ve got around 25ms between them. Is that too high?

> for host in hetzner-06 app-hel-phys-{1..4} es-hel-phys-{1..3} hostkey-inference-1 hostkey-misc-phys-1 monitoring; do echo $host; ssh $host 'sudo ss -tlpn | grep lxd; sudo lxc config show'; echo; echo; done
hetzner-06
LISTEN 0      4096                       10.145.193.230:8443       0.0.0.0:*     users:(("lxd",pid=936721,fd=20))
LISTEN 0      32     [fe80::216:3eff:fe48:4c0c]%lxdfan0:53            [::]:*     users:(("dnsmasq",pid=936871,fd=9))
config:
  cluster.https_address: hetzner-06:8443
  cluster.max_standby: "5"
  core.https_address: hetzner-06:8443
  core.trust_password: true


app-hel-phys-1
LISTEN 0      4096                        10.145.12.137:8443       0.0.0.0:*     users:(("lxd",pid=3944833,fd=20))
LISTEN 0      4096                                    *:443              *:*     users:(("lxd",pid=912542,fd=8),("lxd",pid=912542,fd=3))
LISTEN 0      4096                                    *:80               *:*     users:(("lxd",pid=912222,fd=8),("lxd",pid=912222,fd=3))
LISTEN 0      32     [fe80::216:3eff:fe8a:7173]%lxdfan0:53            [::]:*     users:(("dnsmasq",pid=3945192,fd=9))
config:
  cluster.https_address: app-hel-phys-1:8443
  cluster.max_standby: "5"
  core.https_address: app-hel-phys-1:8443
  core.trust_password: true


app-hel-phys-2
LISTEN 0      4096                        10.145.8.216:8443        0.0.0.0:*     users:(("lxd",pid=3831922,fd=20))
LISTEN 0      4096                                   *:80                *:*     users:(("lxd",pid=855259,fd=8),("lxd",pid=855259,fd=3))
LISTEN 0      32     [fe80::216:3eff:fef0:42f]%lxdfan0:53             [::]:*     users:(("dnsmasq",pid=3832468,fd=9))
LISTEN 0      4096                                   *:443               *:*     users:(("lxd",pid=855532,fd=8),("lxd",pid=855532,fd=3))
config:
  cluster.https_address: app-hel-phys-2:8443
  cluster.max_standby: "5"
  core.https_address: app-hel-phys-2:8443
  core.trust_password: true


app-hel-phys-3
LISTEN 0      4096                         10.145.10.19:8443       0.0.0.0:*     users:(("lxd",pid=3724608,fd=20))
LISTEN 0      4096                                    *:80               *:*     users:(("lxd",pid=410646,fd=8),("lxd",pid=410646,fd=3))
LISTEN 0      32     [fe80::216:3eff:fea2:26ca]%lxdfan0:53            [::]:*     users:(("dnsmasq",pid=3724835,fd=9))
LISTEN 0      4096                                    *:443              *:*     users:(("lxd",pid=410700,fd=8),("lxd",pid=410700,fd=3))
config:
  cluster.https_address: app-hel-phys-3:8443
  cluster.max_standby: "5"
  core.https_address: app-hel-phys-3:8443
  core.trust_password: true


app-hel-phys-4
LISTEN 0      4096                         10.145.16.58:8443       0.0.0.0:*     users:(("lxd",pid=978064,fd=20))
LISTEN 0      4096                                    *:587              *:*     users:(("lxd",pid=764190,fd=8),("lxd",pid=764190,fd=3))
LISTEN 0      4096                                    *:80               *:*     users:(("lxd",pid=359800,fd=8),("lxd",pid=359800,fd=3))
LISTEN 0      32     [fe80::216:3eff:fe77:f702]%lxdfan0:53            [::]:*     users:(("dnsmasq",pid=978265,fd=9))
LISTEN 0      4096                                    *:443              *:*     users:(("lxd",pid=359843,fd=8),("lxd",pid=359843,fd=3))
config:
  cluster.https_address: app-hel-phys-4:8443
  cluster.max_standby: "5"
  core.https_address: app-hel-phys-4:8443
  core.trust_password: true


es-hel-phys-1
LISTEN 0      4096                      10.145.188.122:8443        0.0.0.0:*     users:(("lxd",pid=594550,fd=20))
LISTEN 0      32     [fe80::216:3eff:fe3e:75b]%lxdfan0:53             [::]:*     users:(("dnsmasq",pid=594711,fd=9))
config:
  cluster.https_address: es-hel-phys-1:8443
  cluster.max_standby: "5"
  core.https_address: es-hel-phys-1:8443
  core.trust_password: true


es-hel-phys-2
LISTEN 0      4096                        10.145.187.63:8443       0.0.0.0:*     users:(("lxd",pid=1127772,fd=20))
LISTEN 0      32     [fe80::216:3eff:fe6d:14dd]%lxdfan0:53            [::]:*     users:(("dnsmasq",pid=1127935,fd=9))
config:
  cluster.https_address: es-hel-phys-2:8443
  cluster.max_standby: "5"
  core.https_address: es-hel-phys-2:8443
  core.trust_password: true


es-hel-phys-3
LISTEN 0      4096                         10.145.186.4:8443       0.0.0.0:*     users:(("lxd",pid=1214468,fd=20))
LISTEN 0      32     [fe80::216:3eff:fe59:bd0e]%lxdfan0:53            [::]:*     users:(("dnsmasq",pid=1214620,fd=9))
config:
  cluster.https_address: es-hel-phys-3:8443
  cluster.max_standby: "5"
  core.https_address: es-hel-phys-3:8443
  core.trust_password: true


hostkey-inference-1
LISTEN 0      4096                         10.145.35.10:8443       0.0.0.0:*     users:(("lxd",pid=2361259,fd=20))
LISTEN 0      32     [fe80::216:3eff:fe7d:59a1]%lxdfan0:53            [::]:*     users:(("dnsmasq",pid=2361484,fd=9))
config:
  cluster.https_address: hostkey-inference-1:8443
  cluster.max_standby: "5"
  core.https_address: hostkey-inference-1:8443
  core.trust_password: true


hostkey-misc-phys-1
LISTEN 0      4096                       10.145.178.114:8443       0.0.0.0:*     users:(("lxd",pid=64455,fd=20))
LISTEN 0      32     [fe80::216:3eff:fe72:6165]%lxdfan0:53            [::]:*     users:(("dnsmasq",pid=64591,fd=9))
config:
  cluster.https_address: hostkey-misc-phys-1:8443
  cluster.max_standby: "5"
  core.https_address: hostkey-misc-phys-1:8443
  core.trust_password: true


monitoring
LISTEN0      4096                         10.145.38.135:8443       0.0.0.0:*     users:(("lxd",pid=9409,fd=20))
LISTEN0      32      [fe80::216:3eff:fec7:cd11]%lxdfan0:53            [::]:*     users:(("dnsmasq",pid=10266,fd=9))
config:
  cluster.https_address: monitoring:8443
  cluster.max_standby: "5"
  core.https_address: monitoring:8443
  core.trust_password: true

tomp · September 14, 2021, 2:12pm

25ms should be OK (it was on 4.17 apparently at least).

Can you look in your logs for “Restarting heartbeat”?

The heartbeat timeout is a combination of which ever takes less time: 1s per member or the total time of cluster.offline_threshold/2 for all members.

So try setting the cluster.offline_threshold to 60s as the default is 20s, which would only allow 10s to complete a heartbeat round before it is cancelled.

lxc config set cluster.offline_threshold=60

theblazehen · September 14, 2021, 2:20pm

> grep -i 'restarting heartbeat' /var/snap/lxd/common/lxd/logs/lxd.log
t=2021-09-14T12:59:13+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T12:59:16+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T12:59:19+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T12:59:58+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T13:00:02+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T13:00:20+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T13:02:12+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T13:07:16+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T13:07:19+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T13:08:19+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T13:15:16+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T13:15:19+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T13:32:16+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T13:32:18+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T14:12:07+0000 lvl=info msg="Restarting heartbeat" member=6
t=2021-09-14T14:12:11+0000 lvl=info msg="Restarting heartbeat" member=6

Yep, I do see some of that.

Applied that change, but still no luck.

t=2021-09-14T14:18:33+0000 lvl=warn msg="Failed heartbeat" address=app-hel-phys-3:8443 err="Failed to send heartbeat request: Put \"https://app-hel-phys-3:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
t=2021-09-14T14:18:34+0000 lvl=warn msg="Failed heartbeat" address=app-hel-phys-4:8443 err="Failed to send heartbeat request: Put \"https://app-hel-phys-4:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

Is this timeout set to 1s or something?

tomp · September 14, 2021, 2:31pm

OK that’s a problem, can you increase cluster.offline_threshold to something higher, such as 300 and see if that helps.

Yes the timeout is set to 1s.

Is it possible to get access to this cluster? Because the “Restarting heartbeat” log message also suggests one or more of the members is sending member information to the leader as if it is the leader or joining.

tomp · September 14, 2021, 2:33pm

Can you see any of these in the logs?

“Triggering an out of schedule hearbeat”?

theblazehen · September 14, 2021, 2:35pm

> grep -i 'Triggering an out of schedule hearbeat' /var/snap/lxd/common/lxd/logs/lxd.log
t=2021-09-14T14:15:14+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=app-hel-phys-1:8443
t=2021-09-14T14:15:15+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=app-hel-phys-2:8443
t=2021-09-14T14:15:15+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=app-hel-phys-3:8443
t=2021-09-14T14:15:15+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=app-hel-phys-4:8443
t=2021-09-14T14:15:15+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=es-fsn-phys-1:8443
t=2021-09-14T14:15:15+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=es-fsn-phys-2:8443
t=2021-09-14T14:15:15+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=es-hel-phys-1:8443
t=2021-09-14T14:15:15+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=es-hel-phys-2:8443
t=2021-09-14T14:15:16+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=es-hel-phys-3:8443
t=2021-09-14T14:15:16+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=hetzner-01:8443
t=2021-09-14T14:15:16+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=hetzner-03:8443
t=2021-09-14T14:15:16+0000 lvl=dbug msg="Triggering an out of schedule hearbeat" address=hetzner-04:8443

Yes, can do so

tomp · September 14, 2021, 2:37pm

https://launchpad.net/~tomparrott/+sshkeys

theblazehen · September 14, 2021, 2:40pm

Would you mind going through a https://tmate.io/ session, so I can have an eye on what’s being done as well, or do you prefer direct access?

tomp · September 14, 2021, 2:42pm

Thats fine, let me know the URL/token?

tomp · September 14, 2021, 2:48pm

OK so we are seeing this:

t=2021-09-14T14:47:18+0000 lvl=dbug msg="Completed heartbeat round" address=hetzner-04:8443 duration=2m26.994117122s

So that’s good, a complete heartbeat round is taking place from the leader and taking 2m267.

And we can say that the heartbeats to the problem members are taking too long:

t=2021-09-14T14:48:40+0000 lvl=dbug msg="Sending heartbeat request" address=app-hel-phys-2:8443
t=2021-09-14T14:48:42+0000 lvl=warn msg="Failed heartbeat" address=app-hel-phys-2:8443 err="Failed to send heartbeat request: Put \"https://app-hel-phys-2:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

That is being triggered by the 1s timeout.

theblazehen · September 14, 2021, 2:53pm

> time  curl -k https://es-hel-phys-2:8443/
{"type":"sync","status":"Success","status_code":200,"operation":"","error_code":0,"error":"","metadata":["/1.0"]}
curl -k https://es-hel-phys-2:8443/  0.02s user 0.01s system 15% cpu 0.193 total

Running a direct curl from hetzner-04 (Current leader) doesn’t seem to be close to the 1s timeout though

tomp · September 14, 2021, 2:55pm

Looking on one of the problem members, I see this:

t=2021-09-14T14:51:42+0000 lvl=dbug msg="Matched trusted cert" fingerprint=65bf8a0cdcf8118bb62b09e541dd6f1cd36269804294602ce95166aa687eb1a6 subject="CN=root@hetzner-04,O=linuxcontainers.org"
t=2021-09-14T14:51:46+0000 lvl=dbug msg="Replace current raft nodes with [{NodeInfo:{ID:18 Address:hostkey-misc-phys-1:8443 Role:stand-by} Name:hostkey-misc-phys-1} {NodeInfo:{ID:26 Address:app-hel-phys-1:8443 Role:spare} Name:app-hel-phys-1} {NodeInfo:{ID:7 Address:hetzner-staging
:8443 Role:spare} Name:hetzner-staging} {NodeInfo:{ID:14 Address:hetzner-11:8443 Role:voter} Name:hetzner-11} {NodeInfo:{ID:19 Address:es-hel-phys-3:8443 Role:spare} Name:es-hel-phys-3} {NodeInfo:{ID:24 Address:es-fsn-phys-1:8443 Role:spare} Name:es-fsn-phys-1} {NodeInfo:{ID:27 Add
ress:app-hel-phys-3:8443 Role:spare} Name:app-hel-phys-3} {NodeInfo:{ID:28 Address:app-hel-phys-4:8443 Role:spare} Name:app-hel-phys-4} {NodeInfo:{ID:13 Address:hetzner-10:8443 Role:stand-by} Name:hetzner-10} {NodeInfo:{ID:12 Address:monitoring:8443 Role:spare} Name:monitoring} {No
deInfo:{ID:16 Address:hetzner-12:8443 Role:stand-by} Name:hetzner-12} {NodeInfo:{ID:20 Address:es-hel-phys-2:8443 Role:spare} Name:es-hel-phys-2} {NodeInfo:{ID:4 Address:hetzner-04:8443 Role:voter} Name:hetzner-04} {NodeInfo:{ID:6 Address:hetzner-06:8443 Role:spare} Name:hetzner-06
} {NodeInfo:{ID:1 Address:hetzner-01:8443 Role:spare} Name:hetzner-01} {NodeInfo:{ID:21 Address:es-hel-phys-1:8443 Role:stand-by} Name:es-hel-phys-1} {NodeInfo:{ID:23 Address:es-fsn-phys-2:8443 Role:stand-by} Name:es-fsn-phys-2} {NodeInfo:{ID:25 Address:app-hel-phys-2:8443 Role:spa
re} Name:app-hel-phys-2} {NodeInfo:{ID:3 Address:hetzner-03:8443 Role:voter} Name:hetzner-03} {NodeInfo:{ID:17 Address:hostkey-inference-1:8443 Role:spare} Name:hostkey-inference-1}]" 
t=2021-09-14T14:51:46+0000 lvl=warn msg="Excluding offline member from refresh" ID=26 address=app-hel-phys-1:8443 lastHeartbeat=2021-09-14T10:22:15+0000 raftID=26
t=2021-09-14T14:51:46+0000 lvl=warn msg="Excluding offline member from refresh" ID=18 address=hostkey-misc-phys-1:8443 lastHeartbeat=2021-09-14T10:24:46+0000 raftID=18
t=2021-09-14T14:51:46+0000 lvl=warn msg="Excluding offline member from refresh" ID=27 address=app-hel-phys-3:8443 lastHeartbeat=2021-09-14T10:22:40+0000 raftID=27
t=2021-09-14T14:51:46+0000 lvl=warn msg="Excluding offline member from refresh" ID=28 address=app-hel-phys-4:8443 lastHeartbeat=2021-09-14T10:23:31+0000 raftID=28
t=2021-09-14T14:51:46+0000 lvl=warn msg="Excluding offline member from refresh" ID=19 address=es-hel-phys-3:8443 lastHeartbeat=2021-09-14T10:24:42+0000 raftID=19
t=2021-09-14T14:51:46+0000 lvl=warn msg="Excluding offline member from refresh" ID=20 address=es-hel-phys-2:8443 lastHeartbeat=2021-09-14T10:24:17+0000 raftID=20
t=2021-09-14T14:51:46+0000 lvl=warn msg="Excluding offline member from refresh" ID=12 address=monitoring:8443 lastHeartbeat=2021-09-14T14:15:52+0000 raftID=12
t=2021-09-14T14:51:46+0000 lvl=warn msg="Excluding offline member from refresh" ID=25 address=app-hel-phys-2:8443 lastHeartbeat=2021-09-14T09:03:31+0000 raftID=25
t=2021-09-14T14:51:46+0000 lvl=warn msg="Excluding offline member from refresh" ID=17 address=hostkey-inference-1:8443 lastHeartbeat=2021-09-14T10:25:54+0000 raftID=17
t=2021-09-14T14:51:46+0000 lvl=warn msg="Excluding offline member from refresh" ID=21 address=es-hel-phys-1:8443 lastHeartbeat=2021-09-14T10:24:07+0000 raftID=21

So this shows the heartbeat from hetzner-04 is arriving and being handled correctly.
Interestingly though, the time between the initial connection log entry:

t=2021-09-14T14:51:42+0000 lvl=dbug msg="Matched trusted cert" fingerprint=65bf8a0cdcf8118bb62b09e541dd6f1cd36269804294602ce95166aa687eb1a6 subject="CN=root@hetzner-04,O=linuxcontainers.org"

and the heartbeat handler log:

t=2021-09-14T14:51:46+0000 lvl=dbug msg="Replace current raft nodes with...

Is greater than 2s. This delay may well be the issue, I need to check if that log is happening after the heartbeat response has been sent back.