Failed to send heartbeat request

I can’t seem to figure out why my 0-r720xd node keeps failing to maintain a healthy state. Every ~3-5 min it goes offline and then comes back online as if nothing had happened.

root@nuc-server-2:~# lxc cluster list; echo; tail -n 20 /var/snap/lxd/common/lxd/logs/lxd.log
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
|     NAME     |            URL             |      ROLES       | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION |  STATE  |                                 MESSAGE                                  |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| 0-r720xd     | https://192.168.98.10:8443 |                  | x86_64       | default        |             | OFFLINE | No heartbeat for 36.743479482s (2023-05-05 05:56:57.814448343 +0000 UTC) |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| dellt30      | https://192.168.98.18:8443 | database-standby | x86_64       | default        |             | ONLINE  | Fully operational                                                        |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| nuc-server-1 | https://192.168.98.20:8443 |                  | x86_64       | default        |             | ONLINE  | Fully operational                                                        |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| nuc-server-2 | https://192.168.98.22:8443 | database-leader  | x86_64       | default        |             | ONLINE  | Fully operational                                                        |
|              |                            | database         |              |                |             |         |                                                                          |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| nuc-server-3 | https://192.168.98.24:8443 | database-standby | x86_64       | default        |             | ONLINE  | Fully operational                                                        |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| nuc-server-4 | https://192.168.98.26:8443 | database         | x86_64       | default        |             | ONLINE  | Fully operational                                                        |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| nuc-server-5 | https://192.168.98.28:8443 | database         | x86_64       | default        |             | ONLINE  | Fully operational                                                        |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| p700         | https://192.168.98.16:8443 |                  | x86_64       | default        |             | ONLINE  | Fully operational                                                        |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+

time="2023-05-05T05:44:47Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp 192.168.98.22:8443->192.168.98.10:33974: read: connection timed out" local="192.168.98.22:8443" name=dqlite remote="192.168.98.10:33974"
time="2023-05-05T05:44:50Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:44:59Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:45:06Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:45:29Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:45:38Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:45:46Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:45:55Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:46:06Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:46:20Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:47:54Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": dial tcp 192.168.98.10:8443: connect: connection refused" remote="192.168.98.10:8443"
time="2023-05-05T05:51:26Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:51:39Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": dial tcp 192.168.98.10:8443: i/o timeout (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:51:49Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:51:53Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp 192.168.98.22:8443->192.168.98.10:60510: read: connection timed out" local="192.168.98.22:8443" name=dqlite remote="192.168.98.10:60510"
time="2023-05-05T05:51:55Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:57:04Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:57:19Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": dial tcp 192.168.98.10:8443: i/o timeout (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:57:27Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:57:31Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp 192.168.98.22:8443->192.168.98.10:34862: read: connection timed out" local="192.168.98.22:8443" name=dqlite remote="192.168.98.10:34862"
root@nuc-server-2:~# curl -ks https://192.168.98.10:8443/ | jq .
{
  "type": "sync",
  "status": "Success",
  "status_code": 200,
  "operation": "",
  "error_code": 0,
  "error": "",
  "metadata": [
    "/1.0"
  ]
}

Nothing seems to indicate there is a network or physical layer issue on the unhealthy system.

How can I troubleshoot this?

Hi,
Have you looked at /var/snap/lxd/common/lxd/logs/lxd.log on the other cluster servers? There may be something wrong in their log files. You may want to check timedatectl on each server as well.
And what version is each cluster member running?
Regards.
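
A rough sketch of both checks, in case it helps. The function names and the SSH access to each member are assumptions of mine, not something LXD provides; the default log path is the one from the first post.

```shell
# Count "Failed heartbeat" warnings in an LXD log, grouped by error kind.
# Hypothetical helper; defaults to the snap log path used earlier in the thread.
summarize_heartbeat_failures() {
    awk '
        /msg="Failed heartbeat"/ {
            kind = "other"
            if ($0 ~ /connection refused/)             kind = "connection refused"
            else if ($0 ~ /i\/o timeout/)              kind = "i/o timeout"
            else if ($0 ~ /context deadline exceeded/) kind = "deadline exceeded"
            count[kind]++
        }
        END { for (k in count) printf "%d\t%s\n", count[k], k }
    ' "${1:-/var/snap/lxd/common/lxd/logs/lxd.log}" | sort -rn
}

# Print the NTP sync state ("yes" or "no") of each host given as an argument.
# Assumes you can SSH to every member; the hostnames are whatever you pass in.
check_time_sync() {
    for host in "$@"; do
        printf '%s: ' "$host"
        ssh "$host" timedatectl show --property=NTPSynchronized --value
    done
}
```

Running summarize_heartbeat_failures on the leader shows whether the failures are mostly timeouts or refused connections, and something like check_time_sync 0-r720xd dellt30 prints one line per host.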

What LXD version is this?

It’s 5.12-c63881f

I managed to figure it out. Apparently netplan had some recent updates that deprecated the way it handled default routes. I have multiple interfaces on the box, so this was causing the interfaces to go down intermittently.

So, I scrubbed my netplan config and got it working.
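
For anyone hitting the same symptom, here is a sketch of how one might check for this. It assumes the deprecation in question is the gateway4/gateway6 keys, which netplan replaced with default routes under routes:; the post above does not say exactly which setting was involved. The function name and the example gateway address are made up for illustration.

```shell
# List netplan files that still use the deprecated gateway4/gateway6 keys.
# Hypothetical helper; searches /etc/netplan unless another directory is given.
find_deprecated_gateways() {
    grep -rnE '^[[:space:]]*gateway[46]:' "${1:-/etc/netplan}"
}

# The replacement form puts the default route under "routes:", for example:
#   routes:
#     - to: default
#       via: 192.168.98.1   # example gateway, not taken from this thread
#
# On a box with several interfaces, more than one default route can cause
# intermittent connectivity. To see what the kernel currently has:
#   ip route show default
#
# "netplan try" applies the config and rolls back unless you confirm it:
#   netplan try
```

An empty result from find_deprecated_gateways does not rule out a routing problem; it only means those two keys are not in use.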
