I can't figure out why my 0-r720xd node keeps failing to stay healthy. Every 3-5 minutes it goes offline and then comes back online as if nothing had happened.
root@nuc-server-2:~# lxc cluster list; echo; tail -n 20 /var/snap/lxd/common/lxd/logs/lxd.log
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| NAME | URL | ROLES | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE | MESSAGE |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| 0-r720xd | https://192.168.98.10:8443 | | x86_64 | default | | OFFLINE | No heartbeat for 36.743479482s (2023-05-05 05:56:57.814448343 +0000 UTC) |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| dellt30 | https://192.168.98.18:8443 | database-standby | x86_64 | default | | ONLINE | Fully operational |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| nuc-server-1 | https://192.168.98.20:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| nuc-server-2 | https://192.168.98.22:8443 | database-leader | x86_64 | default | | ONLINE | Fully operational |
| | | database | | | | | |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| nuc-server-3 | https://192.168.98.24:8443 | database-standby | x86_64 | default | | ONLINE | Fully operational |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| nuc-server-4 | https://192.168.98.26:8443 | database | x86_64 | default | | ONLINE | Fully operational |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| nuc-server-5 | https://192.168.98.28:8443 | database | x86_64 | default | | ONLINE | Fully operational |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
| p700 | https://192.168.98.16:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------+----------------------------+------------------+--------------+----------------+-------------+---------+--------------------------------------------------------------------------+
time="2023-05-05T05:44:47Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp 192.168.98.22:8443->192.168.98.10:33974: read: connection timed out" local="192.168.98.22:8443" name=dqlite remote="192.168.98.10:33974"
time="2023-05-05T05:44:50Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:44:59Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:45:06Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:45:29Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:45:38Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:45:46Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:45:55Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:46:06Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:46:20Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:47:54Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": dial tcp 192.168.98.10:8443: connect: connection refused" remote="192.168.98.10:8443"
time="2023-05-05T05:51:26Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:51:39Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": dial tcp 192.168.98.10:8443: i/o timeout (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:51:49Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:51:53Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp 192.168.98.22:8443->192.168.98.10:60510: read: connection timed out" local="192.168.98.22:8443" name=dqlite remote="192.168.98.10:60510"
time="2023-05-05T05:51:55Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:57:04Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:57:19Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": dial tcp 192.168.98.10:8443: i/o timeout (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:57:27Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://192.168.98.10:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="192.168.98.10:8443"
time="2023-05-05T05:57:31Z" level=warning msg="Dqlite proxy failed" err="first: remote -> local: read tcp 192.168.98.22:8443->192.168.98.10:34862: read: connection timed out" local="192.168.98.22:8443" name=dqlite remote="192.168.98.10:34862"
root@nuc-server-2:~# curl -ks https://192.168.98.10:8443/ | jq .
{
  "type": "sync",
  "status": "Success",
  "status_code": 200,
  "operation": "",
  "error_code": 0,
  "error": "",
  "metadata": [
    "/1.0"
  ]
}
Nothing seems to indicate a network or physical-layer issue on the unhealthy node, and it answers a plain API request from the leader (the curl above) without trouble.
How can I troubleshoot this?
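As a first step, the heartbeat warnings in the log are not all the same: most are "context deadline exceeded", a couple are "dial tcp ... i/o timeout", and one is "connection refused" (which might mean the LXD daemon on 0-r720xd was not listening at that moment, though that is only a guess). Below is a minimal sketch for tallying them by type. The function name `tally_heartbeat_failures` is made up for illustration, and the three patterns are just the error strings that appear in the log above, not an exhaustive list.

```shell
# Tally "Failed heartbeat" warnings from lxd.log by failure mode.
# Reads log lines on stdin. The patterns below are only the error
# strings seen in the pasted log; anything else is counted as "other".
tally_heartbeat_failures() {
    awk '
        /msg="Failed heartbeat"/ {
            if (index($0, "connection refused"))             refused++
            else if (index($0, "i/o timeout"))               dial_timeout++
            else if (index($0, "context deadline exceeded")) deadline++
            else                                             other++
        }
        END {
            printf "deadline_exceeded=%d\n", deadline
            printf "dial_timeout=%d\n", dial_timeout
            printf "connection_refused=%d\n", refused
            printf "other=%d\n", other
        }
    '
}
```

It can be fed the whole log instead of just the tail, for example `tally_heartbeat_failures < /var/snap/lxd/common/lxd/logs/lxd.log`. For the 20 lines pasted above it would count 14 deadline-exceeded, 2 dial timeouts, and 1 connection refused; the three "Dqlite proxy failed" lines are ignored.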