Greetings!
I need help diagnosing an LXD cluster problem. The cluster randomly becomes inoperable. The cause appears to be one node freezing, which blocks the whole cluster; it is not tied to any specific node.
The cluster contains four hardware nodes. All nodes are on a local 10 Mbit network segment and run openSUSE Leap 15.2 with LXD 4.17. All nodes use dir as the local storage pool backend. The filesystem is ext4 on one node and Btrfs on the others. There are no hardware problems on the nodes.
Right now the cluster is inoperable. Output of systemctl status lxd:
One node:

Nov 22 09:02:28 node202 lxd[8467]: t=2021-11-22T09:02:28+0300 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=1h2m12.904521676s interval=10s
Nov 22 09:02:56 node202 lxd[8467]: t=2021-11-22T09:02:56+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:02:57 node202 lxd[8467]: t=2021-11-22T09:02:57+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:02:58 node202 lxd[8467]: t=2021-11-22T09:02:58+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:02:59 node202 lxd[8467]: t=2021-11-22T09:02:59+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:03:00 node202 lxd[8467]: t=2021-11-22T09:03:00+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:03:01 node202 lxd[8467]: t=2021-11-22T09:03:01+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:03:02 node202 lxd[8467]: t=2021-11-22T09:03:02+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:03:04 node202 lxd[8467]: t=2021-11-22T09:03:04+0300 lvl=warn msg="Failed heartbeat" address=10.59.0.200:8443 err="Failed to send heartbeat request: Put \"https://10.59.0.200:8443/internal/database\": dia>
Nov 22 09:03:24 node202 lxd[8467]: t=2021-11-22T09:03:24+0300 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=25.274550855s interval=10s
All other nodes:

Nov 22 11:33:44 node201 nice[7528]: t=2021-11-22T11:33:44+0300 lvl=eror msg="Unable to retrieve the list of expired custom volume snapshots" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:33:45 node201 nice[7528]: t=2021-11-22T11:33:45+0300 lvl=eror msg="Failed getting volumes for auto custom volume snapshot task" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:33:45 node201 nice[7528]: t=2021-11-22T11:33:45+0300 lvl=eror msg="Failed to load instances for snapshot expiry" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:33:55 node201 nice[7528]: t=2021-11-22T11:33:55+0300 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:34:20 node201 nice[7528]: t=2021-11-22T11:34:20+0300 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:34:44 node201 nice[7528]: t=2021-11-22T11:34:44+0300 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:34:44 node201 nice[7528]: t=2021-11-22T11:34:44+0300 lvl=eror msg="Failed getting volumes for auto custom volume snapshot task" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:34:46 node201 nice[7528]: t=2021-11-22T11:34:46+0300 lvl=eror msg="Failed to load instances for snapshot expiry" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:34:46 node201 nice[7528]: t=2021-11-22T11:34:46+0300 lvl=eror msg="Unable to retrieve the list of expired custom volume snapshots" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:35:10 node201 nice[7528]: t=2021-11-22T11:35:10+0300 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Restarting LXD on node 10.59.0.200 solves the problem.
Another error I have seen in a similar situation is "Failed to create operation: database is locked". Restarting the appropriate node also solved the problem.
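For reference, the workaround I use is to spot the member the other nodes complain about (the address in the "Failed to get events from member" warnings) and restart LXD there. A rough sketch; the service name and address are from my setup and may differ on yours:

```shell
# Find which member the local node cannot reach
journalctl -u lxd --since "-10 min" | grep -E 'Failed (heartbeat|to get events)'

# Restart LXD on the stuck member (10.59.0.200 in my case)
ssh 10.59.0.200 sudo systemctl restart lxd

# Verify all members report ONLINE again
lxc cluster list
```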
Please can you try upgrading to LXD 4.20, as there were some fixes added then related to DB timeouts that could cause a lock to be held for a long time when a cluster member became unreachable.
Well, I built LXD 4.20 from source and updated it on all nodes. Then I completely removed /var/lib/lxd, restarted LXD everywhere and rebuilt the cluster.
Everything looked fine:
But when I started using the new cluster, the situation described above repeated.
systemctl status lxd
...
Nov 24 14:38:38 klg-wg-server lxd[32332]: t=2021-11-24T14:38:38+0200 lvl=eror msg="Failed to load instances for snapshot expiry" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:38:44 klg-wg-server lxd[32332]: t=2021-11-24T14:38:44+0200 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:38:47 klg-wg-server lxd[32332]: t=2021-11-24T14:38:47+0200 lvl=warn msg="Failed getting raft nodes" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:10 klg-wg-server lxd[32332]: t=2021-11-24T14:39:10+0200 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:12 klg-wg-server lxd[32332]: t=2021-11-24T14:39:12+0200 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:35 klg-wg-server lxd[32332]: t=2021-11-24T14:39:35+0200 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:36 klg-wg-server lxd[32332]: t=2021-11-24T14:39:36+0200 lvl=warn msg="Failed getting raft nodes" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:37 klg-wg-server lxd[32332]: t=2021-11-24T14:39:37+0200 lvl=eror msg="Failed getting volumes for auto custom volume snapshot task" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:38 klg-wg-server lxd[32332]: t=2021-11-24T14:39:38+0200 lvl=eror msg="Unable to retrieve the list of expired custom volume snapshots" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:40 klg-wg-server lxd[32332]: t=2021-11-24T14:39:40+0200 lvl=eror msg="Failed to load instances for snapshot expiry" err="failed to begin transaction: checkpoint in progress"
The cluster is inoperable again. The LXD logs contain records about offline members:
...
t=2021-11-24T14:51:13+0300 lvl=warn msg="Excluding offline member from refresh" ID=2 address=10.59.0.200:8443 lastHeartbeat=2021-11-24T13:50:45+0200 raftID=2
t=2021-11-24T14:51:13+0300 lvl=warn msg="Excluding offline member from refresh" ID=3 address=10.59.0.201:8443 lastHeartbeat=2021-11-24T13:50:38+0200 raftID=3
t=2021-11-24T14:52:39+0300 lvl=warn msg="Excluding offline member from refresh" ID=4 address=10.59.0.202:8443 lastHeartbeat=2021-11-24T13:52:11+0200 raftID=4
t=2021-11-24T14:52:39+0300 lvl=warn msg="Excluding offline member from refresh" ID=2 address=10.59.0.200:8443 lastHeartbeat=2021-11-24T13:52:10+0200 raftID=2
t=2021-11-24T14:52:39+0300 lvl=warn msg="Excluding offline member from refresh" ID=3 address=10.59.0.201:8443 lastHeartbeat=2021-11-24T13:51:57+0200 raftID=3
t=2021-11-24T15:00:13+0300 lvl=warn msg="Excluding offline member from refresh" ID=3 address=10.59.0.201:8443 lastHeartbeat=2021-11-24T13:59:44+0200 raftID=3
These nodes are physically online and the LXD daemon is running on them.
It's a SQLite checkpoint. A checkpoint moves committed transactions from the WAL into the main database; that would normally involve writing to disk, but in dqlite's case both are in-memory (see Write-Ahead Logging).
@solo Can you start LXD with the environment variables LIBRAFT_TRACE=1 and LIBDQLITE_TRACE=1 on the problematic node? This will write a bunch of tracing to stderr, which might help me identify the issue.
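If the daemon is managed by systemd, one way to set those variables is a drop-in unit (the file path below is the usual drop-in convention, not something from this thread; adjust the unit name to your install):

```
# /etc/systemd/system/lxd.service.d/trace.conf  (hypothetical drop-in)
[Service]
Environment=LIBRAFT_TRACE=1
Environment=LIBDQLITE_TRACE=1
```

Then run systemctl daemon-reload followed by systemctl restart lxd; the trace output will end up in the journal along with the daemon's stderr.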