Cluster blocked due to node error (?)

Greetings!
I need help diagnosing an LXD cluster problem. The cluster randomly becomes inoperable. It seems that one of the nodes freezes, which blocks the whole cluster, and the problem is not tied to any specific node.
The cluster contains four hardware nodes. All nodes are on a local 10 Mbit network segment and run openSUSE Leap 15.2 with LXD 4.17. All nodes use dir as the local storage pool backend. The filesystem is ext4 on one node and Btrfs on all the others. There are no hardware problems on the nodes.

Right now the cluster is inoperable. Output of systemctl status lxd:
On one node:
Nov 22 09:02:28 node202 lxd[8467]: t=2021-11-22T09:02:28+0300 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=1h2m12.904521676s interval=10s
Nov 22 09:02:56 node202 lxd[8467]: t=2021-11-22T09:02:56+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:02:57 node202 lxd[8467]: t=2021-11-22T09:02:57+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:02:58 node202 lxd[8467]: t=2021-11-22T09:02:58+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:02:59 node202 lxd[8467]: t=2021-11-22T09:02:59+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:03:00 node202 lxd[8467]: t=2021-11-22T09:03:00+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:03:01 node202 lxd[8467]: t=2021-11-22T09:03:01+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:03:02 node202 lxd[8467]: t=2021-11-22T09:03:02+0300 lvl=warn msg="Failed to get events from member" address=10.59.0.200:8443 err="Unable to connect to: 10.59.0.200:8443"
Nov 22 09:03:04 node202 lxd[8467]: t=2021-11-22T09:03:04+0300 lvl=warn msg="Failed heartbeat" address=10.59.0.200:8443 err="Failed to send heartbeat request: Put \"https://10.59.0.200:8443/internal/database\": dia>
Nov 22 09:03:24 node202 lxd[8467]: t=2021-11-22T09:03:24+0300 lvl=warn msg="Heartbeat round duration greater than heartbeat interval" duration=25.274550855s interval=10s

On all other nodes:
Nov 22 11:33:44 node201 nice[7528]: t=2021-11-22T11:33:44+0300 lvl=eror msg="Unable to retrieve the list of expired custom volume snapshots" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:33:45 node201 nice[7528]: t=2021-11-22T11:33:45+0300 lvl=eror msg="Failed getting volumes for auto custom volume snapshot task" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:33:45 node201 nice[7528]: t=2021-11-22T11:33:45+0300 lvl=eror msg="Failed to load instances for snapshot expiry" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:33:55 node201 nice[7528]: t=2021-11-22T11:33:55+0300 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:34:20 node201 nice[7528]: t=2021-11-22T11:34:20+0300 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:34:44 node201 nice[7528]: t=2021-11-22T11:34:44+0300 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:34:44 node201 nice[7528]: t=2021-11-22T11:34:44+0300 lvl=eror msg="Failed getting volumes for auto custom volume snapshot task" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:34:46 node201 nice[7528]: t=2021-11-22T11:34:46+0300 lvl=eror msg="Failed to load instances for snapshot expiry" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:34:46 node201 nice[7528]: t=2021-11-22T11:34:46+0300 lvl=eror msg="Unable to retrieve the list of expired custom volume snapshots" err="failed to begin transaction: checkpoint in progress"
Nov 22 11:35:10 node201 nice[7528]: t=2021-11-22T11:35:10+0300 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"

Restarting lxd on node 10.59.0.200 solves the problem.
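In this setup (LXD runs as a regular systemd service, not a snap) that simply means restarting the unit on the stuck member and checking the cluster afterwards, e.g.:

  # on the member that stopped responding (10.59.0.200 here)
  systemctl restart lxd
  systemctl status lxd     # confirm the daemon came back up
  lxc cluster list         # from any member: everyone should show ONLINE again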

Another error I have seen in a similar situation is “Failed to create operation: database is locked”. Restarting the appropriate node also solved the problem.

Please can you try upgrading to LXD 4.20? Some fixes were added in that release related to DB timeouts that could cause a lock to be held for a long time when a cluster member became unreachable.

OK, I will try LXD 4.20.

This will hopefully at least point to a clearer error about what is going wrong.

Well, I built LXD 4.20 from source and updated it on all nodes. Then I completely removed /var/lib/lxd, restarted lxd everywhere and rebuilt the cluster.
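For completeness, the from-source build was along these lines (a sketch following the upstream build instructions of the time; the exact exports are whatever make deps prints on the build host):

  git clone -b lxd-4.20 https://github.com/lxc/lxd && cd lxd
  make deps     # builds the bundled raft and dqlite, prints CGO_CFLAGS/CGO_LDFLAGS/LD_LIBRARY_PATH exports
  # paste the export lines printed by make deps, then:
  make          # installs the lxd and lxc binaries into $(go env GOPATH)/bin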
Everything looked fine:

lxc cluster list
+---------------+--------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| NAME          | URL                      | ROLES            | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  | MESSAGE           |
+---------------+--------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| klg-wg-server | https://10.59.0.1:8443   | database         | x86_64       | default        |             | ONLINE | Fully operational |
+---------------+--------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| node200       | https://10.59.0.200:8443 | database         | x86_64       | default        |             | ONLINE | Fully operational |
+---------------+--------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| node201       | https://10.59.0.201:8443 | database         | x86_64       | default        |             | ONLINE | Fully operational |
+---------------+--------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| node202       | https://10.59.0.202:8443 | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+---------------+--------------------------+------------------+--------------+----------------+-------------+--------+-------------------+

But when I tried to actually use the new cluster, the situation described above repeated itself.

systemctl status lxd
...
Nov 24 14:38:38 klg-wg-server lxd[32332]: t=2021-11-24T14:38:38+0200 lvl=eror msg="Failed to load instances for snapshot expiry" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:38:44 klg-wg-server lxd[32332]: t=2021-11-24T14:38:44+0200 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:38:47 klg-wg-server lxd[32332]: t=2021-11-24T14:38:47+0200 lvl=warn msg="Failed getting raft nodes" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:10 klg-wg-server lxd[32332]: t=2021-11-24T14:39:10+0200 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:12 klg-wg-server lxd[32332]: t=2021-11-24T14:39:12+0200 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:35 klg-wg-server lxd[32332]: t=2021-11-24T14:39:35+0200 lvl=warn msg="Failed to get current cluster members" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:36 klg-wg-server lxd[32332]: t=2021-11-24T14:39:36+0200 lvl=warn msg="Failed getting raft nodes" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:37 klg-wg-server lxd[32332]: t=2021-11-24T14:39:37+0200 lvl=eror msg="Failed getting volumes for auto custom volume snapshot task" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:38 klg-wg-server lxd[32332]: t=2021-11-24T14:39:38+0200 lvl=eror msg="Unable to retrieve the list of expired custom volume snapshots" err="failed to begin transaction: checkpoint in progress"
Nov 24 14:39:40 klg-wg-server lxd[32332]: t=2021-11-24T14:39:40+0200 lvl=eror msg="Failed to load instances for snapshot expiry" err="failed to begin transaction: checkpoint in progress"

The cluster is inoperable again. The lxd log contains records about offline members:

...
t=2021-11-24T14:51:13+0300 lvl=warn msg="Excluding offline member from refresh" ID=2 address=10.59.0.200:8443 lastHeartbeat=2021-11-24T13:50:45+0200 raftID=2
t=2021-11-24T14:51:13+0300 lvl=warn msg="Excluding offline member from refresh" ID=3 address=10.59.0.201:8443 lastHeartbeat=2021-11-24T13:50:38+0200 raftID=3
t=2021-11-24T14:52:39+0300 lvl=warn msg="Excluding offline member from refresh" ID=4 address=10.59.0.202:8443 lastHeartbeat=2021-11-24T13:52:11+0200 raftID=4
t=2021-11-24T14:52:39+0300 lvl=warn msg="Excluding offline member from refresh" ID=2 address=10.59.0.200:8443 lastHeartbeat=2021-11-24T13:52:10+0200 raftID=2
t=2021-11-24T14:52:39+0300 lvl=warn msg="Excluding offline member from refresh" ID=3 address=10.59.0.201:8443 lastHeartbeat=2021-11-24T13:51:57+0200 raftID=3
t=2021-11-24T15:00:13+0300 lvl=warn msg="Excluding offline member from refresh" ID=3 address=10.59.0.201:8443 lastHeartbeat=2021-11-24T13:59:44+0200 raftID=3

These nodes are physically online and the lxd daemon is running on them.

What constitutes “using the cluster”?

What I’m really looking for here is reproducer steps.

@mbordere is a 10Mbps LAN sufficient to run dqlite?

What is the latency between each cluster member? 10Mbps bandwidth suggests something other than a normal 1Gbps ethernet LAN.

Also, how many instances do you have in the cluster?

The cluster is used for integration builds; the main scenario is:

  • create containers;
  • build software packages in parallel;
  • pull the build results;
  • stop and destroy the containers.

Builds are automated and controlled by a Jenkins server, roughly as sketched below.
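Each job drives LXD along these lines (a simplified sketch; the image alias, container name and paths are hypothetical, not taken from the actual jobs):

  # create a build container; the cluster places it on one of the members
  lxc launch images:opensuse/15.2 build-job-42
  # run the package build inside it
  lxc exec build-job-42 -- sh -c "cd /src && make package"
  # pull the build results back to the Jenkins workspace
  lxc file pull -r build-job-42/root/results ./artifacts/
  # stop and destroy the container
  lxc delete --force build-job-42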

Right now I use four instances, one per cluster node.

Network latency: rtt min/avg/max/mdev = 0.125/0.138/0.160/0.012 ms
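(That is the usual ping summary line; between any two members it can be reproduced with something like:)

  ping -c 10 10.59.0.201    # run from another member; the final line prints rtt min/avg/max/mdev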

And is the 10Mbps bandwidth being saturated?

Oops, sorry, 100Mbit of course…
It is a local network segment used by a small development team; its saturation is insignificant.
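(A rough way to confirm that is to sample the interface byte counters on a member over a short window; the interface name here is just an example:)

  ip -s link show eth0; sleep 10; ip -s link show eth0
  # delta of RX/TX bytes * 8 / 10 gives the average bit rate over those 10 seconds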

I would expect it to be sufficient; I haven’t really tested these kinds of scenarios, though.

It sounds then like your disk I/O is too slow and dqlite is getting stuck trying to perform a checkpoint.
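If slow storage is indeed the suspect, one way to get a feel for it is an fsync-heavy fio run on the disk holding /var/lib/lxd (a sketch; the job name and directory are examples, any scratch directory on the same disk works):

  # latency of small synced writes is what matters for a Raft/dqlite-style log
  fio --name=synctest --directory=/var/lib/lxd --ioengine=sync --rw=write \
      --bs=4k --size=64m --fdatasync=1 --runtime=30 --time_based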

It’s a SQLite checkpoint, which normally would not even write to disk here: a checkpoint moves transactions from the WAL into the main database, but both of those are in-memory in dqlite’s case (see Write-Ahead Logging).
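To illustrate what a checkpoint is in stock, on-disk SQLite (purely an illustration; dqlite keeps the equivalent structures in memory):

  sqlite3 demo.db "PRAGMA journal_mode=WAL;"                     # put demo.db into WAL mode
  sqlite3 demo.db "CREATE TABLE t(x); INSERT INTO t VALUES(1);"  # the write lands in demo.db-wal first
  sqlite3 demo.db "PRAGMA wal_checkpoint(TRUNCATE);"             # copy WAL frames into demo.db and reset the WAL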

Any ideas what can cause it to block then?

@solo Can you start lxd with the environment variables LIBRAFT_TRACE=1 and LIBDQLITE_TRACE=1 on the problematic node? This will write a bunch of tracing to stderr, which might help me identify the issue.
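For a non-snap, systemd-managed install like the one in this thread, two possible ways to do that (a sketch, not the only options) are a drop-in for the unit, or running the daemon in the foreground so the trace goes straight to a file:

  # option 1: systemd drop-in, then restart and follow the journal
  systemctl edit lxd        # add: [Service]
                            #      Environment=LIBRAFT_TRACE=1
                            #      Environment=LIBDQLITE_TRACE=1
  systemctl restart lxd
  journalctl -u lxd -f

  # option 2: stop the unit and run lxd in the foreground
  systemctl stop lxd
  LIBRAFT_TRACE=1 LIBDQLITE_TRACE=1 lxd --debug 2>/tmp/lxd-trace.log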

Maybe; the node hardware is 5-10 years old, but serviceable SATA HDDs are used.

How is that done with the snap, @mbordere?

Should I start lxd in the foreground? Or does it write to a log?