RAFT cluster is unavailable after upgrade to 4.0.9

On one of our clusters, after upgrading from 4.0.8 to 4.0.9, all but one node report clustering issues:

root@node9:~# lxc cluster list
Error: RAFT cluster is unavailable
root@node1:~# lxc cluster list
+--------+------------------------+----------+--------+-------------------+--------------+
|  NAME  |          URL           | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+--------+------------------------+----------+--------+-------------------+--------------+
| node9  | https://10.0.4.33:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node7  | https://10.0.4.32:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node6  | https://10.0.4.34:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node8  | https://10.0.4.36:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node3  | https://10.0.4.29:8443 | NO       | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node1  | https://10.0.4.35:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+

The cluster itself is still working, though, and the SQL output is identical on all nodes:

root@node9:~# lxd sql local "SELECT * FROM raft_nodes"
+----+----------------+------+--------+
| id |    address     | role |  name  |
+----+----------------+------+--------+
| 1  | 10.0.4.35:8443 | 0    | node1  |
| 3  | 10.0.4.29:8443 | 2    | node3  |
| 6  | 10.0.4.34:8443 | 1    | node6  |
| 7  | 10.0.4.32:8443 | 1    | node7  |
| 8  | 10.0.4.36:8443 | 0    | node8  |
| 9  | 10.0.4.33:8443 | 0    | node9  |
+----+----------------+------+--------+
root@node9:~# lxd sql global "SELECT * FROM nodes"
+----+--------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
| id |  name  | description |    address     | schema | api_extensions |           heartbeat            | pending | arch |
+----+--------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
| 1  | node1  |             | 10.0.4.35:8443 | 30     | 220            | 2022-03-08T14:36:38.438963893Z | 0       | 2    |
| 3  | node3  |             | 10.0.4.29:8443 | 30     | 220            | 2022-03-08T14:36:36.873354791Z | 0       | 2    |
| 6  | node6  |             | 10.0.4.34:8443 | 30     | 220            | 2022-03-08T14:36:34.439250471Z | 0       | 2    |
| 7  | node7  |             | 10.0.4.32:8443 | 30     | 220            | 2022-03-08T14:36:40.236046313Z | 0       | 2    |
| 8  | node8  |             | 10.0.4.36:8443 | 30     | 220            | 2022-03-08T14:36:36.557923058Z | 0       | 2    |
| 9  | node9  |             | 10.0.4.33:8443 | 30     | 220            | 2022-03-08T14:36:34.090774806Z | 0       | 2    |
+----+--------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
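For readability, the numeric role column can be decoded; as far as I know, go-dqlite uses 0 = voter, 1 = stand-by, 2 = spare (treat that mapping as an assumption for your exact version). A minimal helper:

```shell
# Decode a dqlite raft role number into a name.
# Assumed mapping (go-dqlite): 0 = voter, 1 = stand-by, 2 = spare.
raft_role() {
  case "$1" in
    0) echo voter ;;
    1) echo stand-by ;;
    2) echo spare ;;
    *) echo unknown ;;
  esac
}

raft_role 2   # node3's role
```

Under that mapping node3 would be a spare, which is consistent with its DATABASE = NO entry in the cluster list above.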

systemctl reload snap.lxd.daemon doesn’t help, and the log doesn’t show anything interesting apart from:

t=2022-03-08T14:22:35+0000 lvl=eror msg="Failed to get leader node address" err="RAFT cluster is unavailable"

Any ideas?

The output above all looks good. Can you check lxd.log on all servers for anything weird-looking?
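A quick sketch of one way to do that (`scan_log` is just a hypothetical helper name; the path is the snap default):

```shell
# Print warning/error lines from an LXD log file.
# LXD abbreviates log levels to four characters, hence "eror".
scan_log() {
  grep -E 'lvl=(warn|eror)' "$1"
}

# e.g., on each server:  scan_log /var/snap/lxd/common/lxd/logs/lxd.log
```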

It may also be good to run kill -9 $(cat /var/snap/lxd/common/lxd/lxd.pid) on all servers to get a clean state and see if things still misbehave then.

lxd.pid?

yes, fixed :slight_smile:

kill -9 $(cat /var/snap/lxd/common/lxd.pid) :wink:
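To apply that on every server in one go, a hedged sketch, assuming root SSH access and the node addresses from the cluster list above (it only prints the commands; run them once they look right):

```shell
# Print the hard-restart command for one node (hypothetical helper;
# the pid file path is the LXD snap default).
restart_cmd() {
  printf "ssh root@%s 'kill -9 \$(cat /var/snap/lxd/common/lxd.pid)'\n" "$1"
}

# Addresses taken from the cluster list above.
for n in 10.0.4.35 10.0.4.29 10.0.4.34 10.0.4.32 10.0.4.36 10.0.4.33; do
  restart_cmd "$n"
done
```

If memory serves, the snap's systemd unit respawns the daemon after the kill, and containers keep running under their own monitor processes, so this only restarts LXD itself.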

That fixed it, amazing! Thanks