On one of our clusters, after the upgrade from 4.0.8 to 4.0.9, all but one node have clustering issues:
root@node9:~# lxc cluster list
Error: RAFT cluster is unavailable
root@node1:~# lxc cluster list
+--------+------------------------+----------+--------+-------------------+--------------+
|  NAME  |          URL           | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+--------+------------------------+----------+--------+-------------------+--------------+
| node9  | https://10.0.4.33:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node7  | https://10.0.4.32:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node6  | https://10.0.4.34:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node8  | https://10.0.4.36:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node3  | https://10.0.4.29:8443 | NO       | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node1  | https://10.0.4.35:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
The cluster itself is working though. The SQL output is the same on all nodes too:
root@node9:~# lxd sql local "SELECT * FROM raft_nodes"
+----+----------------+------+--------+
| id |    address     | role |  name  |
+----+----------------+------+--------+
| 1  | 10.0.4.35:8443 | 0    | node1  |
| 3  | 10.0.4.29:8443 | 2    | node3  |
| 6  | 10.0.4.34:8443 | 1    | node6  |
| 7  | 10.0.4.32:8443 | 1    | node7  |
| 8  | 10.0.4.36:8443 | 0    | node8  |
| 9  | 10.0.4.33:8443 | 0    | node9  |
+----+----------------+------+--------+
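For reference, my reading of the go-dqlite client constants is that role 0 = voter, 1 = stand-by, 2 = spare (treat that mapping as my assumption). A quick sketch to decode the table above:

```python
# Sketch: decode the dqlite raft "role" integers from the raft_nodes table.
# The mapping (0 = voter, 1 = stand-by, 2 = spare) is my assumption from
# the go-dqlite client constants.
ROLE_NAMES = {0: "voter", 1: "stand-by", 2: "spare"}

# Rows copied from the raft_nodes output above: (id, address, role, name)
raft_nodes = [
    (1, "10.0.4.35:8443", 0, "node1"),
    (3, "10.0.4.29:8443", 2, "node3"),
    (6, "10.0.4.34:8443", 1, "node6"),
    (7, "10.0.4.32:8443", 1, "node7"),
    (8, "10.0.4.36:8443", 0, "node8"),
    (9, "10.0.4.33:8443", 0, "node9"),
]

for node_id, address, role, name in raft_nodes:
    print(f"{name}: {ROLE_NAMES[role]}")

voters = [name for _, _, role, name in raft_nodes if role == 0]
print("voters:", voters)  # three voters, so raft quorum needs two of them up
```

So if that mapping is right, node1, node8 and node9 are the voters, which makes the "RAFT cluster is unavailable" error on node9 even stranger.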
root@node9:~# lxd sql global "SELECT * FROM nodes"
+----+--------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
| id |  name  | description |    address     | schema | api_extensions |           heartbeat            | pending | arch |
+----+--------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
| 1  | node1  |             | 10.0.4.35:8443 | 30     | 220            | 2022-03-08T14:36:38.438963893Z | 0       | 2    |
| 3  | node3  |             | 10.0.4.29:8443 | 30     | 220            | 2022-03-08T14:36:36.873354791Z | 0       | 2    |
| 6  | node6  |             | 10.0.4.34:8443 | 30     | 220            | 2022-03-08T14:36:34.439250471Z | 0       | 2    |
| 7  | node7  |             | 10.0.4.32:8443 | 30     | 220            | 2022-03-08T14:36:40.236046313Z | 0       | 2    |
| 8  | node8  |             | 10.0.4.36:8443 | 30     | 220            | 2022-03-08T14:36:36.557923058Z | 0       | 2    |
| 9  | node9  |             | 10.0.4.33:8443 | 30     | 220            | 2022-03-08T14:36:34.090774806Z | 0       | 2    |
+----+--------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
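The heartbeats above are all within a few seconds of each other, which I read as the nodes still talking to each other. A quick sanity check of the spread (timestamps copied from the table; truncated to microseconds because older `datetime.fromisoformat` rejects nanoseconds and the trailing "Z"):

```python
from datetime import datetime, timezone

# Heartbeat timestamps copied from the global nodes table above.
heartbeats = {
    "node1": "2022-03-08T14:36:38.438963893Z",
    "node3": "2022-03-08T14:36:36.873354791Z",
    "node6": "2022-03-08T14:36:34.439250471Z",
    "node7": "2022-03-08T14:36:40.236046313Z",
    "node8": "2022-03-08T14:36:36.557923058Z",
    "node9": "2022-03-08T14:36:34.090774806Z",
}

def parse(ts: str) -> datetime:
    # Keep only microsecond precision and mark the zone as UTC explicitly.
    return datetime.fromisoformat(ts[:26]).replace(tzinfo=timezone.utc)

times = {name: parse(ts) for name, ts in heartbeats.items()}
spread = max(times.values()) - min(times.values())
print(f"heartbeat spread: {spread.total_seconds():.1f}s")
```

The spread comes out around six seconds, well within the usual heartbeat interval, so the heartbeat round itself looks healthy.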
systemctl reload snap.lxd.daemon
doesn't help. The log doesn't show anything interesting apart from:
t=2022-03-08T14:22:35+0000 lvl=eror msg="Failed to get leader node address" err="RAFT cluster is unavailable"
Any ideas?