RAFT cluster is unavailable after upgrade to 4.0.9

On one of our clusters, after upgrading from 4.0.8 to 4.0.9, all but one node report clustering issues:

root@node9:~# lxc cluster list
Error: RAFT cluster is unavailable
root@node1:~# lxc cluster list
+--------+------------------------+----------+--------+-------------------+--------------+
|  NAME  |          URL           | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+--------+------------------------+----------+--------+-------------------+--------------+
| node9  | https://10.0.4.33:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node7  | https://10.0.4.32:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node6  | https://10.0.4.34:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node8  | https://10.0.4.36:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node3  | https://10.0.4.29:8443 | NO       | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+
| node1  | https://10.0.4.35:8443 | YES      | ONLINE | Fully operational | x86_64       |
+--------+------------------------+----------+--------+-------------------+--------------+

The cluster itself is still working, though, and the SQL output is identical on all nodes:

root@node9:~# lxd sql local "SELECT * FROM raft_nodes"
+----+----------------+------+--------+
| id |    address     | role |  name  |
+----+----------------+------+--------+
| 1  | 10.0.4.35:8443 | 0    | node1  |
| 3  | 10.0.4.29:8443 | 2    | node3  |
| 6  | 10.0.4.34:8443 | 1    | node6  |
| 7  | 10.0.4.32:8443 | 1    | node7  |
| 8  | 10.0.4.36:8443 | 0    | node8  |
| 9  | 10.0.4.33:8443 | 0    | node9  |
+----+----------------+------+--------+
root@node9:~# lxd sql global "SELECT * FROM nodes"
+----+--------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
| id |  name  | description |    address     | schema | api_extensions |           heartbeat            | pending | arch |
+----+--------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
| 1  | node1  |             | 10.0.4.35:8443 | 30     | 220            | 2022-03-08T14:36:38.438963893Z | 0       | 2    |
| 3  | node3  |             | 10.0.4.29:8443 | 30     | 220            | 2022-03-08T14:36:36.873354791Z | 0       | 2    |
| 6  | node6  |             | 10.0.4.34:8443 | 30     | 220            | 2022-03-08T14:36:34.439250471Z | 0       | 2    |
| 7  | node7  |             | 10.0.4.32:8443 | 30     | 220            | 2022-03-08T14:36:40.236046313Z | 0       | 2    |
| 8  | node8  |             | 10.0.4.36:8443 | 30     | 220            | 2022-03-08T14:36:36.557923058Z | 0       | 2    |
| 9  | node9  |             | 10.0.4.33:8443 | 30     | 220            | 2022-03-08T14:36:34.090774806Z | 0       | 2    |
+----+--------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
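For readability, the numeric role column can be decoded; as far as I know, go-dqlite uses 0 = voter, 1 = stand-by, 2 = spare (treat that mapping as an assumption for your exact version). A minimal helper:

```shell
# Decode a dqlite raft role number into a name.
# Assumed mapping (go-dqlite): 0 = voter, 1 = stand-by, 2 = spare.
raft_role() {
  case "$1" in
    0) echo voter ;;
    1) echo stand-by ;;
    2) echo spare ;;
    *) echo unknown ;;
  esac
}

raft_role 2   # node3's role
```

Under that mapping node3 would be a spare, which is consistent with its DATABASE = NO entry in the cluster list above.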

systemctl reload snap.lxd.daemon doesn’t help, and the log doesn’t show anything interesting apart from:

t=2022-03-08T14:22:35+0000 lvl=eror msg="Failed to get leader node address" err="RAFT cluster is unavailable"

Any ideas?

The output above all looks good. Can you check lxd.log on all servers for anything weird-looking?
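A quick sketch of one way to do that (`scan_log` is just a hypothetical helper name; the path is the snap default):

```shell
# Print warning/error lines from an LXD log file.
# LXD abbreviates log levels to four characters, hence "eror".
scan_log() {
  grep -E 'lvl=(warn|eror)' "$1"
}

# e.g., on each server:  scan_log /var/snap/lxd/common/lxd/logs/lxd.log
```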

It may also be good to run kill -9 $(cat /var/snap/lxd/common/lxd/lxd.pid) on all servers to get a clean state and see if things still misbehave then.

lxd.pid?

yes, fixed :slight_smile:

kill -9 $(cat /var/snap/lxd/common/lxd.pid) :wink:
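To apply that on every server in one go, a hedged sketch, assuming root SSH access and the node addresses from the cluster list above (it only prints the commands; run them once they look right):

```shell
# Print the hard-restart command for one node (hypothetical helper;
# the pid file path is the LXD snap default).
restart_cmd() {
  printf "ssh root@%s 'kill -9 \$(cat /var/snap/lxd/common/lxd.pid)'\n" "$1"
}

# Addresses taken from the cluster list above.
for n in 10.0.4.35 10.0.4.29 10.0.4.34 10.0.4.32 10.0.4.36 10.0.4.33; do
  restart_cmd "$n"
done
```

If memory serves, the snap's systemd unit respawns the daemon after the kill, and containers keep running under their own monitor processes, so this only restarts LXD itself.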

That fixed it, amazing! Thanks