Snap Update to 4.1 Broke my cluster

Could you please paste (or mail me) the full lxd.log from all three nodes? Make sure debug mode is turned on (snap set lxd daemon.debug=true).

To all users that might hit this: the issue here is very likely the one fixed by https://github.com/canonical/raft/pull/142, which will be included in LXD 4.2 and backported to 4.0.

If you experience it during the upgrade, one thing you can attempt before the more heavy-weight recovery procedure described above is this:

  1. Stop all your lxd daemons (e.g. pkill -9 lxd; systemctl stop snap.lxd.daemon; systemctl stop snap.lxd.daemon.unix.socket)
  2. Run sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes WHERE role=0"
  3. Pick one and only one of the nodes returned by the query in step 2, and on that node run mv /var/snap/lxd/common/lxd/database/global/metadata1 /var/snap/lxd/common/lxd/database/global/metadata2 /your/backup/dir
  4. Restart each node (e.g. systemctl start snap.lxd.daemon.unix.socket; systemctl start snap.lxd.daemon)

That should bring your cluster back.
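For convenience, here is a rough sketch of the four steps as shell functions. It only uses the commands quoted above; DRY_RUN, BACKUP_DIR and the function names are made up for this example. By default it prints what it would run rather than running it.

```shell
#!/bin/sh
# Sketch only: wraps the commands from the steps above in functions.
# DRY_RUN and BACKUP_DIR are hypothetical variables, not LXD or snap settings.
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print commands, 0 = execute them
BACKUP_DIR="${BACKUP_DIR:-/your/backup/dir}"  # where the metadata files are moved
DB_DIR=/var/snap/lxd/common/lxd/database

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 1: stop the daemon and its socket unit (run on every node).
stop_lxd() {
    run pkill -9 lxd
    run systemctl stop snap.lxd.daemon
    run systemctl stop snap.lxd.daemon.unix.socket
}

# Step 2: list the raft nodes with role=0.
list_role0_nodes() {
    run sqlite3 "$DB_DIR/local.db" "SELECT * FROM raft_nodes WHERE role=0"
}

# Step 3: move the metadata files away (run on ONE of the listed nodes only).
move_metadata() {
    run mv "$DB_DIR/global/metadata1" "$DB_DIR/global/metadata2" "$BACKUP_DIR"
}

# Step 4: start the socket unit and the daemon again (run on every node).
start_lxd() {
    run systemctl start snap.lxd.daemon.unix.socket
    run systemctl start snap.lxd.daemon
}
```

Source the file and call stop_lxd, list_role0_nodes, move_metadata and start_lxd in that order. Check the dry-run output first, then set DRY_RUN=0 to actually execute. Remember that move_metadata belongs on a single node only.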

If things are still stuck, there might be a pending snap refresh that never completed. You can find its ID with snap changes lxd and cancel it with snap abort <ID>. Then run snap refresh lxd again.
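If you would rather not pick the ID out by eye, something like the following might help. It is a sketch that assumes the usual snap changes layout (a header line, then ID in the first column and Status in the second) and that an unfinished change shows the status Doing. The function name pending_change_ids is made up for this example.

```shell
#!/bin/sh
# Sketch only: prints the IDs of changes whose Status column reads "Doing".
# Assumes the first line of input is the column header and that
# column 1 is the ID and column 2 is the Status.
pending_change_ids() {
    awk 'NR > 1 && $2 == "Doing" { print $1 }'
}
```

Use it as snap changes lxd | pending_change_ids, then pass the ID it prints to snap abort <ID> as described above. If it prints nothing, check the snap changes lxd output by hand.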

snap refresh lxd --candidate will put you on the candidate channel, which carries LXD 4.2 as we intend to release it on Monday.