Please can you paste (or mail me) the full lxd.log of all three nodes? Make sure debug mode is turned on (`snap set lxd daemon.debug=true`).
To all users that might hit this: the issue here is very likely the one fixed by https://github.com/canonical/raft/pull/142, which will be included in LXD 4.2 and backported to 4.0.
If you experience it during the upgrade, one thing you can attempt before the more heavy-weight recovery procedure described above is this:
- Stop all your lxd daemons (e.g. `pkill -9 lxd; systemctl stop snap.lxd.daemon; systemctl stop snap.lxd.daemon.unix.socket`)
- Run `sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes WHERE role=0"`
- Pick one and only one of the nodes returned by the query in the previous step and run `mv /var/snap/lxd/common/lxd/database/global/metadata1 /var/snap/lxd/common/lxd/database/global/metadata2 /your/backup/dir` on that node
- Start the LXD daemon again on each node (e.g. `systemctl start snap.lxd.daemon.unix.socket; systemctl start snap.lxd.daemon`)
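For anyone who prefers to see the steps above in one place, here is a rough sketch that wraps them in shell functions. It is not an official tool: the function names and the `LXD_DIR`, `BACKUP_DIR` and `DRY_RUN` variables are made up for this example, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the recovery steps above. Function and variable names are
# hypothetical; the commands themselves are the ones quoted in the steps.
LXD_DIR="${LXD_DIR:-/var/snap/lxd/common/lxd}"
BACKUP_DIR="${BACKUP_DIR:-/your/backup/dir}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = actually run them

# Print the command in dry-run mode (quoting is not preserved in the
# printed form), otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: stop the daemon and its socket (run on every node).
stop_lxd() {
    run pkill -9 lxd
    run systemctl stop snap.lxd.daemon
    run systemctl stop snap.lxd.daemon.unix.socket
}

# Step 2: list the raft nodes with role=0 from the local database.
list_role0_nodes() {
    run sqlite3 "$LXD_DIR/database/local.db" \
        "SELECT * FROM raft_nodes WHERE role=0"
}

# Step 3: move the two metadata files to the backup directory
# (run on exactly one of the nodes listed by step 2).
move_metadata() {
    run mv "$LXD_DIR/database/global/metadata1" \
           "$LXD_DIR/database/global/metadata2" \
           "$BACKUP_DIR"
}

# Step 4: start the socket and the daemon again (run on every node).
start_lxd() {
    run systemctl start snap.lxd.daemon.unix.socket
    run systemctl start snap.lxd.daemon
}
```

Source the file and call `stop_lxd` on every node, `list_role0_nodes` to see the candidates, `move_metadata` on exactly one of them, and `start_lxd` on every node. Check the dry-run output first, then set `DRY_RUN=0` (and a real `BACKUP_DIR`) once the printed commands look right.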
That should bring your cluster back.
If things are still stuck, there might be a pending snap refresh that never completed. You can find its ID with `snap changes lxd` and cancel it with `snap abort <ID>`. Then run `snap refresh lxd` again.
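If the change list is long, a small filter can help spot the stuck refresh. This is only a sketch: `pending_change_ids` is a made-up helper, and it assumes that the output has one header row, that the ID and Status are the first two columns, and that an unfinished change shows `Doing` as its status. Check the raw output if yours looks different.

```shell
#!/bin/sh
# Hypothetical helper: read `snap changes` output on stdin and print the
# IDs of changes whose second column (Status) is "Doing".
# Assumes the first row is a header and the columns start with: ID Status.
pending_change_ids() {
    awk 'NR > 1 && $2 == "Doing" { print $1 }'
}
```

Use it as `snap changes lxd | pending_change_ids`, then pass the ID it prints to `snap abort <ID>`.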
`snap refresh lxd --candidate` will get you on the candidate channel with LXD 4.2 as we intend to release it on Monday.