Snap Update to 4.1 Broke my cluster

freeekanayaka · June 3, 2020, 8:57am

Could you please paste (or mail me) the full lxd.log from all three nodes? Make sure debug mode is turned on first (snap set lxd daemon.debug=true).

freeekanayaka · June 5, 2020, 8:17pm

To all users that might hit this: the issue here is very likely the one fixed by https://github.com/canonical/raft/pull/142, which will be included in LXD 4.2 and backported to 4.0.

If you experience it during the upgrade, one thing you can attempt before the more heavy-weight recovery procedure described above is this:

1. Stop all your lxd daemons (e.g. pkill -9 lxd; systemctl stop snap.lxd.daemon; systemctl stop snap.lxd.daemon.unix.socket)
2. Run sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes WHERE role=0"
3. Pick one and only one of the nodes returned by step 2 and, on that node only, run mv /var/snap/lxd/common/lxd/database/global/metadata1 /var/snap/lxd/common/lxd/database/global/metadata2 /your/backup/dir
4. Restart each node (e.g. systemctl start snap.lxd.daemon.unix.socket; systemctl start snap.lxd.daemon)

That should bring your cluster back.
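
For anyone who prefers to script it, the steps above can be sketched roughly as the shell helpers below. This is only a sketch: the function names, the DRY_RUN switch and the DB_DIR/BACKUP_DIR variables are my own additions for illustration, not part of LXD. By default it only prints the commands; set DRY_RUN=0 to actually run them, and replace /your/backup/dir with a real directory first.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the manual recovery steps from this thread.
# Nothing is executed unless DRY_RUN=0 is set in the environment.
set -u

DRY_RUN="${DRY_RUN:-1}"
DB_DIR="${DB_DIR:-/var/snap/lxd/common/lxd/database}"  # path quoted in the post
BACKUP_DIR="${BACKUP_DIR:-/your/backup/dir}"           # placeholder, change it

# Print a command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Run on every node: stop the daemon and its socket unit.
stop_lxd() {
    run pkill -9 lxd
    run systemctl stop snap.lxd.daemon
    run systemctl stop snap.lxd.daemon.unix.socket
}

# Run on any node: list the raft_nodes rows with role=0.
list_role0_nodes() {
    run sqlite3 "$DB_DIR/local.db" "SELECT * FROM raft_nodes WHERE role=0"
}

# Run on exactly ONE of the nodes listed above: move the metadata
# files out of the way (they are moved to the backup dir, not deleted).
move_metadata() {
    run mv "$DB_DIR/global/metadata1" "$DB_DIR/global/metadata2" "$BACKUP_DIR"
}

# Run on every node: start the socket unit, then the daemon.
start_lxd() {
    run systemctl start snap.lxd.daemon.unix.socket
    run systemctl start snap.lxd.daemon
}
```

Source the file on each node, call stop_lxd everywhere, then list_role0_nodes, then move_metadata on the single node you picked, and finally start_lxd everywhere. Keeping the dry run as the default is deliberate, since moving the metadata files on more than one node is exactly what the procedure tells you not to do.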

If things are still stuck, there might be a pending snap refresh that never completed. You can find its ID with snap changes lxd and cancel it with snap abort <ID>, then run snap refresh lxd again.
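
If you would rather not pick the ID out by eye, a small filter along these lines might help. It is a sketch under assumptions: pending_change_ids is a made-up helper name, and it assumes snap changes prints a header row followed by columns starting with ID and Status, with unfinished changes shown as Doing or Do. Check the output on your own system before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `snap changes` output on stdin and print the
# IDs of changes that still look unfinished.
# Assumption: first line is a header, column 1 is the ID, column 2 the
# status, and in-progress changes have status "Doing" or "Do".
pending_change_ids() {
    awk 'NR > 1 && ($2 == "Doing" || $2 == "Do") { print $1 }'
}

# Intended usage (not executed here):
#   snap changes lxd | pending_change_ids
#   snap abort <ID>
#   snap refresh lxd
```

The helper only reads stdin and never calls snap abort itself, so the decision to cancel a change stays with you.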

stgraber · June 5, 2020, 10:09pm

snap refresh lxd --candidate will get you on the candidate channel with LXD 4.2 as we intend to release it on Monday.