Cluster dead, all machines 'waiting for other cluster nodes to upgrade their versions'

I noticed that LXD had hung again, so I went through the usual ‘reboot everything until snap figures out what to do’, but now all my hosts are dead:

t=2019-09-18T06:48:30+1000 lvl=info msg="LXD 3.17 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2019-09-18T06:48:30+1000 lvl=info msg="Kernel uid/gid map:"
t=2019-09-18T06:48:30+1000 lvl=info msg=" - u 0 0 4294967295"
t=2019-09-18T06:48:30+1000 lvl=info msg=" - g 0 0 4294967295"
t=2019-09-18T06:48:30+1000 lvl=info msg="Configured LXD uid/gid map:"
t=2019-09-18T06:48:30+1000 lvl=info msg=" - u 0 1000000 1000000000"
t=2019-09-18T06:48:30+1000 lvl=info msg=" - g 0 1000000 1000000000"
t=2019-09-18T06:48:30+1000 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
t=2019-09-18T06:48:30+1000 lvl=info msg="Kernel features:"
t=2019-09-18T06:48:30+1000 lvl=info msg=" - netnsid-based network retrieval: no"
t=2019-09-18T06:48:30+1000 lvl=info msg=" - uevent injection: no"
t=2019-09-18T06:48:30+1000 lvl=info msg=" - seccomp listener: no"
t=2019-09-18T06:48:30+1000 lvl=info msg=" - unprivileged file capabilities: yes"
t=2019-09-18T06:48:30+1000 lvl=info msg=" - shiftfs support: no"
t=2019-09-18T06:48:30+1000 lvl=info msg="Initializing local database"
t=2019-09-18T06:48:30+1000 lvl=info msg="Starting /dev/lxd handler:"
t=2019-09-18T06:48:30+1000 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2019-09-18T06:48:30+1000 lvl=info msg="REST API daemon:"
t=2019-09-18T06:48:30+1000 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-09-18T06:48:30+1000 lvl=info msg=" - binding TCP socket" socket=10.61.0.21:8443
t=2019-09-18T06:48:30+1000 lvl=info msg="Initializing global database"
t=2019-09-18T06:48:31+1000 lvl=info msg="Wait for other cluster nodes to upgrade their versions"
root@lxc-01:~#

All machines are at 3.17/11964 and I can’t get ANY of them to start. Help?

The message here refers to other cluster nodes that still need to upgrade their version.
Can you check with snap info lxd that all nodes are on the same version?
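For example, a quick way to compare them across the cluster (using the three node names from this thread) would be something like:

for h in lxc-01 lxc-03 lxc-05; do
    ssh "root@$h" 'snap info lxd | grep -E "^(installed|tracking):"'
done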

You can force LXD to upgrade to the available version right away with

sudo snap refresh lxd

I’ve done that.

I have three nodes:

root@lxc-01:~# snap refresh lxd
snap "lxd" has no updates available
root@lxc-01:~#

root@lxc-03:~# snap refresh lxd
snap "lxd" has no updates available
root@lxc-03:~#

root@lxc-05:~# snap refresh lxd
snap "lxd" has no updates available
root@lxc-05:~#

Edit: I found one of them still thinks it’s upgrading:

root@lxc-05:~# snap changes
ID   Status  Spawn                      Ready                Summary
35   Undone  7 days ago, at 06:18 AEST  today at 06:38 AEST  Auto-refresh snaps "lxd", "core"
36   Undone  today at 06:38 AEST        today at 06:42 AEST  Auto-refresh snaps "core", "lxd"
37   Done    today at 06:40 AEST        today at 06:40 AEST  Refresh all snaps: no updates
38   Done    today at 06:40 AEST        today at 06:40 AEST  Refresh all snaps: no updates
39   Doing   today at 06:44 AEST        -                    Refresh "lxd" snap

root@lxc-05:~# snap abort 39
root@lxc-05:~# snap stop lxd
error: snap "lxd" has "refresh-snap" change in progress
root@lxc-05:~# snap abort 39
root@lxc-05:~# snap stop lxd
error: snap "lxd" has "refresh-snap" change in progress
root@lxc-05:~#

But I can’t seem to convince it to actually upgrade.
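For anyone hitting the same thing: snapd's standard change-inspection commands can show which task inside the stuck change is actually blocking, e.g.:

snap change 39    # lists the individual tasks of change 39 and their status
snap watch 39     # follows the change until it finishes or fails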

I did reboot lxc-01; it reverted back to 3.16, so I ran a refresh again, and it's back where it was.

What I really want to do is un-cluster these nodes, because this is REALLY REALLY unreliable.

Is there any way to manually hack on some database somewhere to tell each node that it’s a member of a single-node cluster? Then I can export their containers, and rebuild the machines.

It looks like this is simply unsolvable. It's now possible for a cluster to be unable to cold-boot, leaving me in the situation where I just had to throw it all away and redeploy everything without clustering.

I have turned the VMs off, so if @stgraber (or someone on his team?) wants to investigate how we managed to get into this situation, I can give you access.

But for the moment, I would strongly recommend people do not use LXD Clustering, in any form.

Responded privately to see how we could get access to take a look.

The error messages above suggest that the nodes can all talk to each other and that the database is functional.

The fact that it still shows Wait for other cluster nodes to upgrade their versions suggests that either you have a fourth cluster node in the database which shouldn't exist anymore (a dead system that was never removed from the cluster), or one of the nodes is failing to update its database record, holding up the upgrade due to inconsistent database versions.
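If you want to confirm which of the two it is, the nodes table can be queried through LXD's built-in SQL interface, assuming the daemon on one of the database nodes still answers queries:

lxd sql global "SELECT id, name, address, schema, api_extensions FROM nodes;"

The schema and api_extensions columns are what the upgrade wait compares across members; a row that lags behind the others, or a row for a machine that no longer exists, is what holds everything back.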

That's exactly what happened: two nodes out of five died, and I hadn't had time to rebuild them.

OK, so assuming you know their names, you can drop a /var/snap/lxd/common/lxd/database/patch.global.sql file containing:

DELETE FROM nodes WHERE name='node1';
DELETE FROM nodes WHERE name='node2';

replacing node1 and node2 with the names of the two nodes that need to be kicked out.
The patch file only needs to be present on one of the database nodes.
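Put together, the step looks something like this, run on one of the database nodes:

cat > /var/snap/lxd/common/lxd/database/patch.global.sql <<'EOF'
DELETE FROM nodes WHERE name='node1';
DELETE FROM nodes WHERE name='node2';
EOF
systemctl restart snap.lxd.daemon   # the patch is applied during daemon startup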