Using LXD clustering on a 5-node cluster, which seems to have failed its automatic refresh to 4.0.0 (snap 14709).
Now I have 3 nodes which are saying they’re waiting for the other two to update, while the other two are saying they’re behind and need to catch up. I suspect the ones that are down are database nodes, as they’re the first ones I set up, but I don’t know how many replicas are in the cluster normally.
How do I get myself out of this hole?
lxc cluster list isn’t available at present.
Nodes needing updates are saying the following in their logs:
t=2020-04-21T08:42:27+0000 lvl=eror msg="Failed to start the daemon: failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade"
Nodes on current version are saying the following:
t=2020-04-21T09:00:23+0000 lvl=info msg="Wait for other cluster nodes to upgrade their versions"
Make sure that all your nodes have refreshed to the latest snap. That alone should unblock both nodes waiting for other nodes to be upgraded, and nodes blocked because their version is too old.
Possibly, restart all lxd daemons after the snap refresh, in case they got stuck in some weird state.
It should not matter which nodes are database nodes and which are not (for the record there are 3 replicas).
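The steps above can be sketched as a loop over the members. This is only a sketch: the hostnames and SSH access are assumptions, so it prints the commands rather than executing them; drop the `echo` to run them for real.

```shell
# Sketch: refresh the LXD snap and bounce the daemon on every cluster member.
# NODES is a placeholder list; substitute your real hostnames or IPs.
NODES="node1 node2 node3 node4 node5"
for n in $NODES; do
  # Printed rather than executed here, as a dry run.
  echo "ssh $n 'sudo snap refresh lxd && sudo systemctl restart snap.lxd.daemon'"
done
```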
That’s where I am currently - the failed ones are refusing to start during the refresh, with the same complaint about being behind. They’re currently on snap 14503, which also claims to be 4.0.0.
Might be a slightly deeper issue, then - the node it tries to contact immediately before that step is also down, which might be the true source of that message.
t=2020-04-21T09:52:43+0000 lvl=info msg="Initializing global database"
t=2020-04-21T09:52:43+0000 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: Head "https://172.16.11.1:8443/internal/database": dial tcp 172.16.11.1:8443: connect: connection refused address=172.16.11.1:8443 attempt=0"
t=2020-04-21T09:52:43+0000 lvl=eror msg="Failed to start the daemon: failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade"
t=2020-04-21T09:52:43+0000 lvl=info msg="Starting shutdown sequence"
Trying the same on 172.16.11.1 also has the same result, amusingly - I guess that’s the node I managed to get everything connected to when I built the cluster originally, so they’re all looking to it to establish the cluster state.
It doesn’t mean that the two running nodes are db nodes; it means that at least one of them is. For example, both of your lower-version nodes might be database nodes: as soon as you start one of them the database becomes available, and at that point the starting node compares its version against the other nodes and aborts. That’s another possibility.
As I said, the offline nodes will contact online nodes automatically, don’t worry about that.
Just upgrade everything to the same version and you should be fine.
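To confirm everything landed on the same version, the Rev column of `snap list lxd` on each node should read the same revision. A minimal sketch of extracting it, using an illustrative sample of the command's output (the sample text is an assumption, not captured from a real node):

```shell
# Sketch: parse the Rev column out of `snap list lxd` output.
# "sample" stands in for what the command prints on one node.
sample='Name  Version  Rev    Tracking  Publisher   Notes
lxd   4.0.0    14709  stable    canonical*  -'
echo "$sample" | awk 'NR==2 {print $3}'   # prints the Rev column: 14709
```

Run the real `snap list lxd` on each member and compare; any node showing a different revision is the one holding the cluster back.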
The second node is identical, except for the following line immediately before the "version is behind" error:
t=2020-04-21T11:24:50+0000 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=172.16.11.1:8443 attempt=0"
It looks to me like they both don’t know about the other nodes at this point, so there may be a discovery phase that I’m not getting to. I suspect 11.1 is the original cluster node (from when clustering was a 3.0/edge feature).
Welcome to my rather frustrating morning - I’ve had issues where aborting the failed update and refreshing again cleared things up, but this is just a hard refusal to behave itself.
/var/snap/lxd/14709 does exist during the upgrade, so I think it’s trying to install the updated version. It doesn’t stick around long enough for me to do anything to hold it there, so I’m not 100% sure of that.
If /var/snap/lxd/14709 exists, you should have some logs from that snap version (even if just the logs of the snap upgrade procedure, which should be in the unit journal). I don’t see how the logs you pasted could possibly come from 14709, unless I’m missing something.
You and me both - those log entries are coming from /var/snap/lxd/common/lxd/logs/lxd.log, but the journal (from journalctl --unit=snap.lxd.daemon) doesn’t have anything different - just less of it. 14709 exists for less than a minute, before being wiped by the rollback.
If you can see snapd reports, I can pass across the GUID from earlier - I’ve just found those in the system journal.
Can you please paste the full journalctl --unit=snap.lxd.daemon output emitted during the refresh attempt, with a few lines from before the refresh started and after it ended?
There’s something very weird going on now then - I’ve manually downloaded snap 14709 (snap download lxd --stable, if someone finds this later) and installing that still prepares the system for 14503. Is there a hash available that I can check to see if the right version got downloaded?
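For checking what actually got downloaded: `snap download` writes both a `.snap` file and a matching `.assert` file, and the assertion records the revision (plus a sha3-384 digest). A sketch of reading the revision back out, using a trimmed, illustrative stand-in for the assertion (a real one is longer and signed):

```shell
# Sketch: inspect the assertion that snap download wrote alongside the .snap.
# This assert body is a fabricated stand-in for illustration only.
cat > lxd_14709.assert <<'EOF'
type: snap-revision
snap-sha3-384: <digest-from-the-store>
snap-revision: 14709
EOF
grep '^snap-revision:' lxd_14709.assert   # shows which revision snapd will install
```

`snap info lxd` also lists the current revision per channel, which gives you a second point of comparison against what the store thinks stable should be.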