LXD Failed Cluster Upgrade - unable to update some nodes

I’m using LXD clustering on a 5-node cluster, which seems to have failed during its automatic refresh to 4.0.0 (snap 14709).

Now I have 3 nodes saying they’re waiting for the other two to update, while the other two say they’re behind and need to catch up. I suspect the ones that are down are database nodes, as they were the first ones I set up, but I don’t know how many database replicas the cluster normally has.

How do I get myself out of this hole?

lxc cluster list isn’t available at present.

Nodes needing updates are saying the following in their logs:
t=2020-04-21T08:42:27+0000 lvl=eror msg="Failed to start the daemon: failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade"

Nodes on current version are saying the following:
t=2020-04-21T09:00:23+0000 lvl=info msg="Wait for other cluster nodes to upgrade their versions"

Make sure that all your nodes have refreshed to the latest snap. That alone should unblock both the nodes waiting for others to upgrade and the nodes blocked because their own version is too old.
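For reference, a quick way to confirm on each node which revision it is actually running, and to force the refresh (the revision numbers involved are the ones from this thread):

# show the revision this node is currently running (Rev column)
snap list lxd

# refresh to the latest revision of the tracked channel
sudo snap refresh lxd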

If needed, also restart all LXD daemons after the snap refresh, in case they got stuck in some weird state.
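For example, using the snap’s standard unit name (or letting snapd restart the services for you):

# restart the daemon through systemd
sudo systemctl restart snap.lxd.daemon

# or the snap-level equivalent
sudo snap restart lxd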

It should not matter which nodes are database nodes and which are not (for the record there are 3 replicas).

That’s where I am currently: the failed ones are refusing to start during the refresh, with the same complaint about being behind. They’re currently on snap 14503, which also claims to be 4.0.0.

As far as I can tell from reading the source code the message:

this node’s version is behind, please upgrade

can be returned only if the node you are starting has a version older than one or more other nodes in the cluster.

It might be a slightly deeper issue, then: the node it tries to contact immediately before that step is also down, which might be the true source of that message.

t=2020-04-21T09:52:43+0000 lvl=info msg="Initializing global database"
t=2020-04-21T09:52:43+0000 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: Head \"https://172.16.11.1:8443/internal/database\": dial tcp 172.16.11.1:8443: connect: connection refused address=172.16.11.1:8443 attempt=0"
t=2020-04-21T09:52:43+0000 lvl=eror msg="Failed to start the daemon: failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade"
t=2020-04-21T09:52:43+0000 lvl=info msg="Starting shutdown sequence"

Trying the same thing on 172.16.11.1 gives the same result, amusingly. I guess that’s the node I got everything connected to when I built the cluster originally, so they’re all looking to it to establish the cluster state.
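In case it helps, the cluster addresses an offline node will try should be recorded in its local (per-node) database. This is only a read-only peek, assuming the snap’s default path, with the daemon stopped and the sqlite3 package installed:

# list the raft cluster members this node knows about
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes;"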

I’ve just been around again with snap info, and I think I got the original post wrong: I have 3 nodes on snap 14503 and 2 on 14709.

This shouldn’t be an issue. The connection will be retried with some other node. If you get all the way to:

Failed to start the daemon: failed to open cluster database: failed to ensure schema: this node’s version is behind, please upgrade

it means the cluster database is healthy (i.e. there is a majority of database nodes online).

I guess that means my two running nodes are database nodes, which is reassuring at least.

Is there a way to tell the offline nodes to try connecting to one of those nodes, instead of the 3rd (currently down) node?

Failing that, is there a manual schema update process I can use to force things?

It doesn’t mean that the two running nodes are db nodes. It means that at least one of them is: for example, both of your nodes with the lower version might be database nodes; as soon as you start one of them, the database becomes available, and at that point the node being started checks its version against the other nodes and aborts. That’s another possibility.

As I said, the offline nodes will contact online nodes automatically, don’t worry about that.

Just upgrade everything to the same version and you should be fine.

My problem is that they are refusing to update when ordered - unless there’s some mechanism beyond snap refresh that I can use.

What do you see in the logs when running “snap refresh lxd” on a node which is behind?
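For example, following the daemon’s journal in one terminal while triggering the refresh in another (same unit name as used later in this thread):

# terminal 1: follow the daemon's journal live
journalctl --unit=snap.lxd.daemon -f

# terminal 2: trigger the refresh on the node that is behind
sudo snap refresh lxd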

On my first node, I get the following:

t=2020-04-21T11:27:12+0000 lvl=info msg="LXD 4.0.0 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-04-21T11:27:12+0000 lvl=info msg="Kernel uid/gid map:"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - u 0 0 4294967295"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - g 0 0 4294967295"
t=2020-04-21T11:27:12+0000 lvl=info msg="Configured LXD uid/gid map:"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - u 0 1000000 1000000000"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - g 0 1000000 1000000000"
t=2020-04-21T11:27:12+0000 lvl=info msg="Kernel features:"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - netnsid-based network retrieval: no"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - uevent injection: no"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - seccomp listener: no"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - seccomp listener continue syscalls: no"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - unprivileged file capabilities: yes"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - cgroup layout: hybrid"
t=2020-04-21T11:27:12+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - shiftfs support: disabled"
t=2020-04-21T11:27:12+0000 lvl=info msg="Initializing local database"
t=2020-04-21T11:27:12+0000 lvl=info msg="Starting /dev/lxd handler:"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-04-21T11:27:12+0000 lvl=info msg="REST API daemon:"
t=2020-04-21T11:27:12+0000 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-04-21T11:27:12+0000 lvl=info msg=" - binding TCP socket" socket=172.16.0.10:8443
t=2020-04-21T11:27:12+0000 lvl=info msg="Initializing global database"
t=2020-04-21T11:27:12+0000 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: Head \"https://172.16.11.1:8443/internal/database\": dial tcp 172.16.11.1:8443: connect: connection refused address=172.16.11.1:8443 attempt=0"
t=2020-04-21T11:27:13+0000 lvl=eror msg="Failed to start the daemon: failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade"
t=2020-04-21T11:27:13+0000 lvl=info msg="Starting shutdown sequence"
t=2020-04-21T11:27:13+0000 lvl=info msg="Stopping REST API handler:"
t=2020-04-21T11:27:13+0000 lvl=info msg=" - closing socket" socket=172.16.0.10:8443
t=2020-04-21T11:27:13+0000 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-04-21T11:27:13+0000 lvl=info msg="Stopping /dev/lxd handler:"
t=2020-04-21T11:27:13+0000 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/devlxd/sock

The second node is identical, except for the following line immediately before the "version is behind" error:

t=2020-04-21T11:24:50+0000 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=172.16.11.1:8443 attempt=0"

It looks to me like neither of them knows about the other nodes at this point, so there may be a discovery phase that I’m not getting to. I suspect 11.1 is the original cluster node (from back when clustering was a 3.0/edge feature).

This doesn’t make sense to me. I repeat: if you get

this node's version is behind

it means that the lxd binary that the process is executing does not come from the latest snap.
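A quick way to double-check which revision the daemon is actually being started from (paths are the snap defaults):

# the revision the 'current' symlink points at
readlink /snap/lxd/current

# every revision snapd has on disk, including disabled ones
snap list --all lxd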

Welcome to my rather frustrating morning - I’ve had issues where aborting the failed update and refreshing again cleared things up, but this is just a hard refusal to behave itself.

/var/snap/lxd/14709 does exist during the upgrade, so I think it’s trying to install the updated version. It doesn’t stick around long enough for me to do anything to hold it there, so I’m not 100% on that.

If /var/snap/lxd/14709 exists, you should have some logs from that snap version (perhaps even just the logs of the snap upgrade procedure, which should be in the unit journal). I don’t see how the logs you pasted could possibly come from 14709, unless I’m missing something.

You and me both - those log entries are coming from /var/snap/lxd/common/lxd/logs/lxd.log, but the journal (from journalctl --unit=snap.lxd.daemon) doesn’t have anything different - just less of it. 14709 exists for less than a minute, before being wiped by the rollback.

If you can see snapd reports, I can pass across the GUID from earlier - I’ve just found those in the system journal.
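(For reference, snapd’s own change log should also record the failed refresh and the automatic revert; something along these lines, where the change id is whatever snap changes lists:)

# list recent snapd changes, including the lxd auto-refresh
snap changes

# show the individual tasks and errors of one change, e.g. id 123
snap change 123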

Can you please paste the full journalctl --unit=snap.lxd.daemon output emitted during the refresh attempt, with a few lines from before the refresh started and after it ended?
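For example, bounded to the window around the refresh attempt (timestamps here are only illustrative):

journalctl --unit=snap.lxd.daemon --since "2020-04-21 12:55" --until "2020-04-21 13:10"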

The following is repeated a few times; looking at it, your theory that this is actually a 14503 reinstall might be correct:

Apr 21 13:01:15 cloud10 systemd[1]: snap.lxd.daemon.service: Start request repeated too quickly.
Apr 21 13:01:15 cloud10 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Apr 21 13:01:15 cloud10 systemd[1]: Failed to start Service for snap application lxd.daemon.
Apr 21 13:01:32 cloud10 systemd[1]: Started Service for snap application lxd.daemon.
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: => Preparing the system (14503)
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Loading snap configuration
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Setting up mntns symlink (mnt:[4026533101])
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Setting up mount propagation on /var/snap/lxd/common/lxd/storage-pools
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Setting up mount propagation on /var/snap/lxd/common/lxd/devices
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Setting up persistent shmounts path
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Setting up kmod wrapper
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Preparing /boot
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Preparing a clean copy of /run
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Preparing a clean copy of /etc
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Setting up ceph configuration
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Setting up LVM configuration
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Rotating logs
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Setting up ZFS (0.7)
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Escaping the systemd cgroups
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ====> Detected cgroup V1
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Escaping the systemd process resource limits
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Disabling shiftfs on this kernel (auto)
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: => Re-using existing LXCFS
Apr 21 13:01:32 cloud10 lxd.daemon[12089]: ==> Cleaning up existing LXCFS namespace
Apr 21 13:01:33 cloud10 lxd.daemon[12089]: => Starting LXD
Apr 21 13:01:34 cloud10 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
Apr 21 13:01:34 cloud10 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Apr 21 13:01:34 cloud10 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
Apr 21 13:01:34 cloud10 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 1.
Apr 21 13:01:34 cloud10 systemd[1]: Stopped Service for snap application lxd.daemon.
Apr 21 13:01:34 cloud10 systemd[1]: Started Service for snap application lxd.daemon.

The restart counter goes up to 7, but it takes less than 15 seconds to go through it all.
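(Side note for anyone reading later: if the unit gets stuck in that "start request repeated too quickly" state, the start-limit counter can usually be cleared with the standard systemd command before retrying:)

# reset the start-limit counter so the unit can be started again
sudo systemctl reset-failed snap.lxd.daemon.service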

Until that message shows 14709 (i.e. until the target snap version is 14709), there’s no way the upgrade can succeed.

There’s something very weird going on, then - I’ve manually downloaded snap 14709 (snap download lxd --stable, for anyone who finds this later) and installing that still prepares the system for 14503. Is there a hash available that I can check, to see whether the right version got downloaded?
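(For anyone hitting the same question later: the .assert file that snap download fetches alongside the .snap contains the expected sha3-384 digest, and snapd verifies it when installing the local file; the filenames below assume the 14709 revision mentioned above.)

# download the snap plus its signed assertions
snap download lxd --stable

# the snap-revision assertion carries the expected sha3-384 digest
grep snap-sha3-384 lxd_14709.assert

# acknowledge the assertions, then install the local file (snapd checks the digest)
sudo snap ack lxd_14709.assert
sudo snap install lxd_14709.snap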