Cluster member won't start after upgrade

klim · July 21, 2023, 4:52pm

I upgraded from LXD 4.10 (non-snap) to the latest snap version (5.15).

The cluster consists of four nodes. I ran the following steps on three of them and everything went fine:

snap install lxd
lxd.migrate

Those three nodes are now waiting for the last node. I ran the same steps on the fourth, but something must have failed, and lxd won’t start there.

Output from lxd --debug --group lxd:

INFO[07-21|18:50:50] LXD 4.10 is starting in normal mode      path=/var/lib/lxd
INFO[07-21|18:50:50] Kernel uid/gid map:
INFO[07-21|18:50:50]  - u 0 0 4294967295
INFO[07-21|18:50:50]  - g 0 0 4294967295
INFO[07-21|18:50:50] Configured LXD uid/gid map:
INFO[07-21|18:50:50]  - u 0 1000000 65536
INFO[07-21|18:50:50]  - g 0 1000000 65536
INFO[07-21|18:50:50] Kernel features:
INFO[07-21|18:50:50]  - closing multiple file descriptors efficiently: yes
INFO[07-21|18:50:50]  - netnsid-based network retrieval: yes
INFO[07-21|18:50:50]  - pidfds: yes
INFO[07-21|18:50:50]  - uevent injection: yes
INFO[07-21|18:50:50]  - seccomp listener: yes
INFO[07-21|18:50:50]  - seccomp listener continue syscalls: yes
INFO[07-21|18:50:50]  - seccomp listener add file descriptors: yes
INFO[07-21|18:50:50]  - attach to namespaces via pidfds: yes
INFO[07-21|18:50:50]  - safe native terminal allocation : yes
INFO[07-21|18:50:50]  - unprivileged file capabilities: yes
INFO[07-21|18:50:50]  - cgroup layout: hybrid
WARN[07-21|18:50:50]  - Couldn't find the CGroup blkio.weight, disk priority will be ignored
INFO[07-21|18:50:50]  - shiftfs support: yes
INFO[07-21|18:50:50] Initializing local database
DBUG[07-21|18:50:50] Initializing database gateway
DBUG[07-21|18:50:50] Start database node                      role=voter id=6 address=10.0.0.3:8443
INFO[07-21|18:50:50] Starting cluster handler:
INFO[07-21|18:50:50] Starting /dev/lxd handler:
INFO[07-21|18:50:50]  - binding devlxd socket                 socket=/var/lib/lxd/devlxd/sock
INFO[07-21|18:50:50] REST API daemon:
INFO[07-21|18:50:50]  - binding Unix socket                   socket=/var/lib/lxd/unix.socket
INFO[07-21|18:50:50]  - binding TCP socket                    socket=10.0.0.3:8443
INFO[07-21|18:50:50] Initializing global database
DBUG[07-21|18:50:50] Dqlite: attempt 0: server 10.0.0.1:8443: connected
DBUG[07-21|18:50:50] Database error: &errors.errorString{s:"this node's version is behind, please upgrade"}
EROR[07-21|18:50:50] Failed to start the daemon: failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade
INFO[07-21|18:50:50] Starting shutdown sequence
INFO[07-21|18:50:50] Stop database gateway
INFO[07-21|18:50:50] Stopping REST API handler:
INFO[07-21|18:50:50]  - closing socket                        socket=10.0.0.3:8443
INFO[07-21|18:50:50]  - closing socket                        socket=/var/lib/lxd/unix.socket
INFO[07-21|18:50:50] Stopping /dev/lxd handler:
INFO[07-21|18:50:50]  - closing socket                        socket=/var/lib/lxd/devlxd/sock
DBUG[07-21|18:50:50] Not unmounting temporary filesystems (containers are still running)
Error: failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade
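
The log above reports "LXD 4.10 is starting" with path=/var/lib/lxd, and then fails with "this node's version is behind". A minimal sketch for pulling those two facts out of saved debug output follows. The function names are hypothetical helpers, not part of LXD, and they only match the line formats shown in this log.

```shell
#!/bin/sh
# Sketch: summarize saved `lxd --debug` output read from stdin.
# Both functions are hypothetical helpers, not LXD commands.

# Print the version from the first "LXD <version> is starting" line.
lxd_log_version() {
    sed -n 's/.*LXD \([0-9][0-9.]*\) is starting.*/\1/p' | head -n 1
}

# Print the lines logged at error level, plus the final "Error:" line.
lxd_log_errors() {
    grep -E '^(EROR|Error:)'
}
```

Assuming the output was saved to a file (lxd-debug.log is a made-up name), `lxd_log_version < lxd-debug.log` shows which version the daemon that actually started reports, and `lxd_log_errors < lxd-debug.log` shows why it stopped.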

klim · July 21, 2023, 5:50pm

This could be related to this issue: https://github.com/canonical/lxd/issues/10210

I’m seeing the error about "no such column: state" on one of the nodes:


time="2023-07-21T19:49:55+02:00" level=warning msg="Failed getting raft nodes" err="Failed getting cluster members: Failed to fetch nodes: no such column: state"
time="2023-07-21T19:49:55+02:00" level=warning msg="Failed to get current cluster members" err="Failed to fetch nodes: no such column: state"

tomp · July 23, 2023, 3:59pm

Let’s discuss this over at https://discourse.ubuntu.com/t/cluster-member-won-t-start-after-upgrade/37162/2