LXD Failed Cluster Upgrade - unable to update some nodes

@stgraber I’m out of ideas here. How could a “snap refresh lxd” run end up printing:

=> Preparing the system (14503)

if the latest version of the snap in the store is 14709?

snap info lxd and snap changes may be helpful

# snap info lxd
sname:      lxd
summary:   System container manager and API
publisher: Canonical✓
store-url: https://snapcraft.io/lxd
contact:   https://github.com/lxc/lxd/issues
license:   unset
description: |
....
commands:
  - lxd.benchmark
  - lxd.buginfo
  - lxd.check-kernel
  - lxd.lxc
  - lxd
  - lxd.migrate
services:
  lxd.activate: oneshot, enabled, inactive
  lxd.daemon:   simple, enabled, inactive
snap-id:      J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:     latest/stable
refresh-date: today at 13:32 UTC
channels:
  latest/stable:    4.0.0       2020-04-17 (14709) 62MB -
  latest/candidate: 4.0.0       2020-04-21 (14759) 62MB -
  latest/beta:      ↑
  latest/edge:      git-b65cc2d 2020-04-21 (14768) 62MB -
  4.0/stable:       4.0.0       2020-04-17 (14709) 62MB -
  4.0/candidate:    4.0.0       2020-04-21 (14759) 62MB -
  4.0/beta:         ↑
  4.0/edge:         ↑
...
installed:          4.0.0                  (14503) 70MB -

# snap changes
ID   Status  Spawn               Ready               Summary
102  Done    today at 07:24 UTC  today at 07:24 UTC  Running service command
103  Error   today at 07:26 UTC  today at 07:26 UTC  Refresh snap "lxd"
104  Error   today at 07:28 UTC  today at 07:29 UTC  Refresh snap "lxd"
105  Error   today at 07:38 UTC  today at 07:38 UTC  Refresh "lxd" snap from "4.0/stable" channel
106  Done    today at 07:59 UTC  today at 07:59 UTC  Running service command
107  Done    today at 08:06 UTC  today at 08:06 UTC  Running service command
108  Done    today at 08:06 UTC  today at 08:06 UTC  Running service command
109  Error   today at 08:28 UTC  today at 08:28 UTC  Refresh "lxd" snap from "4.0/stable" channel
110  Error   today at 08:41 UTC  today at 08:42 UTC  Refresh "lxd" snap from "4.0/stable" channel
111  Done    today at 08:51 UTC  today at 08:51 UTC  Running service command
112  Error   today at 09:18 UTC  today at 09:19 UTC  Refresh "lxd" snap from "4.0/stable" channel
113  Error   today at 09:47 UTC  today at 09:47 UTC  Refresh "lxd" snap from "4.0/stable" channel
114  Error   today at 09:52 UTC  today at 09:52 UTC  Refresh "lxd" snap from "4.0/stable" channel
115  Error   today at 10:00 UTC  today at 10:00 UTC  Auto-refresh snap "lxd"
116  Error   today at 10:28 UTC  today at 10:28 UTC  Refresh "lxd" snap from "4.0/stable" channel
117  Error   today at 11:21 UTC  today at 11:21 UTC  Refresh "lxd" snap from "4.0/stable" channel
118  Error   today at 11:26 UTC  today at 11:27 UTC  Refresh "lxd" snap
119  Error   today at 12:21 UTC  today at 12:21 UTC  Refresh "lxd" snap
120  Error   today at 12:42 UTC  today at 12:42 UTC  Refresh "lxd" snap
121  Error   today at 12:43 UTC  today at 12:43 UTC  Refresh "lxd" snap from "4.0/stable" channel
122  Error   today at 13:00 UTC  today at 13:01 UTC  Refresh "lxd" snap from "4.0/stable" channel
123  Error   today at 13:01 UTC  today at 13:01 UTC  Refresh "lxd" snap from "4.0/stable" channel
124  Error   today at 13:18 UTC  today at 13:19 UTC  Refresh "lxd" snap from "4.0/stable" channel
125  Error   today at 13:20 UTC  today at 13:20 UTC  Refresh "lxd" snap
126  Error   today at 13:22 UTC  today at 13:22 UTC  Refresh snap "lxd"
127  Error   today at 13:27 UTC  today at 13:27 UTC  Install "lxd" snap from file "lxd_14709.snap"
128  Error   today at 13:31 UTC  today at 13:32 UTC  Install "lxd" snap from file "./lxd_14709.snap"
# snap change 126
Status  Spawn               Ready               Summary
Done    today at 13:22 UTC  today at 13:22 UTC  Ensure prerequisites for "lxd" are available
Undone  today at 13:22 UTC  today at 13:22 UTC  Download snap "lxd" (14709) from channel "latest/stable"
Done    today at 13:22 UTC  today at 13:22 UTC  Fetch and check assertions for snap "lxd" (14709)
Undone  today at 13:22 UTC  today at 13:22 UTC  Mount snap "lxd" (14709)
Undone  today at 13:22 UTC  today at 13:22 UTC  Run pre-refresh hook of "lxd" snap if present
Undone  today at 13:22 UTC  today at 13:22 UTC  Stop snap "lxd" services
Undone  today at 13:22 UTC  today at 13:22 UTC  Remove aliases for snap "lxd"
Undone  today at 13:22 UTC  today at 13:22 UTC  Make current revision for snap "lxd" unavailable
Undone  today at 13:22 UTC  today at 13:22 UTC  Copy snap "lxd" data
Undone  today at 13:22 UTC  today at 13:22 UTC  Setup snap "lxd" (14709) security profiles
Undone  today at 13:22 UTC  today at 13:22 UTC  Make snap "lxd" (14709) available to the system
Undone  today at 13:22 UTC  today at 13:22 UTC  Automatically connect eligible plugs and slots of snap "lxd"
Undone  today at 13:22 UTC  today at 13:22 UTC  Set automatic aliases for snap "lxd"
Undone  today at 13:22 UTC  today at 13:22 UTC  Setup snap "lxd" aliases
Undone  today at 13:22 UTC  today at 13:22 UTC  Run post-refresh hook of "lxd" snap if present
Error   today at 13:22 UTC  today at 13:22 UTC  Start snap "lxd" (14709) services
Hold    today at 13:22 UTC  today at 13:22 UTC  Remove data for snap "lxd" (14353)
Hold    today at 13:22 UTC  today at 13:22 UTC  Remove snap "lxd" (14353) from the system
Hold    today at 13:22 UTC  today at 13:22 UTC  Clean up "lxd" (14709) install
Hold    today at 13:22 UTC  today at 13:22 UTC  Run configure hook of "lxd" snap if present
Hold    today at 13:22 UTC  today at 13:22 UTC  Run health check of "lxd" snap
Done    today at 13:22 UTC  today at 13:22 UTC  Consider re-refresh of "lxd"

......................................................................
Start snap "lxd" (14709) services

2020-04-21T13:22:25Z ERROR [start snap.lxd.activate.service] failed with exit status 1: Job for snap.lxd.activate.service failed because the control process exited with error code.
See "systemctl status snap.lxd.activate.service" and "journalctl -xe" for details.

# snap change 128
Status  Spawn               Ready               Summary
Done    today at 13:31 UTC  today at 13:32 UTC  Ensure prerequisites for "lxd" are available
Undone  today at 13:31 UTC  today at 13:32 UTC  Prepare snap "/var/lib/snapd/snaps/.local-install-341550792" (14709)
Undone  today at 13:31 UTC  today at 13:32 UTC  Mount snap "lxd" (14709)
Undone  today at 13:31 UTC  today at 13:32 UTC  Run pre-refresh hook of "lxd" snap if present
Undone  today at 13:31 UTC  today at 13:32 UTC  Stop snap "lxd" services
Undone  today at 13:31 UTC  today at 13:32 UTC  Remove aliases for snap "lxd"
Undone  today at 13:31 UTC  today at 13:32 UTC  Make current revision for snap "lxd" unavailable
Undone  today at 13:31 UTC  today at 13:32 UTC  Copy snap "lxd" data
Undone  today at 13:31 UTC  today at 13:32 UTC  Setup snap "lxd" (14709) security profiles
Undone  today at 13:31 UTC  today at 13:32 UTC  Make snap "lxd" (14709) available to the system
Undone  today at 13:31 UTC  today at 13:32 UTC  Automatically connect eligible plugs and slots of snap "lxd"
Undone  today at 13:31 UTC  today at 13:32 UTC  Set automatic aliases for snap "lxd"
Undone  today at 13:31 UTC  today at 13:32 UTC  Setup snap "lxd" aliases
Undone  today at 13:31 UTC  today at 13:32 UTC  Run post-refresh hook of "lxd" snap if present
Error   today at 13:31 UTC  today at 13:32 UTC  Start snap "lxd" (14709) services
Hold    today at 13:31 UTC  today at 13:32 UTC  Remove data for snap "lxd" (14353)
Hold    today at 13:31 UTC  today at 13:32 UTC  Remove snap "lxd" (14353) from the system
Hold    today at 13:31 UTC  today at 13:32 UTC  Clean up "lxd" (14709) install
Hold    today at 13:31 UTC  today at 13:32 UTC  Run configure hook of "lxd" snap if present
Hold    today at 13:31 UTC  today at 13:32 UTC  Run health check of "lxd" snap

......................................................................
Start snap "lxd" (14709) services

2020-04-21T13:32:01Z ERROR [start snap.lxd.activate.service] failed with exit status 1: Job for snap.lxd.activate.service failed because the control process exited with error code.
See "systemctl status snap.lxd.activate.service" and "journalctl -xe" for details.

Just to have the complete picture, can you also paste the journalctl logs that match the start and end timestamps of snap change 128 (or snap change 126, should be the same). The timestamp of the journalctl log that you pasted earlier seem the ones from snap change 123 (i.e. around 13:01 UTC).

Apr 21 13:32:14 cloud10 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
Apr 21 13:32:14 cloud10 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Apr 21 13:32:14 cloud10 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
Apr 21 13:32:14 cloud10 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 6.
Apr 21 13:32:14 cloud10 systemd[1]: Stopped Service for snap application lxd.daemon.
Apr 21 13:32:14 cloud10 systemd[1]: Started Service for snap application lxd.daemon.
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: => Preparing the system (14503)
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Loading snap configuration
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Setting up mntns symlink (mnt:[4026533096])
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Setting up kmod wrapper
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Preparing /boot
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Preparing a clean copy of /run
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Preparing a clean copy of /etc
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Setting up ceph configuration
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Setting up LVM configuration
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Rotating logs
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Setting up ZFS (0.7)
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Escaping the systemd cgroups
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ====> Detected cgroup V1
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Escaping the systemd process resource limits
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Disabling shiftfs on this kernel (auto)
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: => Re-using existing LXCFS
Apr 21 13:32:14 cloud10 lxd.daemon[4113]: ==> Cleaning up existing LXCFS namespace
Apr 21 13:32:15 cloud10 lxd.daemon[4113]: => Starting LXD
Apr 21 13:32:15 cloud10 lxd.daemon[4113]: t=2020-04-21T13:32:15+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swa
Apr 21 13:32:16 cloud10 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
Apr 21 13:32:16 cloud10 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Apr 21 13:32:16 cloud10 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
Apr 21 13:32:16 cloud10 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 7.

And what about /var/snap/lxd/common/lxd/logs/lxd.log in the same time window?

@stgraber I’m a bit confused that during the install of 14709, snapd outputs:

Preparing the system (14503)

is that normal? The only explanation is that the version of the currently installed snap is printed (14503), but that seems odd.

t=2020-04-21T13:32:15+0000 lvl=info msg="LXD 4.0.0 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-04-21T13:32:15+0000 lvl=info msg="Kernel uid/gid map:"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - u 0 0 4294967295"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - g 0 0 4294967295"
t=2020-04-21T13:32:15+0000 lvl=info msg="Configured LXD uid/gid map:"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - u 0 1000000 1000000000"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - g 0 1000000 1000000000"
t=2020-04-21T13:32:15+0000 lvl=info msg="Kernel features:"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - netnsid-based network retrieval: no"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - uevent injection: no"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - seccomp listener: no"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - seccomp listener continue syscalls: no"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - unprivileged file capabilities: yes"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - cgroup layout: hybrid"
t=2020-04-21T13:32:15+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - shiftfs support: disabled"
t=2020-04-21T13:32:15+0000 lvl=info msg="Initializing local database"
t=2020-04-21T13:32:15+0000 lvl=info msg="Starting /dev/lxd handler:"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-04-21T13:32:15+0000 lvl=info msg="REST API daemon:"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-04-21T13:32:15+0000 lvl=info msg=" - binding TCP socket" socket=172.16.0.10:8443
t=2020-04-21T13:32:15+0000 lvl=info msg="Initializing global database"
t=2020-04-21T13:32:15+0000 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: Head \"https://172.16.11.1:8443/internal/database\": dial tcp 172.16.11.1:8443: connect: connection refused address=172.16.11.1:8443 attempt=0"
t=2020-04-21T13:32:15+0000 lvl=eror msg="Failed to start the daemon: failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade"
t=2020-04-21T13:32:15+0000 lvl=info msg="Starting shutdown sequence"
t=2020-04-21T13:32:15+0000 lvl=info msg="Stopping REST API handler:"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - closing socket" socket=172.16.0.10:8443
t=2020-04-21T13:32:15+0000 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-04-21T13:32:15+0000 lvl=info msg="Stopping /dev/lxd handler:"
t=2020-04-21T13:32:15+0000 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/devlxd/sock

I wonder if you can do something like:

  • snap stop lxd
  • snap refresh lxd
  • snap start lxd

So it stops being an idiot and reverting you to a broken revision…

@freeekanayaka snapd itself shows that it’s running the old rev I believe, so we need to have snapd stop trying to revert stuff and actually start the latest rev

@orose there’s a dirty workaround you can do though:

  • cp /snap/lxd/14709/bin/lxd /var/snap/lxd/common/lxd.debug

This will have LXD start the 14709 version no matter what snapd thinks.
Once that’s online, snap refresh better behave, at which point you can rm /var/snap/lxd/common/lxd.debug

Think I’ll have to go with dirty at the moment - need to get things up and running as soon as possible now.

The binary isn’t sticking around long enough to copy, so is there another place I can get it from?

You already downloaded the .snap file I think, right? Just unsquashfs-it to extract the whole content.

I think I’m getting somewhere now - running in debug mode to see what is what, and we’re currently doing (presumably) DB update tasks.

It’s certainly further down the process at least.

I think this is an improvement, but I now have a different problem - I think I’ve got a corrupted certificate in there somewhere now. I do know who I need to be angry at if that’s the case though. It is now starting, at least.

2020/04/21 17:25:27 http: TLS handshake error from 172.16.11.1:53914: remote error: tls: bad certificate

I don’t have lxc info to pull the certificate from any node at present, so is there any way of installing it from another node?

Dirty solution seems to have triggered enough of a DB update to get things up and running, so thanks everyone.