LXD 3.14 on snap fails

Oops. Here's what I get:
sudo systemctl status snap.lxd.daemon

snap.lxd.daemon.service - Service for snap application lxd.daemon
Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2019-06-19 15:19:35 MDT; 1min 4s ago
Process: 31311 ExecStart=/usr/bin/snap run lxd.daemon (code=exited, status=1/FAILURE)
Main PID: 31311 (code=exited, status=1/FAILURE)
Jun 19 15:19:35 lab0 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
Jun 19 15:19:35 lab0 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 5.
Jun 19 15:19:35 lab0 systemd[1]: Stopped Service for snap application lxd.daemon.
Jun 19 15:19:35 lab0 systemd[1]: snap.lxd.daemon.service: Start request repeated too quickly.
Jun 19 15:19:35 lab0 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Jun 19 15:19:35 lab0 systemd[1]: Failed to start Service for snap application lxd.daemon.
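
(Aside: "Start request repeated too quickly" only means systemd stopped retrying; the underlying failure is in LXD's own log, quoted below. If needed, the unit's journal can also be pulled, e.g.:

sudo journalctl -u snap.lxd.daemon -n 100
)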

Yes, that's wrong. Now, what's in this file:

/var/snap/lxd/common/lxd/logs/lxd.log

t=2019-06-19T16:15:13-0600 lvl=info msg="LXD 3.14 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2019-06-19T16:15:13-0600 lvl=info msg="Kernel uid/gid map:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - u 0 0 4294967295"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - g 0 0 4294967295"
t=2019-06-19T16:15:13-0600 lvl=info msg="Configured LXD uid/gid map:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - u 0 1000000 1000000000"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - g 0 1000000 1000000000"
t=2019-06-19T16:15:13-0600 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
t=2019-06-19T16:15:13-0600 lvl=info msg="Kernel features:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - netnsid-based network retrieval: no"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - uevent injection: no"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - seccomp listener: no"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - unprivileged file capabilities: yes"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - shiftfs support: no"
t=2019-06-19T16:15:13-0600 lvl=info msg="Initializing local database"
t=2019-06-19T16:15:13-0600 lvl=info msg="Starting /dev/lxd handler:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2019-06-19T16:15:13-0600 lvl=info msg="REST API daemon:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-06-19T16:15:13-0600 lvl=info msg=" - binding TCP socket" socket=[::]:8440
t=2019-06-19T16:15:13-0600 lvl=info msg="Initializing global database"
t=2019-06-19T16:15:13-0600 lvl=info msg="Initializing storage pools"
t=2019-06-19T16:15:13-0600 lvl=info msg="Applying patch: storage_api_rename_container_snapshots_dir_again"
t=2019-06-19T16:15:13-0600 lvl=eror msg="Failed to mount DIR storage pool \"/var/lib/snapd/hostfs/media/lxd-pool2\" onto \"/var/snap/lxd/common/lxd/storage-pools/lxd-pool2\": no such file or directory"
t=2019-06-19T16:15:13-0600 lvl=eror msg="Failed to start the daemon: no such file or directory"
t=2019-06-19T16:15:13-0600 lvl=info msg="Starting shutdown sequence"
t=2019-06-19T16:15:13-0600 lvl=info msg="Stopping REST API handler:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - closing socket" socket=[::]:8440
t=2019-06-19T16:15:13-0600 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-06-19T16:15:13-0600 lvl=info msg="Stopping /dev/lxd handler:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2019-06-19T16:15:13-0600 lvl=info msg="Closing the database"

It seems that the problem is here. It's disturbing to see the patch, which I believe is supposed to fix the btrfs problem, run just before the dir storage mount problem. There was also a change to dir storage in 3.14 to allow quotas, so it's unclear what is causing what. Do you have the same kind of message about the dir storage from the previous days (before the btrfs patch)? See the previous lxd.log files.

This error suggests that one of those two paths is missing:

  • /media/lxd-pool2
  • /var/snap/lxd/common/lxd/storage-pools/lxd-pool2

My guess would be the former, as that's not under LXD's control, but the latter would explain it too.
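
A quick check from the host shows which one is actually missing (a minimal sketch, assuming the pool name lxd-pool2 from the log above):

# stat reports "No such file or directory" for the missing path
stat /media/lxd-pool2
stat /var/snap/lxd/common/lxd/storage-pools/lxd-pool2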

LXD 3.14 has a patch which must run on initial startup and which corrects some issues with container snapshots on all storage pools. As a result, all storage pools get brought up so they can be inspected.

In your case, it appears that one of your storage pools isn't currently functional (it cannot be mounted due to a missing path), which causes the update to fail and LXD to refuse to start.

Resolving the source of the problem (likely /media/lxd-pool2 being missing for some reason) will allow LXD's patch to apply and LXD to come back online.

If that's a pool which shouldn't exist, you can work around the current issue by creating the missing directory; once LXD is online, delete the storage pool through LXD, at which point you can safely get rid of the directory.
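
As a rough sketch of that sequence (again assuming the pool name lxd-pool2 from the log above):

# Recreate the missing source directory so the pool can mount and the patch can run
sudo mkdir -p /media/lxd-pool2
sudo systemctl restart snap.lxd.daemon
# Once LXD is back online, delete the unwanted pool from LXD itself...
lxc storage delete lxd-pool2
# ...and only then remove the placeholder directory
sudo rmdir /media/lxd-pool2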

We had a few similar reports of issues with this update. People suddenly noticed they had created a whole bunch of storage pools backed by directories in /tmp; those obviously don't survive a reboot, and since the users had forgotten to delete those pools from LXD, things broke when LXD now had to mount them all to inspect their content.
The same trick of creating the needed directories and starting LXD again resolved that, at which point the users deleted the pools from LXD to avoid further issues down the line.

OK, that was the problem. The second of those directories exists, but /media/lxd-pool2 does not. I created it once upon a time as a plain directory pool to do some testing, but then never deleted it from the LXD config.

After creating /media/lxd-pool2, the snap refresh to 3.14 went through just fine; all containers continue to run and lxc commands work as expected.

I wasn't getting that error in the lxd.log file under 3.13. I'm assuming that's because 3.13 doesn't have the patch you describe that brings up all storage pools for inspection.

Thanks.

First, I apologize for the format of these logs, but it appears that my issue (along with another person's, whose install just crashed and whose revert doesn't work, as it reverts to 3.14 for some reason) is this:

t=2019-06-20T07:33:11-0500 lvl=info msg="Applying patch: storage_api_rename_container_snapshots_dir_again"
t=2019-06-20T07:33:12-0500 lvl=eror msg="Failed to start the daemon: rename /var/snap/lxd/common/lxd/storage-pools/default/snapshots/ss-bit-dev/docker /v$
t=2019-06-20T07:33:12-0500 lvl=info msg="Starting shutdown sequence"


I am getting a similar error to the above:

t=2019-06-20T12:34:15+0000 lvl=info msg="Applying patch: storage_api_rename_container_snapshots_dir_again"
t=2019-06-20T12:34:16+0000 lvl=eror msg="Failed to start the daemon: open /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots: is a directory"

That’s a different and somewhat confusing one. Are you also on btrfs?

Any chance you can look directly at /var/snap/lxd/common/lxd/logs/lxd.log? It may have the full version of that error, as you seem to be missing the target path and the actual error there, which would be good to get.

Note that I did post a temporary workaround here:

This will get you past the update, though you may need to do the same trick for the recently introduced storage_api_rename_container_snapshots_dir_again_again one too.
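
For reference, the workaround boils down to pre-marking the patch as applied through LXD's patch.global.sql mechanism, roughly like this (a sketch only; take the exact SQL, including the patches table columns, from the linked post):

# Executed once against the global database on the next daemon startup
echo "INSERT INTO patches (name, applied_at) VALUES ('storage_api_rename_container_snapshots_dir_again', strftime('%s'));" | sudo tee /var/snap/lxd/common/lxd/database/patch.global.sql
sudo systemctl restart snap.lxd.daemon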

However, this isn't a solution: your filesystem will still be wrong, and you won't be able to interact with any snapshot that wasn't properly migrated to the new directory. So while this will unblock you and get you back online, we still need to figure out what happened to your system and find a way to move all the bits to the right spot.

I was able to get back up with the workaround for now. I created the .sql file and ran a refresh, which didn't work, so I ran a revert (to 3.14) and assume the .sql workaround did its thing, because right now I have functional containers. Thanks for the help so far!

Unfortunately, that was the only relevant part of the log. After a refresh there are probably only around 20 lines in it: essentially starting, the failure on that line, then shutdown, and nothing else but some informational messages.

I remember an older version, I want to say around 3.10, that did a bunch of storage-related things, and I believe around that time I noticed I had two snapshots folders, as if it had migrated them or something. I wish I had noted that somewhere now, but it may not be important or relevant.

If I do the workaround just so I don't have to worry about a refresh killing everything again, will it have any negative lasting effects, or will that be taken care of by a future patch? I'm okay with staying on 3.13 as long as I can and testing a new build to see if it fixes it.

A future patch could fix this, yes, but until we know exactly what’s going on, we can’t write such a patch.

The line you showed above was truncated; do you have it complete in your log file?

What you need to get to is a point where /var/snap/lxd/common/lxd/storage-pools/default/snapshots is gone and all its content is transferred into /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots.
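
Before moving anything by hand, a read-only comparison shows how far apart the two trees are (a sketch; if the pool is only mounted inside the snap's mount namespace, wrap this in the nsenter trick shown further down):

# Lists entries that differ or exist on only one side, without changing anything
sudo diff -rq /var/snap/lxd/common/lxd/storage-pools/default/snapshots \
    /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots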

Do you have /var/snap/lxd/common/lxd/disks/default.img?

If so, can you do:

  • mkdir /tmp/lxd-disk/
  • mount -o loop /var/snap/lxd/common/lxd/disks/default.img /tmp/lxd-disk/
  • ls -lh /tmp/lxd-disk/
  • ls -lh /tmp/lxd-disk/snapshots/*
  • ls -lh /tmp/lxd-disk/snapshots/*/*
  • ls -lh /tmp/lxd-disk/containers-snapshots/*
  • ls -lh /tmp/lxd-disk/containers-snapshots/*/*

You were correct, sorry about that. Here is the full line from the log:
Failed to start the daemon: rename /var/snap/lxd/common/lxd/storage-pools/default/snapshots/ss-bit-dev/docker /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ss-bit-dev/docker: file exists

I don’t have this file.

Yes, if you use btrfs partitions it's logical that you don't have it, since that's the loop-mounted file used to store containers and images in the default configuration.
I think you should use something like:

sudo nsenter -t $(pgrep daemon.start) -m -- ls -lh /var/snap/lxd/common/lxd/storage-pools/default

and the subdirectories, as posted by @stgraber.
If you have other storage pools, do the same with those.

edit: I just realized that the preceding commands only work if LXD has started. Oops.

So, even more strangeness. I removed ALL snapshots from my containers, then did a snap refresh lxd and checked that log… no errors, so it appeared everything was good.

I rebooted, and no containers came up, so I checked that log: again just informational messages and no errors; everything looked good. I then ran lxc list and got a permission error:
Error: Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: permission denied

I then ran sudo lxc list (which I shouldn't have to do) and got a list of all my containers, but they were stopped (I have autostart set on them). I then tried to start one, and the command hung again like before.
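
(For anyone else hitting that permission error: it usually just means the calling user isn't in the lxd group for the current session. A quick check, assuming the snap's usual lxd group:

# Both should list "lxd"; if only the first does, log out and back in, or run "newgrp lxd"
getent group lxd
id -nG
)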

UPDATE: Okay, as I was typing this they finally started coming back up, and lxc list seems to work normally now, BUT it took an incredibly long time. Normally I can reboot and be up and running in 5-10 minutes; this was 20+ minutes…

As long as I can reboot once everything is up and has quieted down, and things come back okay, I'm going to call this good for me, after I create a container snapshot to test.