LXD 3.14 on snap fails

OK, that was the problem. The second of those directories exists, but /media/lxd-pool2 does not. I created it once upon a time as a plain directory pool to do some testing, but then never deleted it from the LXD config.

After creating /media/lxd-pool2 again, the snap refresh to 3.14 went through fine; all containers continue to run and lxc commands work as expected.
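For anyone else hitting this: with the pre-refresh LXD still running, checking the configured pools and their source paths shows which directory the daemon expects to exist. This is just a sketch using my pool name (lxd-pool2) and source path, so adjust to your own config:

  • lxc storage list
  • lxc storage show lxd-pool2
  • sudo mkdir -p /media/lxd-pool2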

I wasn't getting that error in lxd.log under 3.13; I'm assuming that's because 3.13 doesn't have the patch you describe that brings up all storage pools for inspection.

Thanks.

First, I apologize for the format of these logs, but it appears that my issue (along with another person whose setup just crashed, and for whom the revert doesn't work because it reverts back to 3.14 for some reason) comes down to this:

t=2019-06-20T07:33:11-0500 lvl=info msg="Applying patch: storage_api_rename_container_snapshots_dir_again"
t=2019-06-20T07:33:12-0500 lvl=eror msg="Failed to start the daemon: rename /var/snap/lxd/common/lxd/storage-pools/default/snapshots/ss-bit-dev/docker /v$
t=2019-06-20T07:33:12-0500 lvl=info msg="Starting shutdown sequence"


I am getting a similar error to the above:

t=2019-06-20T12:34:15+0000 lvl=info msg="Applying patch: storage_api_rename_container_snapshots_dir_again"
t=2019-06-20T12:34:16+0000 lvl=eror msg="Failed to start the daemon: open /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots: is a directory"

That’s a different and somewhat confusing one. Are you also on btrfs?

Any chance you can look directly at /var/snap/lxd/common/lxd/logs/lxd.log? It may have the full version of that error; you seem to be missing the target path and the actual error, which would be good to get.
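If it's easier, something along these lines should dump the recent daemon log entries (adjust -n to taste; snap logs pulls the same daemon output from the journal):

  • sudo tail -n 50 /var/snap/lxd/common/lxd/logs/lxd.log
  • sudo snap logs lxd -n 50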

Note that I did post a temporary workaround here:

This will get you past the update, though you may need to do the same trick for the recently introduced storage_api_rename_container_snapshots_dir_again_again one too.

However, this isn't a solution: your filesystem will still be wrong and you won't be able to interact with any snapshot that wasn't properly migrated to the new directory. So while this will unblock you and get you back online, we still need to figure out what happened to your system and find a way to move all the bits to the right spot.

I was able to get back up with the workaround for now. I created the .sql file and ran the refresh, which didn't work, so I ran a revert (to 3.14) and assume the .sql workaround did its thing, because right now I have functional containers. Thanks for the help so far!

Unfortunately that was the only relevant part of the log. After a refresh there are only around 20 lines in it: essentially startup, the failure on that line, then shutdown, and nothing else but some informational messages.

I remember an older version, I want to say around 3.10, that did a bunch of storage-related things, and I believe around that time I noticed I had two snapshot folders, as if it had migrated them or something. I wish I had noted that down somewhere, but it may not be important or relevant.

If I do the workaround just so I don't have to worry about a refresh killing everything again, will it have any negative lasting effects, or will that be taken care of by a future patch? I'm okay with staying on 3.13 as long as I can and testing a new build to see if it fixes things.

A future patch could fix this, yes, but until we know exactly what’s going on, we can’t write such a patch.

The line you showed above was truncated; do you have it complete in your log file?

What you need to get to is a point where /var/snap/lxd/common/lxd/storage-pools/default/snapshots is gone and all its content is transferred into /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots.
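For reference, the manual equivalent of that migration would look roughly like the lines below. This is only a sketch assuming a plain directory layout with no name collisions between the two trees; on btrfs the per-snapshot directories are subvolumes, so inspect both sides before moving anything:

  • sudo mv /var/snap/lxd/common/lxd/storage-pools/default/snapshots/* /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/
  • sudo rmdir /var/snap/lxd/common/lxd/storage-pools/default/snapshots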

Do you have /var/snap/lxd/common/lxd/disks/default.img?

If so, can you do:

  • mkdir /tmp/lxd-disk/
  • mount -o loop /var/snap/lxd/common/lxd/disks/default.img /tmp/lxd-disk/
  • ls -lh /tmp/lxd-disk/
  • ls -lh /tmp/lxd-disk/snapshots/*
  • ls -lh /tmp/lxd-disk/snapshots/*/*
  • ls -lh /tmp/lxd-disk/containers-snapshots/*
  • ls -lh /tmp/lxd-disk/containers-snapshots/*/*

You were correct, sorry about that. Here is the full line from the log:
Failed to start the daemon: rename /var/snap/lxd/common/lxd/storage-pools/default/snapshots/ss-bit-dev/docker /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ss-bit-dev/docker: file exists

I don’t have this file.

Yes, if you use btrfs partitions it's logical that you don't have it, since that's the loop-mounted file used to store containers and images in the default configuration.
I think you should use something like
sudo nsenter -t $(pgrep daemon.start) -m -- ls -lh /var/snap/lxd/common/lxd/storage-pools/default
and the subdirectories as posted by @stgraber.
If you have other storage pools, do the same for those.

Edit: I just realized that the preceding commands work only if LXD has started. Oops.

So, even more strangeness. I removed ALL snapshots from my containers, then did a snap refresh lxd and checked that log: no errors, so it appeared everything was good.

I rebooted, and no containers came up, so I checked that log again: just informational messages and no errors, everything looked good. Then I did an lxc list and got a permission error:
Error: Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: permission denied

I then ran sudo lxc list (which I shouldn't have to do) and got a list of all my containers, but they were stopped (I have autostart set on them). I tried to start one and the command hung again like before.
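For what it's worth, the usual checks for that socket error are whether the daemon has actually finished starting and whether your user is in the lxd group (lxd is the default group name used by the snap); nothing here is specific to this issue:

  • getent group lxd
  • sudo usermod -aG lxd "$USER"
  • newgrp lxd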

UPDATE: Okay, as I was typing this they finally started coming back up, and lxc list seems to work normally now, BUT it took an incredibly long time. Normally I can reboot and be up and running in 5-10 minutes; this was more like 20+ minutes.

As long as I can reboot once everything is up and has quieted down, and things come back okay, I'm going to call this good for me after I create a container snapshot to test.
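The snapshot test itself is just the usual commands; c1 here is a placeholder container name:

  • lxc snapshot c1 post-314-test
  • lxc info c1
  • lxc delete c1/post-314-test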