LXC commands fail with socket error on LXD 3.14

hatzlhoffer · June 27, 2019, 9:48am

Hi there,

Yesterday I restarted my system and my containers weren’t reachable.
So I wanted to have a look at them, but the command “lxc list” fails with the message:

Error: Get http://unix.socket/1.0: EOF

The LXD daemon (installed via snap) and the unix socket seem to be running (systemctl status says both are active).

Is it possible that this is a similar problem to this one?

At the moment I don’t know what to try next. I don’t want to wreck my system, because it runs a production web server with several sites and a database server…

I looked at several similar problems, but they all show different error messages in their threads.

What can I do to find out what the problem is?

Thanks in advance and kind regards
Michael

hatzlhoffer · June 27, 2019, 11:04am

I did some further investigation:

ll /var/snap/lxd/common/lxd/storage-pools/default seems to be empty:

total 8
drwx--x--x 2 root root 4096 Sep 26 2018 ./
drwx--x--x 3 root root 4096 Sep 26 2018 ../

Could this be my problem?

In /var/snap/lxd/common/lxd/containers, all the containers are symlinks into this directory:

total 24
drwx--x--x 2 root root 4096 Nov 7 2018 ./
drwxr-xr-x 16 lxd nogroup 4096 Jun 27 11:35 ../
lrwxrwxrwx 1 root root 66 Nov 7 2018 dbserver -> /var/snap/lxd/common/lxd/storage-pools/default/containers/dbserver
lrwxrwxrwx 1 root root 68 Nov 7 2018 mailserver -> /var/snap/lxd/common/lxd/storage-pools/default/containers/mailserver
lrwxrwxrwx 1 root root 67 Nov 7 2018 mailtools -> /var/snap/lxd/common/lxd/storage-pools/default/containers/mailtools
lrwxrwxrwx 1 root root 67 Nov 7 2018 webserver -> /var/snap/lxd/common/lxd/storage-pools/default/containers/webserver

ricardogsilva · June 27, 2019, 11:26am

I’m facing the same problem.

I had a problem last week and was able to work around it by reverting to the previous LXD version (I’m using the snap).

Now that previous, working version seems to be gone; I only have two 3.14-based revisions:

sudo snap list --all lxd
Name  Version  Rev    Tracking  Publisher   Notes
lxd   3.14     10934  stable    canonical✓  disabled
lxd   3.14     10972  stable    canonical✓  -

My logs say:

2019-06-27T11:09:57Z lxd.daemon[5737]: t=2019-06-27T12:09:57+0100 lvl=eror msg="Failed to mount DIR storage pool \"/var/lib/snapd/hostfs/home/lxd/storage-pools/default\" onto \"/var/snap/lxd/common/lxd/storage-pools/bigdisk\": no such file or directory"
2019-06-27T11:09:57Z lxd.daemon[5737]: t=2019-06-27T12:09:57+0100 lvl=eror msg="Failed to start the daemon: no such file or directory"
2019-06-27T11:09:57Z lxd.daemon[5737]: Error: no such file or directory
2019-06-27T11:09:58Z lxd.daemon[5737]: => LXD failed to start

This is AFTER I had already manually created the directories reported as missing, as was suggested in

hatzlhoffer · June 27, 2019, 12:50pm

I’m totally confused…

I got the following error message:

t=2019-06-27T13:30:45+0200 lvl=info msg="Applying patch: storage_api_rename_container_snapshots_dir_again"
t=2019-06-27T13:30:53+0200 lvl=eror msg="Failed to start the daemon: rename /var/snap/lxd/common/lxd/storage-pools/default/snapshots/dbserver/snap0 /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/dbserver/snap0: file exists"

I did read the whole other thread (LXD 3.14 on snap fails), but I simply don’t know what to do now because my message is a little different.

What really confuses me is that /var/snap/lxd/common/lxd/storage-pools/default is completely empty, yet the error line above says: file exists…

Wait… this was in lxd.log.1.

lxd.log now has no error message, but lxc list still gives me the same error:
Error: Get http://unix.socket/1.0: EOF

stgraber · June 27, 2019, 2:18pm

@hatzlhoffer try:

sudo mount /var/snap/lxd/common/lxd/disks/default.img /mnt
sudo ls -lh /mnt/snapshots
sudo ls -lh /mnt/snapshots/*
sudo ls -lh /mnt/containers-snapshots
sudo ls -lh /mnt/containers-snapshots/*

The error suggests you have a container which exists in both, making the migration from one to the other impossible. Once we’ve confirmed that’s the case, check if either the source or target snapshot is empty and blow away whichever is empty, then start LXD again.
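Before deleting anything, each side of a conflicting pair can be checked mechanically. The sketch below is only an illustration: `report_snapshot_sides` is a hypothetical helper name, it assumes the image is mounted at /mnt as in the commands above and that it runs as root (containers-snapshots is not listable by other users). It changes nothing; it only reports whether each of the two paths is missing, unreadable, empty or non-empty.

```shell
#!/bin/sh
# Hypothetical helper (not part of LXD): classify both paths of a
# conflicting snapshot pair. Read-only; it never modifies anything.
# Usage: report_snapshot_sides OLD_PATH NEW_PATH
report_snapshot_sides() {
  for side in "$1" "$2"; do
    if [ ! -d "$side" ]; then
      echo "missing: $side"
    elif [ ! -r "$side" ]; then
      # Without read permission ls cannot list it; rerun as root.
      echo "unreadable: $side"
    elif [ -z "$(ls -A -- "$side")" ]; then
      echo "empty: $side"
    else
      echo "non-empty: $side"
    fi
  done
}
```

For the error in this thread the two arguments would be /mnt/snapshots/dbserver/snap0 and /mnt/containers-snapshots/dbserver/snap0; whichever side reports "empty" is the candidate for removal.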

hatzlhoffer · June 27, 2019, 2:39pm

Hi Stephane,

none of these directories are empty.

But if I run
sudo ls -lh /mnt/containers-snapshots/*

I get:
ls: cannot access '/mnt/containers-snapshots/*': No such file or directory

even though the directory exists and is not empty.
If I switch to root with
sudo su

and run
ls -lh /mnt/containers-snapshots/*

then I see lots of snapshots from my various containers.

The containers-snapshots directory has no read access for group and others:

root@indianer2:/mnt# ll
total 20
drwxr-xr-x  1 root root  102 Nov  8  2018 ./
drwxr-xr-x 23 root root 4096 Jun 20 06:31 ../
drwxr-xr-x  1 root root   72 Dez 30 00:17 containers/
drwx--x--x  1 root root   72 Nov  8  2018 containers-snapshots/
drwxr-xr-x  1 root root    0 Sep 26  2018 custom/
drwxr-xr-x  1 root root    0 Okt  7  2018 images/
drwxr-xr-x  1 root root   72 Nov  6  2018 snapshots/

Could this be a problem?

Another thing I noticed:

In /mnt/snapshots, all the snapshots have the names I gave them.
In /mnt/containers-snapshots, there are only snapshots named “snap0” to “snap6”.

Should I paste the output of these four commands?
(I didn’t paste it here because I thought it was too much.)
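The drwx--x--x mode most likely explains why the sudo ls with a `*` failed while the same ls as root worked, and it does not point to damage. In `sudo ls -lh /mnt/containers-snapshots/*` the `*` is expanded by your own unprivileged shell before sudo runs; that mode lets other users enter the directory but not list it, so the glob matches nothing and the shell passes the literal pattern to ls, which then reports that a file named `*` does not exist. The sketch below demonstrates the pass-through with a hypothetical, nonexistent directory name and defines a hypothetical helper, `ls_as_root`, that performs the expansion inside a root shell instead.

```shell
#!/bin/sh
# An unmatched glob is passed through unchanged by POSIX sh and by bash
# (with the default nullglob/failglob settings), so the command that runs
# receives a literal "*". The directory name here is a made-up example.
show_unmatched_glob() {
  set -- /glob-demo-dir-that-does-not-exist/*
  printf '%s\n' "$1"
}

# Hypothetical helper: expand the glob inside a root shell, where the
# directory is readable. "$1"/* still globs because the * is unquoted.
ls_as_root() {
  sudo sh -c 'ls -lh -- "$1"/*' sh "$1"
}
```

With this, `ls_as_root /mnt/containers-snapshots` should behave like the `ls -lh /mnt/containers-snapshots/*` run from `sudo su`.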

hatzlhoffer · June 27, 2019, 4:16pm

OK… I first made a copy of default.img as a backup in case I break the image.

Then I mounted it again and tried to simply delete all snapshots in both folders.
The problem is that it is mounted as a read-only file system.

Can I simply mount it read-write so I can delete the files, or is there something I need to look at first?

Edit: OK… I’ve been trying the whole time, but so far I haven’t managed to mount it writable. I must be blind…
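Copying the image first is a sensible precaution. A minimal sketch follows, with `backup_image` as a hypothetical helper name. It assumes GNU cp (for `--sparse=always`, which keeps holes in the copy so a loop image that is mostly unallocated does not balloon on disk) and that nothing is writing to the image meanwhile, ideally with LXD stopped (for example `sudo snap stop lxd`) and the image unmounted.

```shell
#!/bin/sh
# Hypothetical helper: copy a loop image to a timestamped backup next to it.
# Assumes GNU cp and that nothing writes to the image while it is copied.
# Prints the backup path on success.
backup_image() {
  src=$1
  if [ ! -f "$src" ]; then
    echo "no such image: $src" >&2
    return 1
  fi
  dest="$src.bak-$(date +%Y%m%d-%H%M%S)"
  # Never overwrite an existing backup with the same timestamp.
  if [ -e "$dest" ]; then
    echo "backup already exists: $dest" >&2
    return 1
  fi
  cp --sparse=always -- "$src" "$dest" || return 1
  printf '%s\n' "$dest"
}
```

Run as root, `backup_image /var/snap/lxd/common/lxd/disks/default.img` would leave the copy in the same disks directory, so that filesystem needs room for the image's allocated blocks.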

hatzlhoffer · June 28, 2019, 8:10am

I checked dmesg after mounting the file and it says:

[68987.422847] BTRFS info (device loop0): disk space caching is enabled
[68987.422851] BTRFS info (device loop0): has skinny extents

Two days ago we had a power loss, and I think it may have damaged the btrfs filesystem.
That could be why the migration could not run…

stgraber · June 28, 2019, 3:14pm

OK, can you list the contents of both containers-snapshots and snapshots, as well as the contents of every directory inside them?

We need to figure out which containers exist in both containers-snapshots and snapshots and then track down which snapshots appear to exist in both.

I’d then expect one of the two sides to be empty; we get rid of that one and things should unblock.

stgraber · June 28, 2019, 3:16pm

Read-only is perfectly normal: those btrfs subvolumes are marked read-only, so you can’t delete or move them the normal way. LXD has logic that handles that; it just can’t do anything if it finds a snapshot in both the old and the new path. That needs to be resolved manually first.

hatzlhoffer · June 29, 2019, 3:54pm

So, here is the output (the mount point is /mnt/def_ori):

ls -lh /mnt/def_ori/snapshots

total 0
drwx--x--x 1 root root  62 Nov  6  2018 dbserver
drwx--x--x 1 root root 206 Nov  6  2018 mailserver
drwx--x--x 1 root root  16 Nov  6  2018 mailtools
drwx--x--x 1 root root 132 Nov  6  2018 webserver

ls -lh /mnt/def_ori/snapshots/*

/mnt/def_ori/snapshots/dbserver:
total 0
drwxr-x--x+ 1 100000 100000 78 Sep 28  2018 20181106
drwxr-x--x+ 1 100000 100000 78 Sep 28  2018 final_eingerichtet
drwxr-x--x+ 1 100000 100000 78 Sep 27  2018 snap0

/mnt/def_ori/snapshots/mailserver:
total 0
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 20181106
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 final_eingerichtet
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 inkl_apache2
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 mit_postfixadmin
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 vor_inst_apache
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 vor_inst_mailutils
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 vor_postfixadmin

/mnt/def_ori/snapshots/mailtools:
total 0
drwxr-xr-x+ 1 100000 100000 78 Sep 28  2018 20181106

/mnt/def_ori/snapshots/webserver:
total 0
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 20181106
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 final_eingerichtet
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 snap0
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 snap1
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 snap2
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 vor_change_cert
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 vor_update

ls -lh /mnt/def_ori/containers-snapshots/

total 0
drwx--x--x 1 root root 86 Jun 28 11:03 dbserver
drwx--x--x 1 root root 86 Mai 13 10:49 mailserver
drwx--x--x 1 root root 76 Mai 13 10:50 mailtools
drwx--x--x 1 root root 86 Mai 13 10:49 webserver

ls -lh /mnt/def_ori/containers-snapshots/*

/mnt/def_ori/containers-snapshots/dbserver:
total 0
drwxr-x--x+ 1 100000 100000 78 Sep 28  2018 20181108
drwxr-x--x+ 1 100000 100000 78 Sep 28  2018 snap0
drwxr-x--x+ 1 100000 100000 78 Sep 28  2018 snap1
drwxr-x--x+ 1 100000 100000 78 Sep 28  2018 snap2
drwxr-x--x+ 1 100000 100000 78 Sep 28  2018 snap3
drwxr-x--x+ 1 100000 100000 78 Sep 28  2018 snap4
drwxr-x--x+ 1 100000 100000 78 Sep 28  2018 snap5
drwx--x--x+ 1 root   root   78 Sep 28  2018 snap6

/mnt/def_ori/containers-snapshots/mailserver:
total 0
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 20181108
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 snap0
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 snap1
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 snap2
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 snap3
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 snap4
drwxr-x--x+ 1 100000 100000 78 Sep 26  2018 snap5
drwx--x--x+ 1 root   root   78 Sep 26  2018 snap6

/mnt/def_ori/containers-snapshots/mailtools:
total 0
drwxr-xr-x+ 1 100000 100000 78 Sep 28  2018 20181108
drwxr-xr-x+ 1 100000 100000 78 Sep 28  2018 snap0
drwxr-xr-x+ 1 100000 100000 78 Sep 28  2018 snap1
drwxr-xr-x+ 1 100000 100000 78 Sep 28  2018 snap2
drwxr-xr-x+ 1 100000 100000 78 Sep 28  2018 snap3
drwxr-xr-x+ 1 100000 100000 78 Sep 28  2018 snap4
drwx--x--x+ 1 root   root   78 Sep 28  2018 snap5

/mnt/def_ori/containers-snapshots/webserver:
total 0
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 20181108
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 snap0
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 snap1
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 snap2
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 snap3
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 snap4
drwxr-xr-x+ 1 100000 100000 78 Sep 26  2018 snap5
drwx--x--x+ 1 root   root   78 Sep 26  2018 snap6

stgraber · June 29, 2019, 6:06pm

Ok, so unless I missed something, the duplicates are:

dbserver/snap0
webserver/snap0
webserver/snap1
webserver/snap2

We’re going to assume that the old path holds the correct data, so you’ll need to run:

btrfs subvolume delete /mnt/def_ori/containers-snapshots/dbserver/snap0
btrfs subvolume delete /mnt/def_ori/containers-snapshots/webserver/snap0
btrfs subvolume delete /mnt/def_ori/containers-snapshots/webserver/snap1
btrfs subvolume delete /mnt/def_ori/containers-snapshots/webserver/snap2

Assuming that this doesn’t complain about the read-only property and that those paths are indeed gone after running those commands as root, try starting LXD and it should be able to move the rest of the data over.

Note that you’ll want to check that I didn’t miss any of them or you’re still going to hit the same issue.
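To avoid missing any, the overlap can be computed rather than compared by eye. A sketch, with `find_duplicate_snapshots` as a hypothetical name; it assumes the image is mounted at /mnt/def_ori as above and that it runs as root. It prints every container/snapshot pair present under both snapshots/ and containers-snapshots/, which is the condition behind the "file exists" error, and it deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper: list container/snapshot names present in both the
# old layout (OLD, e.g. .../snapshots) and the new one (NEW, e.g.
# .../containers-snapshots). Read-only.
# Usage: find_duplicate_snapshots OLD NEW
find_duplicate_snapshots() {
  old=$1
  new=$2
  for ctdir in "$old"/*/; do
    [ -d "$ctdir" ] || continue      # skip the literal pattern if empty
    ct=$(basename "$ctdir")
    for snapdir in "$ctdir"*/; do
      [ -d "$snapdir" ] || continue
      snap=$(basename "$snapdir")
      if [ -e "$new/$ct/$snap" ]; then
        printf '%s/%s\n' "$ct" "$snap"
      fi
    done
  done
}
```

For this thread, `find_duplicate_snapshots /mnt/def_ori/snapshots /mnt/def_ori/containers-snapshots` should print the same four entries listed above (dbserver/snap0 and webserver/snap0 to snap2).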

hatzlhoffer · June 30, 2019, 12:04am

Thanks for your help so far.

I tried
btrfs subvolume delete /mnt/def_ori/containers-snapshots/dbserver/snap0

but then I get an error:

ERROR: not a subvolume: /mnt/def_ori/containers-snapshots/dbserver/snap0

Did I miss something?
Do I have to mount something else as a subvolume?

…I ran these commands as root…

stgraber · June 30, 2019, 12:30am

Ah, then maybe those entries are not subvolumes. Try to blow them away with a good old sudo rm -Rf /mnt/def_ori/containers-snapshots/dbserver/snap0 instead; that may work.
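The subvolume case and the plain-directory case can be handled in one step. A sketch, with `remove_snapshot_path` as a hypothetical name, assuming btrfs-progs is installed and it runs as root: it asks `btrfs subvolume show` whether the path is a subvolume, uses `btrfs subvolume delete` if so, and otherwise falls back to `rm -rf` as suggested above. Nested subvolumes inside a plain directory are not treated specially, so double-check each path before running it.

```shell
#!/bin/sh
# Hypothetical helper: delete a snapshot path whether or not it is a btrfs
# subvolume. Assumes root; nested subvolumes are not handled specially.
remove_snapshot_path() {
  target=$1
  case "$target" in
    ""|/) echo "refusing to remove '$target'" >&2; return 1 ;;
  esac
  if [ ! -e "$target" ] && [ ! -L "$target" ]; then
    echo "already gone: $target" >&2
    return 0
  fi
  # "btrfs subvolume show" fails for anything that is not a subvolume
  # (and when btrfs-progs is missing), so those cases fall back to rm.
  if btrfs subvolume show "$target" >/dev/null 2>&1; then
    btrfs subvolume delete "$target"
  else
    rm -rf -- "$target"
  fi
}
```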

hatzlhoffer · June 30, 2019, 12:43am

OK…

This worked for three containers; they work now.
But the “webserver” container isn’t running.

If I run “lxc start webserver”, I get

lxc webserver 20190630004226.462 WARN     conf - conf.c:lxc_map_ids:2970 - newuidmap binary is missing
lxc webserver 20190630004226.462 WARN     conf - conf.c:lxc_map_ids:2976 - newgidmap binary is missing
lxc webserver 20190630004226.530 WARN     conf - conf.c:lxc_map_ids:2970 - newuidmap binary is missing
lxc webserver 20190630004226.530 WARN     conf - conf.c:lxc_map_ids:2976 - newgidmap binary is missing
lxc webserver 20190630004226.560 ERROR    dir - storage/dir.c:dir_mount:198 - No such file or directory - Failed to mount "/var/snap/lxd/common/lxd/containers/webserver/rootfs" on "/var/snap/lxd/common/lxc/"
lxc webserver 20190630004226.560 ERROR    conf - conf.c:lxc_mount_rootfs:1351 - Failed to mount rootfs "/var/snap/lxd/common/lxd/containers/webserver/rootfs" onto "/var/snap/lxd/common/lxc/" with options "(null)"
lxc webserver 20190630004226.560 ERROR    conf - conf.c:lxc_setup_rootfs_prepare_root:3498 - Failed to setup rootfs for
lxc webserver 20190630004226.560 ERROR    conf - conf.c:lxc_setup:3551 - Failed to setup rootfs
lxc webserver 20190630004226.560 ERROR    start - start.c:do_start:1282 - Failed to setup container "webserver"
lxc webserver 20190630004226.562 ERROR    sync - sync.c:__sync_wait:62 - An error occurred in another process (expected sequence number 5)
lxc webserver 20190630004226.611 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:864 - Received container state "ABORTING" instead of "RUNNING"
lxc webserver 20190630004226.611 ERROR    start - start.c:__lxc_start:1975 - Failed to spawn container "webserver"
lxc webserver 20190630004226.667 WARN     conf - conf.c:lxc_map_ids:2970 - newuidmap binary is missing
lxc webserver 20190630004226.668 WARN     conf - conf.c:lxc_map_ids:2976 - newgidmap binary is missing
lxc 20190630004226.685 WARN     commands - commands.c:lxc_cmd_rsp_recv:132 - Connection reset by peer - Failed to receive response for command "get_state"

stgraber · June 30, 2019, 12:45am

Can you post stat /var/snap/lxd/common/lxd/containers/webserver/rootfs

And ls -lh /mnt/def_ori/containers/?

hatzlhoffer · June 30, 2019, 12:47am

stat /var/snap/lxd/common/lxd/containers/webserver/rootfs

stat: cannot stat '/var/snap/lxd/common/lxd/containers/webserver/rootfs': No such file or directory

ls -lh /mnt/def_ori/containers/

ls: cannot access '/mnt/def_ori/containers/': No such file or directory

hatzlhoffer · June 30, 2019, 12:48am

root@indianer2:/var/snap/lxd/common/lxd/containers# ll
total 24
drwx--x--x 3 root root 4096 Jun 28 16:05 ./
drwxr-xr-x 16 lxd nogroup 4096 Jun 30 02:39 ../
lrwxrwxrwx 1 root root 66 Nov 7 2018 dbserver -> /var/snap/lxd/common/lxd/storage-pools/default/containers/dbserver
lrwxrwxrwx 1 root root 68 Nov 7 2018 mailserver -> /var/snap/lxd/common/lxd/storage-pools/default/containers/mailserver
lrwxrwxrwx 1 root root 67 Nov 7 2018 mailtools -> /var/snap/lxd/common/lxd/storage-pools/default/containers/mailtools
drwx--x--x 2 root root 4096 Jun 28 16:05 webserver/

hatzlhoffer · June 30, 2019, 12:50am

I think the problem is that “webserver” isn’t mounted correctly… is that possible?

stgraber · June 30, 2019, 12:51am

Right, /var/snap/lxd/common/lxd/containers/webserver should be a symlink like the others.

Assuming that what you have right now is an empty directory:

sudo rmdir /var/snap/lxd/common/lxd/containers/webserver
sudo ln -s /var/snap/lxd/common/lxd/storage-pools/default/containers/webserver /var/snap/lxd/common/lxd/containers/webserver
lxc start webserver

That’s assuming that it does exist on the btrfs pool at least.
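To make sure no other container ended up in the same state, every entry in the containers directory can be checked. A sketch, with `check_container_links` as a hypothetical name; it assumes the layout seen in this thread, where each entry should be a symlink to storage-pools/default/containers/<name>. It only reports problems and changes nothing.

```shell
#!/bin/sh
# Hypothetical checker: report entries in LINKS_DIR that are not symlinks
# to POOL_DIR/<name>. Returns 1 if any problem was found, 0 otherwise.
# Usage: check_container_links LINKS_DIR POOL_DIR
check_container_links() {
  links_dir=$1
  pool_dir=$2
  status=0
  for entry in "$links_dir"/*; do
    # Skip the literal pattern left behind when the directory is empty.
    [ -e "$entry" ] || [ -L "$entry" ] || continue
    name=$(basename "$entry")
    if [ ! -L "$entry" ]; then
      echo "not a symlink: $entry"
      status=1
    elif [ "$(readlink "$entry")" != "$pool_dir/$name" ]; then
      echo "unexpected target: $entry -> $(readlink "$entry")"
      status=1
    fi
  done
  return "$status"
}
```

On the listing above, `check_container_links /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/storage-pools/default/containers` would flag only webserver as "not a symlink".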