Problem with starting containers on Ubuntu 20.04

Hi,

I have a problem starting LXD containers on Ubuntu 20.04, on fresh installations as well as on upgrades from 18.04. The problem shows up on these LXD versions:
lxd 4.0.1 14804 latest/stable/… canonical✓ -
and
lxd 4.3 15913 latest/stable/… canonical✓ -

I’ve stripped it down to a minimal case where the problem reproduces: I now have 20 small containers (Debian stretch and buster, but I think it does not matter), each of them on a dedicated LV formatted as ext4:

root@u2004-1:~# mount|grep lxd/common
/dev/mapper/vg0-lxd_common on /var/snap/lxd/common type ext4 (rw,relatime)
/dev/mapper/vg0-c10 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c10 type ext4 (rw,relatime)
/dev/mapper/vg0-c9 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c9 type ext4 (rw,relatime)
/dev/mapper/vg0-c3 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c3 type ext4 (rw,relatime)
/dev/mapper/vg0-c1 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c1 type ext4 (rw,relatime)
/dev/mapper/vg0-c2 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c2 type ext4 (rw,relatime)
/dev/mapper/vg0-c8 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c8 type ext4 (rw,relatime)
/dev/mapper/vg0-c11 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c11 type ext4 (rw,relatime)
/dev/mapper/vg0-c7 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c7 type ext4 (rw,relatime)
/dev/mapper/vg0-c14 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c14 type ext4 (rw,relatime)
/dev/mapper/vg0-c13 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c13 type ext4 (rw,relatime)
/dev/mapper/vg0-c12 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c12 type ext4 (rw,relatime)
/dev/mapper/vg0-c4 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c4 type ext4 (rw,relatime)
/dev/mapper/vg0-c5 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c5 type ext4 (rw,relatime)
/dev/mapper/vg0-c6 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c6 type ext4 (rw,relatime)
/dev/mapper/vg0-c16 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c16 type ext4 (rw,relatime)
/dev/mapper/vg0-c18 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c18 type ext4 (rw,relatime)
/dev/mapper/vg0-c17 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c17 type ext4 (rw,relatime)
/dev/mapper/vg0-c15 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c15 type ext4 (rw,relatime)
/dev/mapper/vg0-c20 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c20 type ext4 (rw,relatime)
/dev/mapper/vg0-c19 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c19 type ext4 (rw,relatime)
tmpfs on /var/snap/lxd/common/ns type tmpfs (rw,relatime,size=1024k,mode=700)
nsfs on /var/snap/lxd/common/ns/shmounts type nsfs (rw)
nsfs on /var/snap/lxd/common/ns/mntns type nsfs (rw)

After launching them everything is fine, but after a reboot some of them, a different random set each time, do not start:

root@u2004-1:~# lxc start --all
c2: error: Failed to run: /snap/lxd/current/bin/lxd forkstart c2 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/c2/lxc.conf: 
c2: Try `lxc info --show-log c2` for more info
c13: error: Failed to run: /snap/lxd/current/bin/lxd forkstart c13 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/c13/lxc.conf: 
c13: Try `lxc info --show-log c13` for more info
c14: error: Failed to run: /snap/lxd/current/bin/lxd forkstart c14 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/c14/lxc.conf: 
c14: Try `lxc info --show-log c14` for more info
c18: error: Failed to run: /snap/lxd/current/bin/lxd forkstart c18 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/c18/lxc.conf: 
c18: Try `lxc info --show-log c18` for more info
c20: error: Failed to run: /snap/lxd/current/bin/lxd forkstart c20 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/c20/lxc.conf: 
c20: Try `lxc info --show-log c20` for more info
c16: error: Failed to run: /snap/lxd/current/bin/lxd forkstart c16 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/c16/lxc.conf: 
c16: Try `lxc info --show-log c16` for more info
c19: error: Failed to run: /snap/lxd/current/bin/lxd forkstart c19 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/c19/lxc.conf: 
c19: Try `lxc info --show-log c19` for more info
c15: error: Failed to run: /snap/lxd/current/bin/lxd forkstart c15 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/c15/lxc.conf: 
c15: Try `lxc info --show-log c15` for more info
c17: error: Failed to run: /snap/lxd/current/bin/lxd forkstart c17 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/c17/lxc.conf: 
c17: Try `lxc info --show-log c17` for more info

Error: Some instances failed to start
root@u2004-1:~# lxc ls --format csv|grep -i running|wc -l
11

root@u2004-1:~# lxc info --show-log c17
Name: c17
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/07/04 18:18 UTC
Status: Stopped
Type: container
Profiles: default

Log:

lxc c17 20200704221929.441 ERROR    cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1143 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.monitor.c17"
lxc c17 20200704221929.812 ERROR    cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1143 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.payload.c17"
lxc c17 20200704221929.829 WARN     cgfsng - cgroups/cgfsng.c:fchowmodat:1455 - No such file or directory - Failed to fchownat(17, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc c17 20200704221930.585 ERROR    dir - storage/dir.c:dir_mount:152 - No such file or directory - Failed to mount "/var/snap/lxd/common/lxd/containers/c17/rootfs" on "/var/snap/lxd/common/lxc/"
lxc c17 20200704221930.585 ERROR    conf - conf.c:lxc_mount_rootfs:1256 - Failed to mount rootfs "/var/snap/lxd/common/lxd/containers/c17/rootfs" onto "/var/snap/lxd/common/lxc/" with options "(null)"
lxc c17 20200704221930.585 ERROR    conf - conf.c:lxc_setup_rootfs_prepare_root:3178 - Failed to setup rootfs for
lxc c17 20200704221930.585 ERROR    conf - conf.c:lxc_setup:3277 - Failed to setup rootfs
lxc c17 20200704221930.586 ERROR    start - start.c:do_start:1231 - Failed to setup container "c17"
lxc c17 20200704221930.605 ERROR    sync - sync.c:__sync_wait:41 - An error occurred in another process (expected sequence number 5)
lxc c17 20200704221930.251 WARN     network - network.c:lxc_delete_network_priv:3213 - Failed to rename interface with index 0 from "eth0" to its initial name "vethbc59a0f9"
lxc c17 20200704221930.251 ERROR    start - start.c:__lxc_start:1952 - Failed to spawn container "c17"
lxc c17 20200704221930.251 WARN     start - start.c:lxc_abort:1025 - No such process - Failed to send SIGKILL via pidfd 30 for process 3522
lxc c17 20200704221930.252 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:852 - Received container state "ABORTING" instead of "RUNNING"
lxc 20200704221931.283 WARN     commands - commands.c:lxc_cmd_rsp_recv:122 - Connection reset by peer - Failed to receive response for command "get_state"

It does not matter whether I run lxc start --all or start them one by one with some delay between starts; once this happens, the failed container cannot be started again.

I’ve discovered that a backup.yaml file gets created under the mount point, not on the mounted LV, and that this causes the problem:

root@u2004-1:~# lxc stop --all
root@u2004-1:~# cd /var/snap/lxd/common/lxd/storage-pools/default/containers/
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# umount  *
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# ls */backup.yaml -1
c13/backup.yaml
c14/backup.yaml
c15/backup.yaml
c16/backup.yaml
c17/backup.yaml
c18/backup.yaml
c19/backup.yaml
c2/backup.yaml
c20/backup.yaml

While the file is present, starting the container fails as shown above.
After manually removing or renaming the files, all the containers start again:

root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# for file in   */backup.yaml ; do s=$(stat -c %y $file); rename  "s/yaml/yaml-${s// /-}/" $file; done
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# for x in c*; do mount $x;done 
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# lxc start --all
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# lxc ls --format csv|grep -i running|wc -l
20

Until the next reboot, stopping and starting works well; the problem appears again after the reboot.
The same applies even if all containers are explicitly stopped prior to the reboot and not autostarted on boot.

I do not observe this problem with a smaller number of containers.

It is possible to live with this workaround once one knows it, but I believe it is a bug that should be fixed…

bye,
tomas

@tomp I believe you were looking into a similar bug not too long ago.

Yes, it was here: “After shutting down a container, backup.yaml appears in the mount directory”

The OP struggled to get debug logging working as he was encountering a separate snap issue that prevented it from being activated, so we never made any progress with it.

@Tomas please can you enable the debug logging and paste the output here from starting and then stopping a container to recreate the issue, as I’d like to see what order the mount and unmounts are occurring.

Thanks

Interestingly, the previous issue was using ZFS rather than LVM, so it doesn’t appear to be storage driver specific.

I’ve enabled debug, stopped all containers, unmounted the LVs, renamed all backup.yaml files and rebooted. Here is the log file:

https://pastebin.com/gNWXmUks

Actually, the driver used is ‘dir’:

root@u2004-1:~# lxc storage info default
info:
  description: ""
  driver: dir
  name: default
  space used: 921.74MB
  total space: 5.22GB

I am managing the LVs outside of LXD.

So are the LVs always mounted on the host before LXD starts? Can you show the output of mount on the host when this issue occurs please?

Yes, the LVs are mounted during boot from fstab in this case.

The problem is reproducible even with: snap disable lxd; manually mount everything; snap enable lxd

Here is the full mount output. Note: although LUKS is used, its key is read from a file, so the LVs are mounted automatically:

root@u2004-1:~# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,nosuid,noexec,relatime,size=973980k,nr_inodes=243495,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=203532k,mode=755)
/dev/mapper/vg0-root on / type ext4 (rw,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,clone_children)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=16646)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
/var/lib/snapd/snaps/core18_1754.snap on /snap/core18/1754 type squashfs (ro,nodev,relatime,x-gdu.hide)
/var/lib/snapd/snaps/core18_1705.snap on /snap/core18/1705 type squashfs (ro,nodev,relatime,x-gdu.hide)
/var/lib/snapd/snaps/snapd_7264.snap on /snap/snapd/7264 type squashfs (ro,nodev,relatime,x-gdu.hide)
/var/lib/snapd/snaps/snapd_8140.snap on /snap/snapd/8140 type squashfs (ro,nodev,relatime,x-gdu.hide)
/var/lib/snapd/snaps/lxd_16044.snap on /snap/lxd/16044 type squashfs (ro,nodev,relatime,x-gdu.hide)
/var/lib/snapd/snaps/lxd_16048.snap on /snap/lxd/16048 type squashfs (ro,nodev,relatime,x-gdu.hide)
/dev/mapper/cryptvg0-lxd_common on /var/snap/lxd/common type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c2 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c2 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c11 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c11 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c4 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c4 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c7 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c7 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c5 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c5 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c3 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c3 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c15 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c15 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c17 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c17 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c16 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c16 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c13 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c13 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c14 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c14 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c10 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c10 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c18 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c18 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c1 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c1 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c20 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c20 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c19 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c19 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c9 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c9 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c6 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c6 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c12 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c12 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c8 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c8 type ext4 (rw,relatime)
tmpfs on /run/snapd/ns type tmpfs (rw,nosuid,nodev,noexec,relatime,size=203532k,mode=755)
nsfs on /run/snapd/ns/lxd.mnt type nsfs (rw)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=203528k,mode=700)
tmpfs on /var/snap/lxd/common/ns type tmpfs (rw,relatime,size=1024k,mode=700)
nsfs on /var/snap/lxd/common/ns/shmounts type nsfs (rw)
nsfs on /var/snap/lxd/common/ns/mntns type nsfs (rw)

OK thanks, will wait for your Vbox appliance.

Although if, as you say, the storage pool is actually “dir” and you’re managing the LV mounts before LXD starts, then LXD won’t be unmounting anything, so it’s interesting that the backup.yaml somehow ends up underneath the mount.

appliance is here: http://samuell.org/tomas/u20.04-lxd.ova

Right, LXD should not mount/umount/remount any LV. It appears to me that backup.yaml is written on start, not on stop: when I disable lxd, unmount all LVs, remove all backup.yaml files in the mountpoints, re-enable lxd and run start --all, some containers fail and the backup.yaml files are there.

See this: I disable the lxd snap, unmount the LVs and erase the yaml files, mount everything back and enable lxd, which attempts to start all containers; no stopping occurs. Then I mount the LV that is mounted on /var/snap/lxd/common/ a second time on /mnt/2ndmount to see “under” the mountpoints c1… c20, and the backup.yaml files are already there:

root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# snap disable lxd
lxd disabled
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# umount c*
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# ls */backup.yaml
c18/backup.yaml  c19/backup.yaml  c2/backup.yaml  c20/backup.yaml  c3/backup.yaml  c4/backup.yaml  c5/backup.yaml  c6/backup.yaml
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# rm  */backup.yaml
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# for x in c*; do mount $x;done 
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# snap enable lxd
lxd enabled
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# mkdir /mnt/2ndmount
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# mount /dev/mapper/cryptvg0-lxd_common /mnt/2ndmount
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers# ls /mnt/2ndmount/lxd/storage-pools/default/containers/c*/backup.yaml
/mnt/2ndmount/lxd/storage-pools/default/containers/c18/backup.yaml
/mnt/2ndmount/lxd/storage-pools/default/containers/c19/backup.yaml
/mnt/2ndmount/lxd/storage-pools/default/containers/c2/backup.yaml
/mnt/2ndmount/lxd/storage-pools/default/containers/c20/backup.yaml
/mnt/2ndmount/lxd/storage-pools/default/containers/c3/backup.yaml
/mnt/2ndmount/lxd/storage-pools/default/containers/c4/backup.yaml
/mnt/2ndmount/lxd/storage-pools/default/containers/c5/backup.yaml
/mnt/2ndmount/lxd/storage-pools/default/containers/c6/backup.yaml
root@u2004-1:/var/snap/lxd/common/lxd/storage-pools/default/containers#
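
The check above can be wrapped in a small helper. This is only a sketch: the function name stray_backups is made up for illustration, and it assumes the mountpoint tool from util-linux is available. It lists container directories that are not currently mountpoints but still contain a backup.yaml, so it should work both after unmounting the LVs and on a second mount such as /mnt/2ndmount.

```shell
# Print the names of container directories that hold a backup.yaml on the
# underlying filesystem, i.e. directories that are not currently mountpoints.
# Hypothetical helper; $1 is the containers directory to inspect.
stray_backups() {
    dir=$1
    for d in "$dir"/*/; do
        [ -d "$d" ] || continue          # glob matched nothing
        d=${d%/}
        mountpoint -q "$d" && continue   # something is mounted here, skip it
        [ -f "$d/backup.yaml" ] && basename "$d"
    done
    return 0
}
```

For example, stray_backups /mnt/2ndmount/lxd/storage-pools/default/containers would print one container name per line for each leftover file.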

So this looks like an issue/race between the snapd lxd mount namespace and your custom mounting rules on the host.

LXD runs inside a separate mount namespace when using snap.

You can switch into the mount namespace when running LXD by finding its pid and then running:

nsenter -a -t <LXD pid>

In this case I see that:

mount | grep c20
/dev/mapper/cryptvg0-c20 on /var/lib/snapd/hostfs/var/snap/lxd/common/lxd/storage-pools/default/containers/c20 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c20 on /var/snap/lxd/common/shmounts/storage-pools/default/containers/c20 type ext4 (rw,relatime)

And the rootfs directory is missing, which explains why the container can’t start (the backup.yaml file isn’t an issue; it’s just that it’s created before we try to start the container and is left over).

ls -la /var/snap/lxd/common/lxd/containers/c20/rootfs
ls: cannot access '/var/snap/lxd/common/lxd/containers/c20/rootfs': No such file or directory

@stgraber do you have any insight into any historic mount issues with the snap?

Interestingly, inside the lxd mount ns, if one runs:

mount /dev/mapper/cryptvg0-c20 /var/snap/lxd/common/lxd/storage-pools/default/containers/c20

Then the container will start OK.

And some of the containers are mounted correctly inside the lxd mount ns:

mount | grep 'on /var/snap/lxd/common/lxd/storage-pools/default/containers/'
/dev/mapper/cryptvg0-c2 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c2 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c10 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c10 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c3 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c3 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c5 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c5 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c6 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c6 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c9 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c9 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c4 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c4 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c13 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c13 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c12 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c12 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c1 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c1 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c18 on /var/snap/lxd/common/lxd/storage-pools/default/containers/c18 type ext4 (rw,relatime)

So it looks like it’s a race condition during LXD startup, not waiting for all mounts to finish.

Interestingly, inside the mount ns, the containers that won’t start are mounted in /var/snap/lxd/common/shmounts/:

root@u2004-1:/# mount | grep /var/snap/lxd/common/shmounts
/dev/mapper/cryptvg0-c10 on /var/snap/lxd/common/shmounts/storage-pools/default/containers/c10 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c11 on /var/snap/lxd/common/shmounts/storage-pools/default/containers/c11 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c6 on /var/snap/lxd/common/shmounts/storage-pools/default/containers/c6 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c2 on /var/snap/lxd/common/shmounts/storage-pools/default/containers/c2 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c15 on /var/snap/lxd/common/shmounts/storage-pools/default/containers/c15 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c16 on /var/snap/lxd/common/shmounts/storage-pools/default/containers/c16 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c17 on /var/snap/lxd/common/shmounts/storage-pools/default/containers/c17 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c19 on /var/snap/lxd/common/shmounts/storage-pools/default/containers/c19 type ext4 (rw,relatime)
/dev/mapper/cryptvg0-c18 on /var/snap/lxd/common/shmounts/storage-pools/default/containers/c18 type ext4 (rw,relatime)
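
To see which containers are affected without reading the mount table by eye, something like the following could be used. It is a rough sketch: unmounted_in_ns is a hypothetical helper name, the pool path is the one from this thread, and the input is expected to be the output of mount as seen inside the LXD mount namespace. It prints every container name given as an argument that has no mount at its expected pool path.

```shell
# Pool path as used in this thread; adjust for other setups.
pool=/var/snap/lxd/common/lxd/storage-pools/default/containers

# Read `mount` output on stdin and print each container name (given as
# arguments) that has no filesystem mounted at $pool/<name>.
# Hypothetical helper for illustration only.
unmounted_in_ns() {
    mounts=$(cat)
    for name in "$@"; do
        # Fixed-string match on " on <path> type " so that c2 does not match c20.
        if ! printf '%s\n' "$mounts" | grep -qF " on $pool/$name type "; then
            printf '%s\n' "$name"
        fi
    done
}
```

Run as root, it could be fed with something like nsenter -t <LXD pid> -m mount | unmounted_in_ns c1 c2 c20. Containers it reports are the ones that would need the manual mount inside the namespace described earlier.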

In this particular case there is no added mounting logic; there are just devices/mountpoints specified in /etc/fstab.

I see, but the fact is that while the backup.yaml file was present, the container was not able to start even after a manual remount, until the file was removed. However, I cannot reproduce this behavior, so it could have been just a coincidence and I may have hit the mounting race condition multiple times…

Yep indeed. I’m not so familiar with what the snap is doing regarding moving mounts into shmounts (or why).

@stgraber are you able to help with this? Thanks

That usually indicates a problem with the snap mount namespace management.
The output of journalctl -u snap.lxd.daemon shortly after the issue first appears may include useful debugging info on what failed during the mount reshuffle.

-- Reboot --
Jul 08 20:21:01 u2004-1 systemd[1]: Started Service for snap application lxd.daemon.
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: => Preparing the system (16044)
Jul 08 20:21:01 u2004-1 lxd.daemon[1443]: cmd_linux.go:160: cannot read /proc/self/exe: readlink /proc/self/exe: no such file or directory
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Loading snap configuration
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Setting up mntns symlink (mnt:[4026532323])
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Setting up mount propagation on /var/snap/lxd/common/lxd/devices
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Setting up persistent shmounts path
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ====> Making LXD shmounts use the persistent path
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ====> Making LXCFS use the persistent path
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Setting up kmod wrapper
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Preparing /boot
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Preparing a clean copy of /run
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Preparing a clean copy of /etc
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Preparing a clean copy of /usr/share/misc
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Setting up ceph configuration
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Setting up LVM configuration
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Rotating logs
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Setting up ZFS (0.8)
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Escaping the systemd cgroups
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ====> Detected cgroup V1
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Escaping the systemd process resource limits
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Increasing the number of inotify user instances
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: ==> Disabling shiftfs on this kernel (auto)
Jul 08 20:21:01 u2004-1 lxd.daemon[1415]: => Starting LXCFS
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: Running constructor lxcfs_init to reload liblxcfs
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: mount namespace: 4
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: hierarchies:
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:   0: fd:   5:
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:   1: fd:   6: name=systemd
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:   2: fd:   7: pids
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:   3: fd:   8: rdma
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:   4: fd:   9: devices
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:   5: fd:  10: hugetlb
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:   6: fd:  11: cpuset
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:   7: fd:  12: net_cls,net_prio
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:   8: fd:  13: freezer
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:   9: fd:  14: memory
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:  10: fd:  15: cpu,cpuacct
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:  11: fd:  16: perf_event
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]:  12: fd:  17: blkio
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: Kernel supports pidfds
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: Kernel does not support swap accounting
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: api_extensions:
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - cgroups
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - sys_cpu_online
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - proc_cpuinfo
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - proc_diskstats
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - proc_loadavg
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - proc_meminfo
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - proc_stat
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - proc_swaps
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - proc_uptime
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - shared_pidns
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - cpuview_daemon
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - loadavg_daemon
Jul 08 20:21:01 u2004-1 lxd.daemon[1541]: - pidfds
Jul 08 20:21:02 u2004-1 lxd.daemon[1415]: => Starting LXD
Jul 08 20:21:03 u2004-1 lxd.daemon[1552]: t=2020-07-08T20:21:03+0000 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight li>
Jul 08 20:21:03 u2004-1 lxd.daemon[1552]: t=2020-07-08T20:21:03+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swa>
Jul 08 20:21:06 u2004-1 lxd.daemon[1552]: t=2020-07-08T20:21:06+0000 lvl=eror msg="Failed starting container" action=start created=2020-07>
Jul 08 20:21:06 u2004-1 lxd.daemon[1552]: t=2020-07-08T20:21:06+0000 lvl=eror msg="Failed to start instance 'c11': Failed to run: /snap/lx>
Jul 08 20:21:06 u2004-1 lxd.daemon[1552]: t=2020-07-08T20:21:06+0000 lvl=eror msg="Failed starting container" action=start created=2020-07>
Jul 08 20:21:06 u2004-1 lxd.daemon[1552]: t=2020-07-08T20:21:06+0000 lvl=eror msg="Failed to start instance 'c12': Failed to run: /snap/lx>
Jul 08 20:21:07 u2004-1 lxd.daemon[1552]: t=2020-07-08T20:21:07+0000 lvl=eror msg="Failed starting container" action=start created=2020-07>
Jul 08 20:21:07 u2004-1 lxd.daemon[1552]: t=2020-07-08T20:21:07+0000 lvl=eror msg="Failed to start instance 'c19': Failed to run: /snap/lx>
Jul 08 20:21:11 u2004-1 lxd.daemon[1552]: t=2020-07-08T20:21:11+0000 lvl=eror msg="Failed starting container" action=start created=2020-07>
Jul 08 20:21:11 u2004-1 lxd.daemon[1552]: t=2020-07-08T20:21:11+0000 lvl=eror msg="Failed to start instance 'c9': Failed to run: /snap/lxd>
Jul 08 20:21:11 u2004-1 lxd.daemon[1415]: => LXD is ready

hi again,

Now I have hit a similar, likely related problem. During the night snapd performed a refresh of the lxd snap and restarted LXD, and one of the mounts in one container disappeared, although there was no restart of the container itself.

There is no mounting magic done on the server; there are 3 simple mounts in fstab:

/dev/mapper/vg0-udroot /var/snap/lxd/common/lxd/storage-pools/default/containers/ud ext4 defaults 0 2
/dev/mapper/vg0-data1 /var/snap/lxd/common/lxd/storage-pools/default/containers/ud/rootfs/data1 ext4 defaults 0 2
/dev/disk/by-path/ip-10.x.x.x:3260-iscsi-iqn.2015-01.iscsi.san:ud-data2-lun-0   /var/snap/lxd/common/lxd/storage-pools/default/containers/ud/rootfs/data2      ext4    relatime,acl,user_xattr,nofail  0       2

On LXD start there is this line in the log:

Failed to bind-mount '/var/snap/lxd/common/shmounts/storage-pools/default/containers/ud/rootfs/data1' onto '/var/snap/lxd/common/lxd/storage-pools/default/containers/ud/rootfs/data1: No such file or directory

even though the directory is present and was mounted until the LXD restart.

Here is the full output from journalctl -u snap.lxd.daemon:

-- Logs begin at Fri 2020-07-17 21:09:54 CEST, end at Tue 2020-08-04 12:11:20 CEST. --
Aug 03 23:31:03 v2 systemd[1]: Stopping Service for snap application lxd.daemon...
Aug 03 23:31:03 v2 lxd.daemon[3378896]: => Stop reason is: snap refresh
Aug 03 23:31:03 v2 lxd.daemon[3378896]: => Stopping LXD
Aug 03 23:31:04 v2 systemd[1]: snap.lxd.daemon.service: Succeeded.
Aug 03 23:31:04 v2 systemd[1]: Stopped Service for snap application lxd.daemon.
Aug 03 23:31:12 v2 systemd[1]: Started Service for snap application lxd.daemon.
Aug 03 23:31:12 v2 lxd.daemon[3379127]: => Preparing the system (16558)
Aug 03 23:31:12 v2 lxd.daemon[3379156]: cmd_linux.go:160: cannot read /proc/self/exe: readlink /proc/self/exe: no such file or directory
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ==> Loading snap configuration
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ==> Setting up mntns symlink (mnt:[4026532580])
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ==> Setting up mount propagation on /var/snap/lxd/common/lxd/devices
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ==> Setting up persistent shmounts path
Aug 03 23:31:12 v2 lxd.daemon[3379179]: Failed to bind-mount '/var/snap/lxd/common/shmounts/storage-pools/default/containers/ud/rootfs/data2' onto '/var/snap/lxd/common/lxd/storage-pools/default/containers/ud/rootfs/data2: No such file or directory
Aug 03 23:31:12 v2 lxd.daemon[3379179]: Failed to bind-mount '/var/snap/lxd/common/shmounts/storage-pools/default/containers/ud/rootfs/data1' onto '/var/snap/lxd/common/lxd/storage-pools/default/containers/ud/rootfs/data1: No such file or directory
Aug 03 23:31:12 v2 lxd.daemon[3379179]: Failed to mount new mntns: Invalid argument
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ====> Failed to setup shmounts, continuing without
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ====> Making LXD shmounts use the persistent path
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ====> Making LXCFS use the persistent path
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ==> Setting up kmod wrapper
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ==> Preparing /boot
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ==> Preparing a clean copy of /run
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ==> Preparing /run/bin
Aug 03 23:31:12 v2 lxd.daemon[3379127]: ==> Preparing a clean copy of /etc
Aug 03 23:31:13 v2 lxd.daemon[3379127]: ==> Preparing a clean copy of /usr/share/misc
Aug 03 23:31:13 v2 lxd.daemon[3379127]: ==> Setting up ceph configuration
Aug 03 23:31:13 v2 lxd.daemon[3379127]: ==> Setting up LVM configuration
Aug 03 23:31:13 v2 lxd.daemon[3379127]: ==> Rotating logs
Aug 03 23:31:13 v2 lxd.daemon[3379127]: ==> Setting up ZFS (0.8)
Aug 03 23:31:13 v2 lxd.daemon[3379127]: ==> Escaping the systemd cgroups
Aug 03 23:31:13 v2 lxd.daemon[3379127]: ====> Detected cgroup V1
Aug 03 23:31:13 v2 lxd.daemon[3379127]: ==> Escaping the systemd process resource limits
Aug 03 23:31:13 v2 lxd.daemon[3379127]: => Starting LXCFS
Aug 03 23:31:13 v2 lxd.daemon[3379257]: Running constructor lxcfs_init to reload liblxcfs
Aug 03 23:31:13 v2 lxd.daemon[3379257]: mount namespace: 4
Aug 03 23:31:13 v2 lxd.daemon[3379257]: hierarchies:
Aug 03 23:31:13 v2 lxd.daemon[3379257]:   0: fd:   5:
Aug 03 23:31:13 v2 lxd.daemon[3379257]:   1: fd:   6: name=systemd
Aug 03 23:31:13 v2 lxd.daemon[3379257]:   2: fd:   7: hugetlb
Aug 03 23:31:13 v2 lxd.daemon[3379257]:   3: fd:   8: net_cls,net_prio
Aug 03 23:31:13 v2 lxd.daemon[3379257]:   4: fd:   9: rdma
Aug 03 23:31:13 v2 lxd.daemon[3379257]:   5: fd:  10: freezer
Aug 03 23:31:13 v2 lxd.daemon[3379257]:   6: fd:  11: perf_event
Aug 03 23:31:13 v2 lxd.daemon[3379257]:   7: fd:  12: memory
Aug 03 23:31:13 v2 lxd.daemon[3379257]:   8: fd:  13: pids
Aug 03 23:31:13 v2 lxd.daemon[3379257]:   9: fd:  14: devices
Aug 03 23:31:13 v2 lxd.daemon[3379257]:  10: fd:  15: cpuset
Aug 03 23:31:13 v2 lxd.daemon[3379257]:  11: fd:  16: cpu,cpuacct
Aug 03 23:31:13 v2 lxd.daemon[3379257]:  12: fd:  17: blkio
Aug 03 23:31:13 v2 lxd.daemon[3379257]: Kernel supports pidfds
Aug 03 23:31:13 v2 lxd.daemon[3379257]: Kernel does not support swap accounting
Aug 03 23:31:13 v2 lxd.daemon[3379257]: api_extensions:
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - cgroups
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - sys_cpu_online
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - proc_cpuinfo
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - proc_diskstats
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - proc_loadavg
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - proc_meminfo
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - proc_stat
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - proc_swaps
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - proc_uptime
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - shared_pidns
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - cpuview_daemon
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - loadavg_daemon
Aug 03 23:31:13 v2 lxd.daemon[3379257]: - pidfds
Aug 03 23:31:14 v2 lxd.daemon[3379127]: => Starting LXD
Aug 03 23:31:14 v2 lxd.daemon[3379268]: t=2020-08-03T23:31:14+0200 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored"
Aug 03 23:31:14 v2 lxd.daemon[3379268]: t=2020-08-03T23:31:14+0200 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
Aug 03 23:31:17 v2 lxd.daemon[3379127]: => LXD is ready
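
The relevant lines in a journal like the one above are easy to miss among the LXCFS output. A minimal sketch for picking them out follows; shmounts_errors is a hypothetical helper name, and the patterns are simply the three failure messages that appear in this log, so other failure modes would not be caught.

```shell
# Read journal text on stdin and keep only the shmounts-related failures
# seen in this thread. Hypothetical helper; the patterns are taken from
# the log above and are not an exhaustive list.
shmounts_errors() {
    grep -E 'Failed to (bind-mount|mount new mntns|setup shmounts)'
}
```

It could be used as journalctl -u snap.lxd.daemon -b --no-pager | shmounts_errors to check the current boot. Piping the output also avoids the pager cutting off long lines.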

Can you show lxc config show --expanded ud?

The error being about a sub-directory of the container is rather odd.