Daemon failed to start

hi,

Recently I ran into a dead daemon on one of my servers: snap start lxd failed to start it.
In the logs I found error messages complaining about dqlite.
On the forum I found the suggestion to move the last dqlite segment file out of the way, and it helped:

root@tummy:/var/snap/lxd/common/lxd/database/global# mv 0000000000085677-0000000000085691 0000000000085677-0000000000085691~
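In case it helps others, a quick way to see which segment file is the newest before moving anything (this is from my server, adjust the path to yours):

cd /var/snap/lxd/common/lxd/database/global
ls -lv    # the NNNN-NNNN files are the closed segments and sort in order, newest last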

Awesome.
But what actually happened here, and why?

10x
tamas

Something bad happened to the database: it could have been a daemon crash during shutdown, your system crashing, running out of disk space, … In any case, the last DB transaction segment got corrupted and LXD couldn’t read it back during startup.
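For reference, the startup failure should be visible in the daemon logs; with the snap that would be something like:

snap logs lxd
cat /var/snap/lxd/common/lxd/logs/lxd.log

The dqlite complaint should show up there right before the daemon gives up.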

We now have @mbordere on our team, who’s working through quite a backlog of dqlite issues and doing a bunch of stress testing to track down problems like this.

Removing the last segment did not fix everything, though.

lxc start efop

Error: Failed preparing container for start: Failed to run: zfs mount tank/lxd/containers/efop: cannot mount 'tank/lxd/containers/efop': filesystem already mounted
Try lxc info --show-log efop for more info

Any advice on this?

If it’s an option, a reboot of your system will take care of that cleanly.

If not, can you show me the output of grep containers/efop /proc/*/mountinfo?
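For reference, a couple of read-only host-side checks can also help narrow down who thinks the dataset is mounted (dataset name taken from your error above):

zfs get mounted,mountpoint tank/lxd/containers/efop
findmnt | grep containers/efop

The mountinfo grep is the important one though, since findmnt only shows your own mount namespace while the stale entry may live in another process’ namespace.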

I can reboot it, but I would rather find a more permanent, engineered solution :)

A couple of months ago I had a very similar issue. I can’t find the solution from back then, but your instructions definitely helped (it was something about cleaning up namespaces…).

root@tummy:~# grep containers/efop /proc/*/mountinfo
/proc/2096/mountinfo:6037 762 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/2318/mountinfo:6037 762 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/2417/mountinfo:6037 762 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/2709135/mountinfo:6039 6216 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
grep: /proc/2854549/mountinfo: No such file or directory
grep: /proc/2854553/mountinfo: No such file or directory
grep: /proc/2854809/mountinfo: No such file or directory
grep: /proc/2854810/mountinfo: No such file or directory
grep: /proc/2854811/mountinfo: No such file or directory
grep: /proc/2854812/mountinfo: No such file or directory
grep: /proc/2854813/mountinfo: No such file or directory
grep: /proc/2854814/mountinfo: No such file or directory
/proc/2895/mountinfo:6037 762 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/3274/mountinfo:6037 762 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/3515/mountinfo:6037 762 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/3793/mountinfo:6037 762 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/4109630/mountinfo:6039 6216 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/4109760/mountinfo:6039 6216 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/4109863/mountinfo:6039 6216 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/4405/mountinfo:6037 762 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/4595/mountinfo:6037 762 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/4861/mountinfo:6037 762 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl
/proc/532341/mountinfo:6039 6216 0:89 / /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop rw,noatime shared:197 - zfs tank/lxd/containers/efop rw,xattr,posixacl

Thanks,

tamas

A reboot would get you a clean mount table, which should avoid further issues. What I can give you to avoid a reboot are just workarounds, and they may break things for the next container.

In this case, you can do:
nsenter -t 532341 umount -l /var/snap/lxd/common/shmounts/
nsenter -t 532341 umount /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop
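If you want to double-check before running those, you can compare the mount namespace of that PID (the one from the commands above) against the host’s; if the two inodes differ, the process holds its own mount namespace where the stale entry lives:

readlink /proc/532341/ns/mnt
readlink /proc/1/ns/mnt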

root@tummy:~# nsenter -t 532341 umount -l /var/snap/lxd/common/shmounts/
umount: /var/snap/lxd/common/shmounts/: not mounted.
root@tummy:~# nsenter -t 532341 umount /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop
umount: /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop: no mount point specified.

I can reboot the machine. I’m just worried that this issue will come back again in the future.

What could we do to avoid it?

Or do you expect these kinds of issues to also be solved by the dqlite fixes you mentioned in your first message?

10x

t

This has nothing to do with dqlite; it’s a mount namespace bug that has been affecting LXD for a few years. We’re still trying to come up with a solid reproducer so we can fix the remaining edge cases.

Sadly, once the bug hits there isn’t a lot we can do. We can clear some mount entries to make LXD happy again, but that in turn may affect other containers.
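For completeness, “clearing the mount entries” means entering the mount namespace of each process that still shows the stale entry and lazily unmounting it there, roughly like this (a sketch, not something to run blindly; note the -m flag, which tells nsenter to actually join the target’s mount namespace rather than run in your own, and which may be why the earlier attempt reported “not mounted”):

nsenter -t 532341 -m umount -l /var/snap/lxd/common/shmounts/storage-pools/default/containers/efop

You’d have to repeat that for each PID from the mountinfo grep that still exists, which is exactly the part that tends to upset other containers.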

That’s why I usually lead with just rebooting the machine, as that guarantees you’re back in a sane state. It’s only when that’s not an option that we start offering the workarounds :wink: