LXD 3.14 on snap fails

LXD 3.14 (snap) on Ubuntu 18.04 - it has been running fine. I updated Ubuntu today and upon reboot, most lxc commands just hang, meaning the console sits there unresponsive after issuing an lxc command, e.g., "lxc list" or "lxc storage list".

None of the containers are running - their services are inaccessible, and I also cannot "lxc exec container-name /bin/bash"; as above, the console is simply unresponsive.

I noticed that LXD was updated to 3.14 yesterday, so I ran

snap revert lxd

which downgraded to LXD 3.13, and now lxc commands work again and the container services are running and accessible.


I have a few snap LXD installs on 18.04 and had only a temporary issue with 3.14 on one of them (nothing as annoying, just a loss of network connectivity, fixed by a reboot). Maybe you could try moving to 3.14 again to see whether it was a rogue condition caused by the automatic snap upgrade conflicting with the apt update (I wonder if there is some synchronization between the two update mechanisms, or if it's just 'shrug, bad stuff can happen').

I did reboot several times after the upgrade and that didn't help, but if you're suggesting that, now that Ubuntu is upgraded, I do the snap update to 3.14 again, I can try that.

I too am having this problem: after the snap refresh to 3.14, everything broke and lxc commands simply hang (no containers started).

I reverted and upgraded the snap again as suggested, then rebooted to be safe - still the same problem. I reverted once more, and immediately after reverting the snap, the commands work just as yajrendrag experienced, and the containers automatically start coming up (I have autostart enabled). There is definitely something seriously wrong in this latest version. I have Ubuntu 18.04 fully updated, running this snap and only this snap.

Since the release of LXD 3.14 to stable users, we made two changes:

  • Fixed a network problem when users have manually defined routes on an LXD network
  • Fixed an update issue with btrfs container snapshots following a previous update failure

The stable channel was updated 5 minutes ago, so you may want to give it another try, especially if you're using btrfs.

I just did a snap refresh of lxd and, after a reboot, it is still broken; commands simply hang. I do have my containers on btrfs.

Are you on 10934? (ls /snap/lxd -lart)
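The revision check in parentheses can also be scripted. This is a hedged sketch run against a throwaway directory tree that mimics the /snap/lxd layout (revision numbers taken from this thread); on a real host you would simply inspect /snap/lxd itself, or run "snap list lxd".

```shell
# Sketch of the revision check: the "current" symlink under /snap/lxd
# points at the active revision. The tree below is a stand-in built in a
# temp dir; revision numbers are the ones mentioned in this thread.
set -eu
demo=$(mktemp -d)
mkdir -p "$demo/lxd/10756" "$demo/lxd/10934"
ln -s 10934 "$demo/lxd/current"   # "current" points at the active revision
active=$(readlink "$demo/lxd/current")
echo "active revision: $active"
rm -rf "$demo"
```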

I am on 10934. The update to stable must have forced an auto-refresh to 3.14, and while my containers are still running, I get this error in response to lxc list:

Error: Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused

This happened yesterday, before my Ubuntu update.

Let me know if there is something else you want to check before I revert again…

I am not using btrfs, but ZFS. I am not using manually defined routes, but I am using bridged networking on the host (not the LXD bridge) so that containers get DHCP addresses on the same network as the rest of my devices…
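The "connection refused" error quoted earlier can be narrowed down with one quick check: whether the daemon's unix socket exists at all. A hedged sketch (the socket path is the snap one from the error message; the interpretations in the comments are general unix-socket behavior, not LXD-specific diagnostics):

```shell
# If "dial unix ... connection refused" appears, check whether the socket
# file exists: an absent socket suggests the daemon never started, while a
# present-but-refusing socket suggests it died after binding.
sock=/var/snap/lxd/common/lxd/unix.socket
if [ -S "$sock" ]; then
  state="socket present (daemon bound it but may have died)"
else
  state="socket absent (daemon never started)"
fi
echo "$state"
```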


gpatel-fr: I know soulseeker and I had the same issue this morning. After reverting, everything is going well. To be honest, it was the first scary LXC/LXD issue I have ever had.

Correct on the build; I am also using host-defined bridges here, no routes, with DHCP pulled from the network.

The lxd daemon is probably dead. Check with something like:

ps aux | grep lxd
systemctl status snap.daemon.lxd

and possibly try to restart it with snap start lxd.

I have just done an apt full-upgrade on one of my Ubuntu 18.04 servers, and with snap revision 10934 LXD works fine. Possibly of interest for people like me who automatically install only security updates: the recent Ubuntu updates are quite large, notably OpenSSL 1.1.1 (for TLS 1.3) and also a ufw update that could perhaps conflict with some LXD customizations.

I don't doubt there is a real problem; the LXD maintainer just acknowledged it. Possibly the fix broke something for other people - that's always a risk with rushed patches. However, it works for me, so it's not a universal problem. It's interesting to compare experiences and try to find common points between the people seeing problems.

This seems telling:

sudo systemctl status snap.daemon.lxd
Unit snap.daemon.lxd.service could not be found.

sudo snap start lxd completes fine, but systemctl status still returns the above result…

Oops, that's
sudo systemctl status snap.lxd.daemon

snap.lxd.daemon.service - Service for snap application lxd.daemon
   Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2019-06-19 15:19:35 MDT; 1min 4s ago
  Process: 31311 ExecStart=/usr/bin/snap run lxd.daemon (code=exited, status=1/FAILURE)
 Main PID: 31311 (code=exited, status=1/FAILURE)

Jun 19 15:19:35 lab0 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
Jun 19 15:19:35 lab0 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 5.
Jun 19 15:19:35 lab0 systemd[1]: Stopped Service for snap application lxd.daemon.
Jun 19 15:19:35 lab0 systemd[1]: snap.lxd.daemon.service: Start request repeated too quickly.
Jun 19 15:19:35 lab0 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Jun 19 15:19:35 lab0 systemd[1]: Failed to start Service for snap application lxd.daemon.

Yes, that's wrong. Now, what's in the log file:


t=2019-06-19T16:15:13-0600 lvl=info msg="LXD 3.14 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2019-06-19T16:15:13-0600 lvl=info msg="Kernel uid/gid map:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - u 0 0 4294967295"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - g 0 0 4294967295"
t=2019-06-19T16:15:13-0600 lvl=info msg="Configured LXD uid/gid map:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - u 0 1000000 1000000000"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - g 0 1000000 1000000000"
t=2019-06-19T16:15:13-0600 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
t=2019-06-19T16:15:13-0600 lvl=info msg="Kernel features:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - netnsid-based network retrieval: no"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - uevent injection: no"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - seccomp listener: no"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - unprivileged file capabilities: yes"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - shiftfs support: no"
t=2019-06-19T16:15:13-0600 lvl=info msg="Initializing local database"
t=2019-06-19T16:15:13-0600 lvl=info msg="Starting /dev/lxd handler:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2019-06-19T16:15:13-0600 lvl=info msg="REST API daemon:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-06-19T16:15:13-0600 lvl=info msg=" - binding TCP socket" socket=[::]:8440
t=2019-06-19T16:15:13-0600 lvl=info msg="Initializing global database"
t=2019-06-19T16:15:13-0600 lvl=info msg="Initializing storage pools"
t=2019-06-19T16:15:13-0600 lvl=info msg="Applying patch: storage_api_rename_container_snapshots_dir_again"
t=2019-06-19T16:15:13-0600 lvl=eror msg="Failed to mount DIR storage pool \"/var/lib/snapd/hostfs/media/lxd-pool2\" onto \"/var/snap/lxd/common/lxd/storage-pools/lxd-pool2\": no such file or directory"
t=2019-06-19T16:15:13-0600 lvl=eror msg="Failed to start the daemon: no such file or directory"
t=2019-06-19T16:15:13-0600 lvl=info msg="Starting shutdown sequence"
t=2019-06-19T16:15:13-0600 lvl=info msg="Stopping REST API handler:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - closing socket" socket=[::]:8440
t=2019-06-19T16:15:13-0600 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-06-19T16:15:13-0600 lvl=info msg="Stopping /dev/lxd handler:"
t=2019-06-19T16:15:13-0600 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2019-06-19T16:15:13-0600 lvl=info msg="Closing the database"

It seems that the problem is here. It's disturbing to see the patch that is supposed, I think, to fix the btrfs problem right before the dir storage mount problem. There was also a change to dir storage in 3.14 to allow quotas. It's unclear what could cause what. Do you have the same kind of message about the dir storage in the previous days (before the btrfs patch)? See the previous lxd.log files.
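Checking the previous lxd.log files for the same dir-storage error, as suggested, can be done with grep. The log line in this sketch is fabricated for the sake of a self-contained example; on the real host you would point grep at /var/snap/lxd/common/lxd/logs/lxd.log and any rotated copies.

```shell
# Grep earlier logs for the same dir-storage mount error. The log file
# here is a fabricated stand-in; on the host, target
# /var/snap/lxd/common/lxd/logs/lxd.log and its rotated copies instead.
set -eu
log=$(mktemp)
cat > "$log" <<'EOF'
t=2019-06-18T09:00:00-0600 lvl=info msg="Initializing storage pools"
t=2019-06-18T09:00:00-0600 lvl=eror msg="Failed to mount DIR storage pool"
EOF
hits=$(grep -c 'Failed to mount DIR storage pool' "$log")
echo "matching entries: $hits"
rm -f "$log"
```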

This error suggests that one of those two paths is missing:

  • /media/lxd-pool2
  • /var/snap/lxd/common/lxd/storage-pools/lxd-pool2

My guess would usually be the former, as that path is not under LXD's control, but the latter would explain it too.
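A quick way to see which of the two candidate paths is the missing one. This sketch uses a throwaway tree that mirrors the reported state (the /media source deliberately absent); on the real host you would test the two absolute paths directly.

```shell
# Check which of the two candidate paths is missing, against a stand-in
# tree mirroring the reported failure: the pool's mount target exists
# under the snap's storage-pools dir, but the /media source does not.
set -eu
root=$(mktemp -d)
mkdir -p "$root/var/snap/lxd/common/lxd/storage-pools/lxd-pool2"
result=""
for p in "$root/media/lxd-pool2" \
         "$root/var/snap/lxd/common/lxd/storage-pools/lxd-pool2"; do
  if [ -d "$p" ]; then
    result="$result present:${p#"$root"}"
  else
    result="$result missing:${p#"$root"}"
  fi
done
echo "$result"
rm -rf "$root"
```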

LXD 3.14 has a patch which must run on initial startup and which corrects some issues with container snapshots on all storage pools. As a result, all storage pools get brought up so they can be inspected.

In your case, it appears that one of your storage pools isn't currently functional (it cannot be mounted due to a missing path), which causes the update to fail and LXD to refuse to start.

Resolving the source of the problem (likely /media/lxd-pool2 being missing for some reason) will let LXD's patch apply and LXD come back online.

If that's a pool which shouldn't exist, you can work around the current issue by just creating the missing directory. Once LXD is online, delete the storage pool through LXD, at which point you can safely get rid of the directory.
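The workaround above amounts to recreating the missing mount source so the pool can mount, then cleaning up through LXD once it is back. A sketch with a stand-in path (the temp dir stands in for / on the real host; the real missing path was /media/lxd-pool2):

```shell
# Sketch of the workaround: recreate the missing mount source so the pool
# can mount again, then clean up through LXD once it is back online.
set -eu
root=$(mktemp -d)                # stand-in for / on the real host
pool_src="$root/media/lxd-pool2" # the missing mount source
mkdir -p "$pool_src"             # step 1: recreate the missing directory
status="recreated"
echo "pool source $status: $pool_src"
# Steps on the real host (not run here):
#   sudo systemctl start snap.lxd.daemon   # the 3.14 patch can now apply
#   lxc storage delete lxd-pool2           # only if the pool should not exist
rm -rf "$root"
```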

We had a few similar reports of issues with this update where people suddenly noticed they had created a whole bunch of storage pools backed by directories in /tmp. As those obviously don't survive a reboot, and the users had forgotten to delete those pools from LXD, things broke when LXD now had to get them all mounted to inspect their content.
The same trick of creating the needed directories and starting LXD again resolved that, at which point the users deleted the pools from LXD to avoid further issues down the line.