We are running 16 LXC servers on Ubuntu 18.04 with lxc/lxd 3.14. On July 1st, snap refresh did an auto-refresh on one of our servers and completely removed LXC and disabled LXD. The /var/snap directory only shows “core” and “lxd”. It seems our containers are still running but we cannot manage or monitor them.
snap list shows this:
root@Container-Server-002:/var/snap# snap list
Name  Version    Rev    Tracking  Publisher   Notes
core  16-2.39.3  7270   stable    canonical✓  core
lxd   3.14       10972  stable/…  canonical✓  disabled
root@HJ-WP-Container-002:/var/snap#
How can I get LXC and LXD working again? What debugging output can I provide to get this resolved as soon as possible (this is a production server)? We can’t reboot the server because the containers won’t get restarted (lxd is disabled and lxc is missing).
root@Container-Server-002:/snap# snap changes
ID  Status  Spawn                     Ready               Summary
13  Abort   4 days ago, at 22:27 UTC  -                   Auto-refresh snap "lxd"
14  Done    today at 01:24 UTC        today at 01:24 UTC  Refresh all snaps: no updates
15  Done    today at 01:46 UTC        today at 01:46 UTC  Refresh all snaps: no updates
Ok, so first thing would be to get more details on the aborted refresh with snap change 13.
This may point towards some serious issue (I/O error, disk full, …) which should be addressed first.
If nothing like that is in there, then you can try a snap refresh lxd, which should get you to revision 11098.
I’d expect a successful refresh would re-enable the snap, but if it doesn’t, then snap enable lxd may help there.
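As a rough sketch of that order of operations (the `run` helper, the `DRY_RUN` switch and the function names are illustrative additions, not part of snapd; the change ID comes from `snap changes`):

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1 (illustrative helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# Step 1: read the task log of the aborted change before touching anything.
inspect_change() {
    run snap change "$1"
}

# Step 2: only once the log shows nothing serious (I/O error, disk full, ...),
# retry the refresh and re-enable the snap if it is still flagged "disabled".
retry_lxd_refresh() {
    run snap refresh lxd || return 1
    if [ "${DRY_RUN:-0}" = 1 ] || snap list lxd | grep -q disabled; then
        run snap enable lxd
    fi
}
```

For example, `DRY_RUN=1 inspect_change 13` only prints the command it would run, which is a cheap way to double-check on a production box.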
root@Container-Server-002:/var/snap/lxd/storage-pools/default/containers# snap change 13
Status  Spawn                     Ready               Summary
Done    4 days ago, at 22:27 UTC  today at 02:14 UTC  Ensure prerequisites for "lxd" are available
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Download snap "lxd" (11098) from channel "stable/ubuntu-18.10"
Done    4 days ago, at 22:27 UTC  today at 02:14 UTC  Fetch and check assertions for snap "lxd" (11098)
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Mount snap "lxd" (11098)
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Run pre-refresh hook of "lxd" snap if present
Error   4 days ago, at 22:27 UTC  today at 02:14 UTC  Stop snap "lxd" services
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Remove aliases for snap "lxd"
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Make current revision for snap "lxd" unavailable
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Copy snap "lxd" data
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Setup snap "lxd" (11098) security profiles
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Make snap "lxd" (11098) available to the system
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Automatically connect eligible plugs and slots of snap "lxd"
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Set automatic aliases for snap "lxd"
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Setup snap "lxd" aliases
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Run post-refresh hook of "lxd" snap if present
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Start snap "lxd" (11098) services
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Remove data for snap "lxd" (10934)
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Remove snap "lxd" (10934) from the system
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Clean up "lxd" (11098) install
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Run configure hook of "lxd" snap if present
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Consider re-refresh of "lxd"
......................................................................
Stop snap "lxd" services
2019-07-06T02:14:55Z ERROR [start snap.lxd.activate.service] failed with exit status 1: Job for snap.lxd.activate.service failed because the control process exited with error code.
See "systemctl status snap.lxd.activate.service" and "journalctl -xe" for details.
After further debugging, the root of the issue is the core snap. After trying to get the lxd service restarted, it seems I no longer have the /snap/core directory on this server, and no amount of apt purge snapd or apt remove/install snapd will fix this problem. The output from “snap list”:
Name  Version  Rev    Tracking  Publisher   Notes
core           7270   stable    canonical✓  broken
lxd   3.14     10972  stable/…  canonical✓  -
It seems the only way I can solve this problem is manually copying the containers to a new server. Maybe you know some magic to get the snapd core program installed again?
root@Container-Server-002:/usr/local/bin# snap refresh core
snap "core" has no updates available
root@Container-Server-002:/usr/local/bin# snap install core
snap "core" is already installed, see 'snap help refresh'
BTW - As a troubleshooting method, I cloned this VM, made a snapshot, and then removed snapd via apt purge snapd. This resulted in removing ALL the container directories in /var/snap/lxd/storage-pools/default/containers. I reverted to the prior snapshot, unmounted /var/snap/lxd/storage-pools/default, and then tried “apt purge snapd” again. This time, both LXD and snapd were removed. I reinstalled snapd via apt install snapd, but the system complained the /snap/core directory was invalid.
Something bad happened to this server during the auto-refresh process of snapd, and I can’t determine the exact cause. Once snapd dies, lxd goes with it. This is very, very bad. And it does not seem to be an isolated case. A quick Google search turned up a “few” people having problems with snapd and app unavailability.
At this point, I can manually move the containers to another server, but I can’t get any specific profile details per container.
BTW - I consider myself to be a fairly good server admin, but considering how I am unable to revive the system from a snapd failure, I am really considering moving away from Ubuntu and back to Debian without snapd.
root@Container-Server-002:/usr/local/bin# snap refresh core --edge
2019-07-06T04:23:19Z INFO Waiting for restart...
core (edge) 16-2.40~pre1+git1384.7c34ee7 from Canonical✓ refreshed
root@Container-Server-002:/usr/local/bin# snap list
Name  Version                       Rev    Tracking  Publisher   Notes
core  16-2.40~pre1+git1384.7c34ee7  7353   edge      canonical✓  core
lxd   3.14                          10972  stable/…  canonical✓  -
root@Container-Server-002:/usr/local/bin# snap refresh core --stable
Copy snap "core" data \
It seems the “Copy snap "core" data” task just runs forever. It has been “spinning” for about 3 minutes now. Also, I noticed this server has huge disk I/O wait times, but I have not tracked down the culprit. I suspect something with lxd and/or btrfs…
Sounds like you already have that revision of the snap on your system, so snapd isn’t re-downloading the stable core snap, but it is clearly failing to mount the local copy.
I don’t know quite enough about snapd to know the best way to get it to flush that revision and properly re-download it. You may want to snap refresh core --beta to be on something less volatile than edge, then either reach out to the snapd folks to figure out how to switch back to stable, or wait until stable is brought to a revision higher than 7270 and refresh to it then.
I’ve got to go to bed as it’s 1am here, hopefully that core dance will have fixed whatever was blocking LXD and it will come back online just fine.
If not, you probably want to try a reboot to have everything get mounted in the right order.
As storage slowness is at play, keep an eye on dmesg for anything bad in there too.
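A small filter can make that dmesg check less noisy. This is only a sketch: `storage_errors` is a hypothetical helper, and the pattern list is a guess at what slow or failing storage tends to log, not an exhaustive set.

```shell
#!/bin/sh
# storage_errors: keep only kernel log lines (read from stdin) that hint at
# storage trouble. The patterns are assumptions; extend them for your setup.
storage_errors() {
    grep -iE 'i/o error|btrfs|blocked for more than|ext4-fs error'
}

# Usage (needs permission to read the kernel log):
#   dmesg -T | storage_errors
```

The "blocked for more than" pattern is meant to catch the kernel's hung-task warnings, which often show up when a disk is stalling.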
If you really need to move the containers somewhere else, replicating the directory structure in /var/snap/lxd/common/lxd/storage-pools on another server will let you import them back into LXD with lxd import NAME. Our backup document includes instructions for such disaster recovery cases.
But so far there doesn’t appear to be anything that would prevent lxd from just starting back up. The data is still there, and all that seems messed up are snap mounts getting in the way of the lxd snap getting a working filesystem.
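If it does come to moving containers, a helper like the one below could list which import commands would be needed once the storage-pools tree is in place on the new server. It is only a sketch: `print_import_plan` is a hypothetical name, it assumes the `<pool>/containers/<name>` layout seen earlier in this thread, and it prints the commands rather than running them.

```shell
#!/bin/sh
# print_import_plan [POOL_ROOT]: for every container directory found under
# POOL_ROOT/<pool>/containers/, print the matching `lxd import` command.
# Nothing is executed; review the list and run the commands by hand.
print_import_plan() {
    pool_root="${1:-/var/snap/lxd/common/lxd/storage-pools}"
    for dir in "$pool_root"/*/containers/*/; do
        [ -d "$dir" ] || continue   # the glob matched nothing
        echo "lxd import $(basename "$dir")"
    done
}

# Usage: print_import_plan
```

Printing instead of importing keeps the helper safe to run on a server that is already in a fragile state.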
I wonder if the issue was somehow caused by a refresh happening during one of those backups, causing it to take so long it eventually timed out? You may want to set a snap refresh window that avoids such times (or firewall off the store entirely to block refreshes, then unblock when you want to do them manually).
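Setting such a window could look roughly like this. The `run` helper, the `DRY_RUN` switch and the function name are illustrative, and the window value is just an example; pick one that does not overlap your backups.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1 (illustrative helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# set_refresh_window WINDOW: restrict automatic refreshes to WINDOW, then show
# the resulting timer and the next scheduled refresh.
set_refresh_window() {
    run snap set system refresh.timer="$1"
    run snap refresh --time
}

# Usage (as root): set_refresh_window "sat,02:00-04:00"
```

This narrows when refreshes happen rather than switching them off, so manual refreshes remain possible at any time.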
Hmm, that’s pretty odd. Do you have some kind of mount for your data on top of /var/snap/lxd or similar which would explain why none of it is visible right now?
What do you see in /var/snap/lxd/common/lxd/containers?
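To answer the mount question quickly, something like the following could list everything mounted at or below the LXD snap directory. It is a sketch only; `mounts_under` is a hypothetical helper that reads `/proc/self/mounts`-style lines.

```shell
#!/bin/sh
# mounts_under PREFIX: read mount table lines on stdin (device, mount point,
# filesystem type, ...) and print the mount points at or below PREFIX together
# with their filesystem type.
mounts_under() {
    awk -v p="$1" '$2 == p || index($2, p "/") == 1 { print $2 " (" $3 ")" }'
}

# Usage:
#   mounts_under /var/snap/lxd < /proc/self/mounts
#   ls -la /var/snap/lxd/common/lxd/containers
```

An empty result would suggest nothing is mounted over that path in the current mount namespace.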