Snapd auto-refresh has removed lxc and disabled lxd!

rkelleyrtp · July 6, 2019, 1:36am

We are running 16 LXC servers on Ubuntu 18.04 with lxc/lxd 3.14. On July 1st, snap refresh did an auto-refresh on one of our servers and completely removed LXC and disabled LXD. The /var/snap directory only shows “core” and “lxd”. It seems our containers are still running but we cannot manage or monitor them.

snap list shows this:

root@Container-Server-002:/var/snap# snap list
Name  Version    Rev    Tracking  Publisher   Notes
core  16-2.39.3  7270   stable    canonical✓  core
lxd   3.14       10972  stable/…  canonical✓  disabled
root@HJ-WP-Container-002:/var/snap#

How can I get LXC and LXD working again? What debugging output can I provide to get this resolved as soon as possible (this is a production server)? We can’t reboot the server because the containers won’t get restarted (lxd is disabled and lxc is missing).

rkelleyrtp · July 6, 2019, 1:48am

Also, “snap changes” shows this:

root@Container-Server-002:/snap# snap changes
ID   Status  Spawn                     Ready               Summary
13   Abort   4 days ago, at 22:27 UTC  -                   Auto-refresh snap "lxd"
14   Done    today at 01:24 UTC        today at 01:24 UTC  Refresh all snaps: no updates
15   Done    today at 01:46 UTC        today at 01:46 UTC  Refresh all snaps: no updates

stgraber · July 6, 2019, 3:38am

Ok, so first thing would be to get more details on the aborted refresh with snap change 13.

This may point towards some serious issue (I/O error, disk full, …) which should be addressed first.
If nothing like that is in there, then you can try a snap refresh lxd, which should get you to revision 11098.

I’d expect a successful refresh would re-enable the snap, but if it doesn’t, then snap enable lxd may help there.
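
The sequence above could be scripted roughly as follows. This is only a sketch: `run`, `APPLY` and `recover_lxd_refresh` are hypothetical names introduced here so that the commands are printed rather than executed by default, and the `df` paths are just the usual places to check for a full disk.

```shell
#!/bin/sh
# Sketch only. "run" and APPLY are hypothetical helpers, not part of snapd:
# by default each command is printed; set APPLY=1 to actually execute it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

recover_lxd_refresh() {
    change_id="$1"                   # ID from "snap changes" (13 in this thread)
    run snap change "$change_id"     # per-task status and error log of the aborted refresh
    run df -h /var/lib/snapd /snap   # rule out a full disk before retrying
    run snap refresh lxd             # retry the refresh
    run snap list lxd                # check whether Notes still says "disabled"
    run snap enable lxd              # only needed if the snap is still disabled
}

recover_lxd_refresh 13
```

Run as-is, this only prints the five commands. On a production host it is probably safer to run them one at a time and read the output of each before moving on.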

rkelleyrtp · July 6, 2019, 4:01am

Thanks Stephane.

Here is the output from “snap change 13”:

root@Container-Server-002:/var/snap/lxd/storage-pools/default/containers# snap change 13
Status  Spawn                     Ready               Summary
Done    4 days ago, at 22:27 UTC  today at 02:14 UTC  Ensure prerequisites for "lxd" are available
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Download snap "lxd" (11098) from channel "stable/ubuntu-18.10"
Done    4 days ago, at 22:27 UTC  today at 02:14 UTC  Fetch and check assertions for snap "lxd" (11098)
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Mount snap "lxd" (11098)
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Run pre-refresh hook of "lxd" snap if present
Error   4 days ago, at 22:27 UTC  today at 02:14 UTC  Stop snap "lxd" services
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Remove aliases for snap "lxd"
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Make current revision for snap "lxd" unavailable
Undone  4 days ago, at 22:27 UTC  today at 02:14 UTC  Copy snap "lxd" data
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Setup snap "lxd" (11098) security profiles
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Make snap "lxd" (11098) available to the system
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Automatically connect eligible plugs and slots of snap "lxd"
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Set automatic aliases for snap "lxd"
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Setup snap "lxd" aliases
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Run post-refresh hook of "lxd" snap if present
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Start snap "lxd" (11098) services
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Remove data for snap "lxd" (10934)
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Remove snap "lxd" (10934) from the system
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Clean up "lxd" (11098) install
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Run configure hook of "lxd" snap if present
Hold    4 days ago, at 22:27 UTC  today at 01:45 UTC  Consider re-refresh of "lxd"

......................................................................
Stop snap "lxd" services

2019-07-06T02:14:55Z ERROR [start snap.lxd.activate.service] failed with exit status 1: Job for snap.lxd.activate.service failed because the control process exited with error code.
See "systemctl status snap.lxd.activate.service" and "journalctl -xe" for details.

After further debugging, the root of the issue appears to be the core snap. After trying to get the lxd service restarted, I found that the /snap/core directory no longer exists on this server, and no amount of apt purge snapd or apt remove/install snapd fixes the problem. Here is the output from “snap list”:

Name  Version  Rev    Tracking  Publisher   Notes
core           7270   stable    canonical✓  broken
lxd   3.14     10972  stable/…  canonical✓  -

It seems the only way I can solve this problem is manually copying the containers to a new server. Maybe you know some magic to get the snapd core program installed again?

stgraber · July 6, 2019, 4:08am

What happens if you do a snap refresh core or snap install core at this point?

Also, is there anything relevant to core in snap changes?

rkelleyrtp · July 6, 2019, 4:19am

Can’t update the snap core.

root@Container-Server-002:/usr/local/bin# snap refresh core 
snap "core" has no updates available

root@Container-Server-002:/usr/local/bin# snap install core
snap "core" is already installed, see 'snap help refresh'

BTW - As a troubleshooting step, I cloned this VM, made a snapshot, and then removed snapd via apt purge snapd. This removed ALL the container directories in /var/snap/lxd/storage-pools/default/containers. I reverted to the prior snapshot, unmounted /var/snap/lxd/storage-pools/default, and then tried “apt purge snapd” again. This time, both LXD and snapd were removed. I reinstalled snapd via apt install snapd, but the system complained that the /snap/core directory was invalid.

Something bad happened to this server during the auto-refresh process of snapd, and I can’t determine the exact cause. Once snapd dies, lxd goes with it. This is very, very bad. And, it does not seem to be an isolated case. A quick Google search turned up a “few” people having problems with snapd and app unavailability.

At this point, I can manually move the containers to another server, but I can’t get any specific profile details per container.

BTW - I consider myself to be a fairly good server admin, but considering how I am unable to revive the system from a snapd failure, I am really considering moving away from Ubuntu and back to Debian without snapd.

stgraber · July 6, 2019, 4:21am

Try doing:

snap refresh core --edge
snap refresh core --stable

rkelleyrtp · July 6, 2019, 4:27am

root@Container-Server-002:/usr/local/bin# snap refresh core --edge
2019-07-06T04:23:19Z INFO Waiting for restart...
core (edge) 16-2.40~pre1+git1384.7c34ee7 from Canonical✓ refreshed

root@Container-Server-002:/usr/local/bin# snap list
Name  Version                       Rev    Tracking  Publisher   Notes
core  16-2.40~pre1+git1384.7c34ee7  7353   edge      canonical✓  core
lxd   3.14                          10972  stable/…  canonical✓  -

root@Container-Server-002:/usr/local/bin# snap refresh core --stable
Copy snap "core" data                                                                                   \

It seems the “Copy snap "core" data” task just runs forever. It has been “spinning” for about 3 minutes now. Also, I noticed this server has huge disk I/O wait times, but I have not tracked down the culprit. I suspect something with lxd and/or btrfs…

stgraber · July 6, 2019, 4:33am

Okay, so that seems to have unstuck the core snap at least. Once that completes, you’ll be back on the stable core snap.

LXD itself doesn’t show up as disabled now so that’s promising too.

Can you show ps fauxww to see exactly what’s running and what’s not?

rkelleyrtp · July 6, 2019, 4:43am

Seems we have some nightly backup processes stuck. This is what’s consuming all disk IO. I am trying to kill those right now.

rkelleyrtp · July 6, 2019, 4:54am

I cloned the VM again and ran the “snap refresh core --edge” command, then the “snap refresh core --stable” command. Here is the output:

root@Container-002:/var/snap/lxd/common# snap refresh core --stable
error: cannot perform the following tasks:
- Setup snap "core" (7270) security profiles (cannot find installed snap "core" at revision 7270: 
missing file /snap/core/7270/meta/snap.yaml)

root@Container-002:/var/snap/lxd/common# ls -la /snap/core
total 8
drwxr-xr-x  3 root root 4096 Jul  6 04:52 .
drwxr-xr-x  5 root root 4096 Jul  6 04:23 ..
drwxr-xr-x 24 root root  321 Jul  4 04:23 7353
lrwxrwxrwx  1 root root    4 Jul  6 04:52 current -> 7353

root@Container-002:/var/snap/lxd/common# snap list
Name  Version                       Rev    Tracking  Publisher   Notes
core  16-2.40~pre1+git1384.7c34ee7  7353   edge      canonical✓  core
lxd   3.14                          11098  stable/…  canonical✓  -

It seems core revision 7270, which the “snap refresh core --stable” command requires, is not present under /snap/core.

stgraber · July 6, 2019, 4:57am

Sounds like snapd believes it already has that revision of the snap on your system, so it isn’t re-downloading the stable core snap, but it is clearly failing to mount the local copy.

I don’t know quite enough about snapd to know the best way to get it to flush that revision and properly re-download it. You may want to snap refresh core --beta to be on something less volatile than edge, then either reach out to the snapd folks to figure out how to switch back to stable, or wait until stable moves to a revision higher than 7270 and refresh to it.

What’s the state of LXD with all that?

stgraber · July 6, 2019, 5:09am

I’ve got to go to bed as it’s 1am here, hopefully that core dance will have fixed whatever was blocking LXD and it will come back online just fine.

If not, you probably want to try a reboot to have everything get mounted in the right order.
As storage slowness is at play, keep an eye on dmesg for anything bad in there too.

If you really need to move the containers somewhere else, replicating the directory structure in /var/snap/lxd/common/lxd/storage-pools on another server will let you import them back into LXD with lxd import NAME. Our backup document includes some instructions for such disaster recovery cases.
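
Once the storage-pools tree has been copied over, the import step could be looped over each container directory. A rough sketch, assuming the pool is named default (as in the paths earlier in this thread) and that each directory name is the container name; `run`, `APPLY` and `import_all` are hypothetical helper names, and commands are only printed unless APPLY=1 is set.

```shell
#!/bin/sh
# Sketch only. "run" and APPLY are hypothetical helpers, not part of LXD:
# by default each command is printed; set APPLY=1 to actually execute it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

# Issue one "lxd import NAME" per directory found under the containers path.
import_all() {
    # Assumed default: a pool called "default" under the path mentioned above.
    containers_dir="${1:-/var/snap/lxd/common/lxd/storage-pools/default/containers}"
    for dir in "$containers_dir"/*/; do
        [ -d "$dir" ] || continue              # skip when the glob matched nothing
        run lxd import "$(basename "$dir")"    # directory name is taken as the container name
    done
}

import_all
```

The copy itself is left out on purpose. How best to replicate the tree depends on the storage backend, so follow the backup document for that part.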

But so far there doesn’t appear to be anything that would prevent lxd from just starting back up. The data is still there, and all that seems messed up are the snap mounts, which are getting in the way of the lxd snap getting a working filesystem.

I wonder if the issue was somehow caused by a refresh happening during one of those backups, causing it to take so long that it eventually timed out? You may want to set a snap refresh window that avoids such times (or firewall off the store entirely to block refreshes, then unblock when you want to do them manually).
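
Setting a refresh window could look something like this. It is a sketch: the window value is only an example and should be chosen to avoid your own backup times, and `run`, `APPLY` and `set_refresh_window` are hypothetical helper names that print the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Sketch only. "run" and APPLY are hypothetical helpers, not part of snapd:
# by default each command is printed; set APPLY=1 to actually execute it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

set_refresh_window() {
    window="$1"                                   # example value below; pick your own
    run snap set system refresh.timer="$window"   # limit automatic refreshes to this window
    run snap refresh --time                       # show the timer plus last and next refresh
}

set_refresh_window "sat,04:00-06:00"
```

After applying it, the output of snap refresh --time should confirm when the next refresh is due.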

rkelleyrtp · July 6, 2019, 5:19am

It seems lxc is working again but none of the containers are registered.

root@Container-002:~# lxc list
+------+-------+------+------+------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+-------+------+------+------+-----------+


root@HJ-WP-Container-002:~# snap list
Name  Version                       Rev    Tracking  Publisher   Notes
core  16-2.40~pre1+git1384.7c34ee7  7353   edge      canonical✓  core
lxd   3.14                          10972  stable/…  canonical✓  -

rkelleyrtp · July 6, 2019, 5:21am

…and, none of the LXD profiles exist.

lxc profile list
+---------+---------+
|  NAME   | USED BY |
+---------+---------+
| default | 0       |
+---------+---------+

stgraber · July 6, 2019, 10:42pm

Hmm, that’s pretty odd. Do you have some kind of mount for your data on top of /var/snap/lxd or similar which would explain why none of it is visible right now?

What do you see in /var/snap/lxd/common/lxd/containers?
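
One way to answer the mount question is to list everything mounted at or below /var/snap/lxd. A minimal sketch: `mounts_under` is a hypothetical helper name, and it only sees mounts visible from the host's mount namespace.

```shell
#!/bin/sh
# Sketch only. mounts_under is a hypothetical helper, not an LXD or snapd tool.
# It prints "mountpoint fstype device" for every mount at or below a path.
# The optional second argument lets it read a saved copy instead of /proc/mounts.
mounts_under() {
    prefix="$1"
    mounts_file="${2:-/proc/mounts}"
    # /proc/mounts fields: device, mount point, filesystem type, options, ...
    awk -v p="$prefix" '$2 == p || index($2, p "/") == 1 { print $2, $3, $1 }' "$mounts_file"
}

# Usage on the affected host:
#   mounts_under /var/snap/lxd
#   ls -la /var/snap/lxd/common/lxd/containers
```

No output from mounts_under means nothing visible from the host is mounted over that tree.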