Snap auto refresh kills cluster

Hi,

My 5-node LXD 3.3 snap cluster got auto-refreshed last night to LXD 3.4 and it all stopped working.
I tried systemctl restart snap.lxd.daemon, but it didn’t help and the command itself stalled as well.
In the end I force-rebooted all machines and got a few of them working.

Is there any way to make sure this doesn’t happen or disable snap auto refresh for lxd?

Thanks

By design, you CANNOT disable snap auto refresh!
Both lxd and snap force refreshes. There is no reliable method to stop this process AFAIK.

You can set up snapd so it refreshes/updates only once per month but interesting things start to happen to your system when those are hindered. Ask me how I know!!!
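For reference, a minimal sketch of the once-a-month setup mentioned above, using snapd’s refresh.timer option. The fri5 window (last Friday of the month, 23:00 to 01:00) is only an example value, and the SNAP variable and function names are made up here so the commands can be dry-run with SNAP=echo. On a real system they need root.

```shell
#!/bin/sh
# SNAP is the snap binary to call; set SNAP=echo for a dry run (assumption for illustration).
SNAP="${SNAP:-snap}"

# Restrict automatic refreshes to one window per month.
# fri5 = last Friday of the month; 23:00-01:00 = allowed time range.
set_monthly_refresh() {
    $SNAP set system refresh.timer=fri5,23:00-01:00
}

# Show the active timer plus the last and next scheduled refresh.
show_refresh_schedule() {
    $SNAP refresh --time
}
```

This only narrows when snapd refreshes; it does not turn refreshes off, and on a cluster all nodes would still need to end up on the same LXD version afterwards.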

It may be possible to fork the software (you would need to fork both snapd AND lxd) to achieve this. That is far beyond my skill set, but it could be an interesting exercise for someone capable.

This is incorrect: there is no logic whatsoever in LXD that forces auto-refresh. It’s purely a snapd behavior.

@Shantur_Rathore LXD clustering requires all nodes to run the same version. As soon as one node gets ahead of the rest of the cluster, the cluster effectively starts holding incoming requests until it becomes consistent again. This is a bit frustrating with the snapd update schedule and effectively requires the operator to manually run snap refresh lxd when this happens, or wait until all machines have auto-updated (LXD itself has an insane timeout to allow for that).
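As a rough sketch of the manual route described above, something like the following could push the refresh to every node in one go. The node names, the NODES and REMOTE variables, and the refresh_all function are placeholders, and it assumes ssh access with sudo rights on each machine.

```shell
#!/bin/sh
# Hypothetical node names; replace with your own cluster members.
NODES="${NODES:-node1 node2 node3 node4 node5}"
# Command used to reach each node; set REMOTE=echo to preview without running anything.
REMOTE="${REMOTE:-ssh}"

# Run "snap refresh lxd" on every node, one after the other,
# and report (but do not stop on) nodes where the refresh fails.
refresh_all() {
    for node in $NODES; do
        echo "Refreshing lxd on $node"
        $REMOTE "$node" sudo snap refresh lxd || echo "refresh failed on $node" >&2
    done
}
```

Calling refresh_all then brings the nodes to the same snap revision in sequence. Afterwards, lxc cluster list is a reasonable way to check whether the cluster is consistent again.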

I believe that lxc cluster list should still work when that happens.

What’s the current state of your cluster? You said a few machines are working again, does that mean you still have nodes that are offline?

Hi,
I believe I’m having the same problem, described here: “None of the lxc commands working after apt upgrade on ubuntu bionic”

I did a “lxd --version” and “lxc --version” on all servers and they are all running lxd 3.4 and lxc 3.4
I also did a snap refresh lxd (for good measure), but as expected there are no updates available.

Is there anything else I can try?

Hi,

Please run snap refresh lxd on all servers again as I’ve just pushed a new build which includes some clustered database fixes that may or may not be what you were running into.

With that done, if the cluster doesn’t come back to life, it’d be great if you could provide, for all your nodes:

  • ps fauxww
  • cat /var/snap/lxd/common/lxd/logs/lxd.log
  • journalctl -u snap.lxd.daemon -n 300

Feel free to e-mail those directly to me (stgraber at ubuntu dot com) if you feel anything in there shouldn’t be posted on a public forum.
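To save some typing, the three commands listed above could be wrapped up roughly like this and run on each node. The function name and the output file names are invented for this sketch; the commands and the log path are the ones requested above.

```shell
#!/bin/sh
# Collect the requested diagnostics into one directory per node.
# $1 = output directory (defaults to lxd-diag-<hostname>).
collect_lxd_diagnostics() {
    out="${1:-lxd-diag-$(hostname)}"
    mkdir -p "$out" || return 1
    # Each command is allowed to fail so the remaining output is still collected.
    ps fauxww > "$out/ps.txt" 2>&1 || true
    cat /var/snap/lxd/common/lxd/logs/lxd.log > "$out/lxd.log" 2>&1 || true
    journalctl -u snap.lxd.daemon -n 300 > "$out/journal.txt" 2>&1 || true
    # Print the directory so the caller knows where the files went.
    echo "$out"
}
```

Run as root, collect_lxd_diagnostics leaves three text files in the directory it prints, which can then be reviewed and attached to an e-mail.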

One server did the refresh; the other 3 are stuck here:

$ snap refresh lxd
Stop snap "lxd" services

I’ve emailed you the files for the server that refreshed correctly, since apparently I can only upload images here.
Last test from the terminal returns:

$ lxc list
Error: Failed to fetch http://unix.socket/1.0: 500 Internal Server Error

Hi,

You have a leftover LXD process which is causing issues on this node at least.
It’s got PID 5312, I’d recommend you run kill -9 5312 and then run systemctl reload snap.lxd.daemon.

For the other nodes, the output would still be useful to see what they’re stuck on.

The refresh on the other servers took much longer, but it finished too.
I ran “systemctl reload snap.lxd.daemon”.
lxc list still hangs.
I tried “systemctl restart snap.lxd.daemon”, but now all 4 are stuck on this restart.

In the other thread I said that for one cluster I solved it by killing and restarting lxd.
And as far as I can see the lxc containers are not affected.
How bad is this solution?

I can send you log files for a different server, if you still want to see them.

Stay away from systemctl restart as that will restart all your containers (if successful).
systemctl reload snap.lxd.daemon is in general safe to do (will not affect running containers).

As your initial system was showing some conflicting daemons running, you could run:

pkill -9 -f "lxd --logfile"

Run that on all of them; it will kill any LXD running there. LXD should then get auto-restarted with a single good copy on each node, at which point the cluster should come back online.
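Before reaching for pkill -9, it can be worth checking how many daemons are actually running. Here is a small sketch of that check; lxd_daemon_pids is a made-up helper name, and it matches the same "lxd --logfile" string that the pkill command above uses.

```shell
#!/bin/sh
# Read "PID ARGS" lines on stdin and print the PID of every line whose
# command line contains "lxd --logfile".
# The [ ] in the pattern keeps awk's own command line from matching itself.
lxd_daemon_pids() {
    awk '/lxd[ ]--logfile/ { print $1 }'
}

# Typical use (more than one PID printed means leftover daemons):
#   ps -eo pid=,args= | lxd_daemon_pids
```

If this prints more than one PID on a node, that node is in the conflicting-daemons state described above.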

If you try that option, let me know how it goes as I’m preparing an update to the snap which will make it much more aggressive towards any leftover LXD process that’s found during startup (to try and resolve any such issue).

I killed systemctl restart, it was stuck anyway.
I wasn’t too worried about containers.

I did the pkill command on all servers and now they work.

Excellent, so my planned fix should take care of any such issues, that’s good to know.

Thanks for the support

Thanks for replying and solving this issue.
For the machines which didn’t come up, I made sure the correct version was installed and rebooted them again.

Ok, I think we have a good idea of everything that went wrong in 3.4 and most of those are fixed already. I’m working on the last big bug, then the last thing will be to get a better way to do upgrades of whole clusters, I hope to have a solution for that in time for the 3.5 upgrade next month.

Until then, adding
127.0.0.1 api.snapcraft.io
to /etc/hosts will do the job.
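A minimal sketch of that workaround, written so it can be applied and undone without duplicating the line. The function names and the SNAP_STORE_ENTRY variable are made up for illustration, and the hosts file path is a parameter so the functions can be tried on a copy first.

```shell
#!/bin/sh
# The hosts entry suggested above: points the snap store API at localhost.
SNAP_STORE_ENTRY="127.0.0.1 api.snapcraft.io"

# Add the entry to the hosts file ($1, default /etc/hosts) unless it is already there.
block_snap_store() {
    hosts="${1:-/etc/hosts}"
    grep -qxF "$SNAP_STORE_ENTRY" "$hosts" 2>/dev/null || echo "$SNAP_STORE_ENTRY" >> "$hosts"
}

# Remove the entry again, leaving every other line untouched.
unblock_snap_store() {
    hosts="${1:-/etc/hosts}"
    tmp=$(mktemp) || return 1
    # grep exits non-zero when nothing is left to print; that is not an error here.
    grep -vxF "$SNAP_STORE_ENTRY" "$hosts" > "$tmp" || true
    cat "$tmp" > "$hosts"
    rm -f "$tmp"
}
```

While the entry is in place snapd cannot reach the store, so no snap on that machine gets refreshed, not just lxd. It needs to be removed again before a manual snap refresh lxd will work.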

Note that you’ll want to do an update now to get the database fixes we’ve just pushed to stable.

Thanks for letting us know