My 5-node LXD 3.3 snap cluster was auto-refreshed to LXD 3.4 last night and everything stopped working.
I tried systemctl restart snap.lxd.daemon, but it didn’t help and the command itself stalled.
In the end I force-rebooted all the machines and got a few of them working.
Is there any way to prevent this, or to disable snap auto-refresh for LXD?
By design, you CANNOT disable snap auto refresh!
Both lxd and snap force refreshes. There is no reliable method to stop this process AFAIK.
You can set up snapd so it refreshes only once per month, but interesting things start to happen to your system when refreshes are hindered. Ask me how I know!!!
It may be possible to fork the software (both snapd AND lxd) to achieve this. That is far beyond my skill set, but it could be an interesting exercise for someone capable.
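For anyone wanting to try the once-a-month schedule mentioned above, here is a minimal sketch using snapd's refresh.timer setting. The function names and the SNAP override variable are made up for illustration, and the timer expression is only one example of a monthly window.

```shell
#!/bin/sh
# SNAP is a hypothetical override: set SNAP=echo to preview the commands
# instead of running them.
SNAP="${SNAP:-snap}"

# Restrict automatic refreshes to a window given as a snapd timer expression.
set_refresh_window() {
    "$SNAP" set system "refresh.timer=$1"
}

# Show the configured timer and when the next refresh is scheduled.
show_refresh_schedule() {
    "$SNAP" refresh --time
}

# Example (needs root): last Friday of the month, between 23:00 and 01:00.
# set_refresh_window 'fri5,23:00-01:00'
# show_refresh_schedule
```

This only narrows when refreshes happen; it does not stop them. More recent snapd releases also offer snap refresh --hold, which was not available when this thread was written.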
@Shantur_Rathore LXD clustering requires all nodes to run the same version. As soon as one node gets ahead of the rest, the cluster effectively starts holding incoming requests until it becomes consistent again. This is a bit frustrating with the snapd update schedule: the operator either has to manually run snap refresh lxd on the remaining nodes when this happens, or wait until all machines have auto-updated (LXD itself has an insane timeout to allow for that).
I believe that lxc cluster list should still work when that happens.
What’s the current state of your cluster? You said a few machines are working again, does that mean you still have nodes that are offline?
I did a “lxd --version” and “lxc --version” on all servers and they are all running lxd 3.4 and lxc 3.4
I also did a snap refresh lxd (for good measure), but as expected there are no updates available.
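The per-server version check described above could be scripted roughly like this. It is only a sketch: check_versions is a hypothetical helper, and the node names in the usage comment are placeholders, not the real hosts in this cluster.

```shell
#!/bin/sh
# Reads lines of the form "<node> <version>" on stdin, echoes them back,
# and reports whether every node is on the same version.
# Exit status: 0 if all versions match, 1 otherwise.
check_versions() {
    awk '
        NR == 1      { first = $2 }
        $2 != first  { mismatch = 1 }
                     { print $1 ": " $2 }
        END {
            if (mismatch) { print "MISMATCH"; exit 1 }
            print "consistent"
        }
    '
}

# Hypothetical usage (node1..node5 are placeholder host names):
# for n in node1 node2 node3 node4 node5; do
#     printf '%s %s\n' "$n" "$(ssh "$n" lxd --version)"
# done | check_versions
```

If the last line says MISMATCH, running snap refresh lxd on the nodes that lag behind is the manual fix mentioned earlier in the thread.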
Please run snap refresh lxd on all servers again as I’ve just pushed a new build which includes some clustered database fixes that may or may not be what you were running into.
With that done, if the cluster doesn’t come back to life, it’d be great if you could provide, for all your nodes:
ps fauxww
cat /var/snap/lxd/common/lxd/logs/lxd.log
journalctl -u snap.lxd.daemon -n 300
Feel free to e-mail those directly to me (stgraber at ubuntu dot com) if you feel anything in there shouldn’t be posted on a public forum.
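To make it easier to gather those three outputs on each node, something like the following could work. This is a sketch, not an official tool: the function name, the LXD_LOG variable and the output file names are assumptions, while the three commands are the ones listed above.

```shell
#!/bin/sh
# Path of the LXD log inside the snap, as given in the post above.
# LXD_LOG is a hypothetical override for nodes with a different layout.
LXD_LOG="${LXD_LOG:-/var/snap/lxd/common/lxd/logs/lxd.log}"

# Collects the requested diagnostics into a directory and prints its path.
# $1: output directory (defaults to lxd-diag-<hostname>)
collect_lxd_diagnostics() {
    out="${1:-lxd-diag-$(hostname)}"
    mkdir -p "$out" || return 1
    # Each command writes to its own file; errors are captured too, so a
    # missing log or unit still leaves a trace in the output.
    ps fauxww > "$out/ps.txt" 2>&1
    cat "$LXD_LOG" > "$out/lxd.log" 2>&1
    journalctl -u snap.lxd.daemon -n 300 > "$out/journal.txt" 2>&1
    echo "$out"
}

# Example (run as root on each node):
# collect_lxd_diagnostics
```

The resulting directory can then be reviewed for anything sensitive before it is posted or e-mailed.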
You have a leftover LXD process which is causing issues on this node at least.
Its PID is 5312, so I’d recommend you run kill -9 5312 and then systemctl reload snap.lxd.daemon.
For the other nodes, the output would still be useful to see what they’re stuck on.
The refresh on the other servers took much longer, but it finished too.
I ran “systemctl reload snap.lxd.daemon”.
lxc list still hangs.
I tried “systemctl restart snap.lxd.daemon”, but now all 4 of them are stuck on this restart.
In the other thread I said that for one cluster I solved it by killing and restarting lxd.
As far as I can see, the lxc containers are not affected.
How bad is this solution?
I can send you log files for a different server, if you still want to see them.
Stay away from systemctl restart as that will restart all your containers (if successful). systemctl reload snap.lxd.daemon is in general safe to do (will not affect running containers).
As your initial system was showing some conflicting daemons running, you could run:
pkill -9 -f "lxd --logfile"
on all of them. That will kill any LXD running there; LXD should then get auto-restarted with a single good copy on each node, at which point the cluster should come back online.
If you try that option, let me know how it goes as I’m preparing an update to the snap which will make it much more aggressive towards any leftover LXD process that’s found during startup (to try and resolve any such issue).
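Before reaching for pkill, it may help to see whether a node really has more than one daemon. The sketch below reuses the same "lxd --logfile" pattern with pgrep; the function names are hypothetical, and it assumes a healthy node shows exactly one match.

```shell
#!/bin/sh
# Counts processes whose full command line matches the pattern used with
# pkill in the post above. Run this from a script file or an interactive
# shell: under "sh -c '...'" the shell's own command line would match too.
count_lxd_daemons() {
    pgrep -f "lxd --logfile" | wc -l | tr -d ' '
}

# Prints a one-line summary; returns 1 when more than one daemon is found.
report_lxd_daemons() {
    n=$(count_lxd_daemons)
    if [ "$n" -gt 1 ]; then
        echo "$n LXD daemon processes found: leftover process likely"
        return 1
    fi
    echo "$n LXD daemon process(es) found"
}

# Example:
# report_lxd_daemons || echo "consider the pkill command above"
```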
Ok, I think we have a good idea of everything that went wrong in 3.4, and most of those issues are fixed already. I’m working on the last big bug; after that, the last thing will be a better way to upgrade whole clusters. I hope to have a solution for that in time for the 3.5 upgrade next month.