The problem of servers losing the cluster after a reboot has not been fixed

The latest versions of LXD are better, but if I shut down my whole cluster of 4 machines and reboot them, I still have to kick LXD a few times before they start talking to each other. This is an issue that has to be addressed.

The problem is related to a catch-22: a single server won't run the cluster software, so the nodes lock in place waiting for someone to be master, and none of them does anything. They just sit there. I believe that if one node came up and made itself cluster master, the others would attach themselves to it. Right now they get stuck on the way up because no one knows who goes first.

MAKE LXD WORK even if only one machine is up. This would also help in an emergency where the cluster is broken for some other reason.

Last time you reported this issue, we concluded that it was because the network interface you bind your LXD daemon to was not up when the LXD daemon was started.

Did you fix your boot sequence to address this problem?
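
If the ordering is the issue, one possible fix is to delay LXD until the network is actually up. This is only a sketch: it assumes the snap packaging (so the unit is snap.lxd.daemon) and that network-online.target really waits for the interface you bind to, which depends on whether systemd-networkd-wait-online or NetworkManager-wait-online is enabled on your system.

sudo systemctl edit snap.lxd.daemon

# contents of the drop-in:
[Unit]
Wants=network-online.target
After=network-online.target

After a reboot, LXD should then only start once the interface holding its cluster address is configured.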

Self-electing a single node is not an option at the moment: we want to be sure to have a consistent state in case of partitions, and the only way to achieve that is via a quorum (as per the CAP theorem). Our data model is currently not a good fit for eventual consistency or reconciliation/conflict-resolution strategies, and I don't think we'll work on this anytime soon.

Make sure your boot sequence is working, and if you want your cluster to remain available, don't reboot a majority of the cluster at the same time. From what we can tell, these are fair requirements that other distributed systems impose and that most users are okay with.

Note that rebooting a majority of nodes is still okay: the cluster will recover once a majority comes back online, with no manual intervention, as long as the boot sequence is correct and networking is set up before LXD starts.
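
Once a majority is back up, you can check the state from any member. The exact output varies by version, but something along these lines shows which members the cluster currently considers online:

lxc cluster list

Any member still marked offline after the others are up is the one worth investigating.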

I have tried waiting to start LXD until after the server fully boots, and the problem is still there. The problem is the first machine not getting past the first step.
The reason I have to bring down the whole cluster is that all my machines are multiple servers in one physical box. So if I have to power off that server, as I actually have to do tomorrow, I lose all cluster members at once. I could in the future set up some kind of remote cluster machine to keep the cluster alive while I power down this server, but that is just silly.
I don't think my problem is unique; rather, I think it is the source of many of the problems on this forum.
The inability of the cluster to recover from a confused node causes the whole cluster to crap out, leaving the system admin looking at a major disaster. When you have a hundred people calling you because a whole cluster is down over one machine, it is very critical.

As I said, if a majority of nodes is offline, the cluster won't be available. As long as a majority comes back online, you should be back in business. If that's not the case for you, please send us details as you have in the past and we'll fix things as usual.

As long as you start the other lxd nodes, this one should get back online.

They are all stuck… that is the problem. I can unstick them by repeatedly running pkill -9 -f "lxd --logfile" followed by systemctl start snap.lxd.daemon.

Eventually doing this gets it going. It's not very elegant, and worse, I'm afraid that sooner or later it won't work.
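
For reference, this is roughly the kick I end up doing on each stuck node (snap packaging, so the unit is snap.lxd.daemon; the last command is just to check whether the node rejoined):

sudo pkill -9 -f "lxd --logfile"
sudo systemctl start snap.lxd.daemon
lxc cluster list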

Are you sure that the network interface was up when those lxd processes started?

If yes, please post the logs of just the run in which they got stuck, preferably with debug turned on (assuming you can reproduce the situation).

Which logs are you looking for?

The ones in /var/snap/lxd/common/lxd/logs/lxd.log.
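
If you are on the snap, the debug level can usually be raised without editing anything by hand. This is a sketch, assuming your snap revision supports the daemon.debug option:

sudo snap set lxd daemon.debug=true
sudo systemctl reload snap.lxd.daemon
tail -f /var/snap/lxd/common/lxd/logs/lxd.log

Reproduce the stuck start with that running and attach the resulting lxd.log.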

Note that this exercise is just a waste of time unless you can confirm that the network interface was up when you started the lxd process and that binding the configured address worked fine.
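
A couple of quick checks right after boot can confirm both points; this assumes you kept the default 8443 port for the cluster address (adjust if you changed it):

ip addr show                                       # is the interface up with the address LXD binds to?
sudo ss -tlnp | grep 8443                          # is lxd actually listening on that address?
journalctl -u snap.lxd.daemon -b | grep -i bind    # any bind errors at startup?

If the address is missing or the bind failed, the boot ordering is still the problem.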

I haven't had time to go through the testing process, but here is something interesting that might be related.
On my production 4-unit cluster, the switch was unplugged for a couple of seconds. After it was powered back up, which took seconds, all units gave the same "can't find other servers" error. It would not recover within 30 minutes. However, running pkill -9 -f "lxd --logfile" somehow got LXD back into sync, very similar to what happens when it powers up. Yet the servers were never down. This interesting quirk shows that LXD communication is not recovering on its own. Client & server version: 3.22