LXC cluster fault tolerance

You can’t get around the fact that every cluster system needs quorum, meaning a majority of members, and that works best with an odd number of members. Pretty sure it’s just maths.
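To put numbers on the quorum point, here is a minimal Python sketch. It assumes a plain majority rule, as majority-based consensus systems use in general. It is not LXD's actual code, and the function names `quorum_size` and `tolerated_failures` are made up for illustration.

```python
def quorum_size(voters: int) -> int:
    """Smallest strict majority of `voters` voting members."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voting member")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voting members can be lost while a majority remains."""
    return voters - quorum_size(voters)


if __name__ == "__main__":
    # Print the quorum size and failure tolerance for small clusters.
    for n in range(1, 8):
        print(f"{n} voters: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this rule, 3 and 4 voters both survive only one failure, and 5 and 6 both survive two, so an even count adds a machine without adding tolerance. That is why odd numbers are usually recommended.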

Sure, but just as you can cluster servers one by one, starting with one, there should be a way to uncluster them, individually or en masse. Whether you have 5 or 5000, you should be able to bring them in and out of a cluster, on purpose or by accident, without causing a major problem for the rest of the cluster. Let me give you a few of the many real-world examples that have happened to me with LXD. But first let me say that many of these issues have been resolved, and the help of many people here has been invaluable in getting problems solved.

  1. I added a server to the cluster; later this server was no longer needed and was shut down. Everyone forgot about it. When we upgraded and rebooted the servers, the cluster would not come up until, with the help of the ‘LXD Tech Gurus’, it was traced to the missing member, which was then manually deleted from the database. Half a day of downtime for the whole cluster.

  2. A power failure to the cabinet took all servers down at the same time. The database ended up all confused. It took restoring the database and help from the ‘LXD Tech Gurus’ to fix it manually. One day of downtime.

  3. Multiple issues in the past with the cluster getting stuck if one member went offline or an upgrade did not complete properly. Many days of production downtime waiting for a possible fix. Sometimes we erased the cluster and started from scratch. This has happened at least 4 times and has cost countless days of downtime.

We maintain a minimum of four members in the cluster so that I can upgrade one at a time and still keep 3 up at all times. Unfortunately, these are all on one physical server in a datacenter with dual power. And no matter how redundant it is, a power failure is still possible.

I am a strong believer that LXD would be a better product if it had a safe mode in which a server could be taken off the cluster and run locally without LXD. Right now, if containers are running, you can kill LXD and they keep running; you just can’t start or stop them. Something as simple as being able to start and stop containers manually would add a lot of confidence in being able to keep a production server running in case of a major failure of any kind.

To me, bulletproof and idiot-proof reliability is far more important than a few more features that we will probably not use. I dread getting that phone call saying the cluster is down, because you never know how long it is going to take to figure out what went wrong. And yes, a lot of this is history, but some of the basic causes, such as the lack of flexibility in adjusting the configuration, are still there.