LXC cluster fault tolerance

I am exploring to use LXC clusters for a research project where I have 2 nodes and in terms of reliability one of them faces some power outages.

Now while launching a cluster using LXC, I observed that my master is able to confirm that one of the nodes went down. But from the fault tolerance perspective I was exploring a way to bring the disconnected container back online on some other backup node.

Could there be a way to make it fault tolerant this way that when one container goes down then it gets automatically moved to another node? I believe thatā€™d be the actual fault tolerance. So that my processing doesnā€™t stop and thereā€™s a failover to recover in case of a downtime.

The data movement wonā€™t be an issue if I am using an NFS server as central storage.

Hi!

LXD clusters require three nodes or more. Do you do protocol research on corner cases that require just two nodes?

Hi Simos,

I am happy to go with 3 or even more nodes. I can bring my local workstation to use for the 3rd node. However, my major concern was related to outage that I face regularly on one of them.

I do not know whether the outage is caused by an issue with the specific LXD installation, or is a consequence of having just two nodes. In addition, you mention that you use NFS, which I believe means that you store your storage pools on NFS (as loop files). This might impact performance and introduce reliability issues (because there are added layers of loop files over network storage).

Personally, I would opt for a 3-or-more-nodes LXD cluster, with the storage pool on Ceph (also network storage but block-based). If you cannot arrange for Ceph, then select ZFS if on Ubuntu, or btrfs on other Linux distributions.

Thereā€™s a miscommunication here I think. Just to clarify, outage is unrelated to lxd installation. Outages are happening due to power or network downtimes on some days. But when that downtime occurs then my only option is to wait for the recovery to happen. What I was exploring was a way to make the cluster fault tolerant in a way that the node that went down gets automatically replaced by another backup node and the cluster launches a replica container on that backup node so that no time is wasted in my process and the cluster in a way manages fault automatically.

NFS is there just because I needed a way for the cluster to access data on different nodes and also in case of a downtime my best bet would be to fetch data from NFS on the backup node instead of moving data from one node to another. But if thereā€™s a better way then Iā€™d definitely be interested in that.

Replacing a unhealthy container would involve the creation of a new container itself and also retrieval of the data that was present in it.

For proper fault tolerance with LXD, youā€™d normally want a setup with 3 servers so LXDā€™s own database and API can be fault tolerant (you can lose one node and still have quorum).

Obviously that doesnā€™t do you much good as far as access to the data of the instances on the now dead node. For that, youā€™d usually want something like Ceph for storage. It can similarly use 3 nodes or more to provide fault tolerant storage. Data is replicated a number of times and all nodes have access to that cluster.

In such a setup, when you lose a system, the containers will be marked as UNKNOWN state, you can then use lxc move to relocate them onto a system thatā€™s online and start them back up. As no data is actually being moved, itā€™s very quick and allows for easy handling of both rolling restarts and outages.

The newest versions of LXD are much much better than previous versions at handling a improper shutdown. I was able to upgrade and reboot my servers yesterday, one at a time with confidence. But I dare not shut them down all at one time or have a power failure type event. The LXD developers keep insisting that they need three in a Cluster. While this is good in theory, when you have thousands of customers calling you because they canā€™t access their sites. I donā€™t give a crap what is best, I want even one LXD to run even if it doesnā€™t synch with the other. A kind of Safe mode. I keep pushing for that.

To me if you can start with one and then add a second, and then add a third, you should be able to start with 4, take 2 down and the remaining 2 should work. I should also be able to move a server from one cluster to another just like you can move a container from one server to another. Otherwise LXD, while a fabulous product, and one that has improve tremendously, is not fault-tolerant, or even real world admin friendly.

For example, when I shutdown a server in a cluster of 6 nodes, I should be able to select the next master node, not just let it do itself. May be there is a way. And I believe there should be a MASTER-MASTER node that no matter what is always the chief of the quorum. And this one can run by itself or with a second.

I do have to congratulate the team of developers here. The latest changes have made at least upgrades and reboot much less of a problem, and almost fool proof.

I donā€™t see us changing this logic any time soon. Itā€™s at the core of all modern fault tolerant system using consensus algorithms.

This kind of systems completely prevent data corruption and impossible merges in the event of a network partition by always requiring a majority of the voter systems to agree to any transaction. This allows for continued operation of the cluster when a node drops off while also ensuring that the node which lost access to the cluster will not keep performing any mutating database access.

When the node regains access to the cluster, no data merging is needed as it couldnā€™t perform any writes in the first place, it instead simply retrieves the transactions that happened since it last communicated with other cluster nodes and moves on.

Iā€™m not sure what you are referring to as far as users losing access to their containers.
LXD never shuts down the instances during updates or when losing access to the rest of the cluster. You wonā€™t be able to use the LXD API to perform actions on a system which has lost access to the rest of the cluster, but the instances that were running on it will still be running.

As we start operating more and more production clusters for our own use and for our customers, weā€™ve been quite busy fixing a variety of issues. These days those are getting more and more niche which is a good sign that by and large things are working.
For the past 4 or 5 LXD releases, all 5 clusters Iā€™m running (ranging from 3 to 24 nodes) have been self-upgrading without a hitch and our daily cluster upgrade tests have been reflecting that too: https://jenkins.linuxcontainers.org/job/lxd-test-cluster/

We have some planned work to make manual reconfiguration of the raft cluster a bit easier, looking at something similar to editing the monmap in Ceph. That should save the LXD team a bit of support time as for the most broken cases today, we need the user to ship us a copy of the database so we can manually edit itā€¦

Weā€™ve also recently had @mbordere join the team who will be taking over maintenance of libraft/libdqlite/go-dqlite with a focus on quality, testing and performance. Thatā€™s the clustering/database stack used by LXD, Anbox and microk8s.

4 Likes

Sorry for the delay in getting back and Thanks for sharing a detailed response on this. I am starting with ceph as an option as I have never tried this before. Although I while exploring ceph recently, I came to know that it needs around 10GB Ethernet connection between nodes for decent performance. I believe thatā€™d be quite a challenge in my setup as my nodes are not connected to each other.

Iā€™ll get back on this thread post my test results.

You canā€™t get around the fact that all cluster systems have a need for Quorum and that needs an odd number, pretty sure its just maths etc.

However, an idea would be to have something in LxD similar to Proxmox-VE

They have a lightweight service you can install on a server called ā€œQ-nodeā€, its basically just a quorum node that takes a copy of the corosync db, that you would install on some super lightweight instance, possibly in the cloud or even a raspberry pi without having to install full blown proxmox-VE install.

In that case you join the Q-node to the cluster and it acts as the arbiter for any times there are downed nodes in the cluster, thus you can have 2 full-fat servers and this Q-node vs having to shell out for 3 full fat servers.

Just an idea anyway, maybe its possible to do something similar at the moment but Iā€™ve not delved into LXD clustering much.

Cheers :slight_smile:

Jon.

You canā€™t get around the fact that all cluster systems have a need for Quorum and that needs an odd number, pretty sure its just maths etc.

Sure you should be able to, just like you can cluster servers one by one, starting with one, there should be a way to uncluster them. Individually or in Mass. If you have 5 or 5000, you should be able to bring them in and out of a cluster whether on purpose or by accident without it causing a major problem to the rest of the Cluster. Let me tell you some of many real world examples that have happen to me with LXD. But first let me say, many of these issues have been resolved and the help of many here has invaluable at getting problem solved.

  1. I added a server to cluster and then this server was no longer needed and shutdown. Everyone forgot about it . When we upgraded server and reboot servers the cluster would not come up until with help of the ā€˜LXD Tech Gurusā€™ they figured out it was about a missing member and manually deleted from database. 1/2 day downtime for whole cluster

  2. Power failure to cabinet caused all servers to go down at one time. Database was all confused. It took restoring database and help from ā€˜LXD Tech Gurusā€™ to manually fix database. 1 day downtime.

  3. Multiple issues in the past with cluster getting stock if one member can offline or upgrade did not complete properly. Many days of production down time waiting for a possible fix. Some times we erase cluster and start from scratch. Has happen at least 4 times and have countless days of down time.

We maintain four members in cluster minimum so that I can upgrade one at a time and still maintain 3 at all time. Unfortunately, these are all on one physical server in a datacenter with dual power. And not matter how redundant it is a failure of power is possible.

I am stronger believer that LXD would be a better product if it had a Safe mode in which a server could be taken off the cluster and run locally without LXD. Right now if containers are running, you can kill LXD and they still run, you just canā€™t start or stop them. Something as simple as being able to start and stop container manually would add a lot of confidence to being able to keep a production server running in case of a major failure of any kind.

To me bullet proof and idiot proof reliability is far important than a few more features that will probably not use. I dread having that phone that the cluster is down, because you never know how long it is going to take to figure out what went wrong. And yes a lot of this is history, but some of the basic causes, the lack of flexibility in adjusting configuration is still there.