LXC cluster fault tolerance

For proper fault tolerance with LXD, you’d normally want a setup with at least 3 servers so LXD’s own database and API can be fault tolerant (with 3, you can lose one node and still have quorum).
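To see why 3 is the practical minimum, here is a rough sketch of the arithmetic, assuming the usual majority-quorum rule (more than half of the voting members must be reachable). The function names are just for illustration and are not part of LXD.

```python
def quorum(voters: int) -> int:
    """Smallest majority of `voters` voting members."""
    if voters < 1:
        raise ValueError("need at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can go offline while a majority remains."""
    return voters - quorum(voters)


# Compare a few cluster sizes.
for n in (1, 2, 3, 5):
    print(f"{n} voters: quorum {quorum(n)}, can lose {tolerated_failures(n)}")
```

Under that rule, 2 servers are no better than 1 (losing either one loses quorum), while 3 servers can lose one.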

Obviously that alone doesn’t give you access to the data of the instances on the now dead node. For that, you’d usually want something like Ceph for storage. It can similarly use 3 nodes or more to provide fault tolerant storage. Data is replicated a number of times across the Ceph nodes and every LXD node has access to that storage cluster.

In such a setup, when you lose a system, its containers will be marked as being in the UNKNOWN state. You can then use lxc move to relocate them onto a system that’s online and start them back up. As no data is actually being moved (it all lives on the shared storage), it’s very quick and allows for easy handling of both rolling restarts and outages.
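As a rough sketch of that recovery step, the snippet below only builds and prints the lxc commands you would run for each affected container. The container names (c1, c2) and the target member (node2) are made up for the example, and it assumes the instances live on shared storage such as Ceph as described above.

```python
import shlex


def recovery_commands(instances, target):
    """Build the lxc commands to relocate and restart each instance.

    `instances` and `target` are placeholders supplied by the caller;
    nothing is executed here, the commands are only returned.
    """
    commands = []
    for name in instances:
        # Re-home the instance on a cluster member that is still online.
        commands.append(["lxc", "move", name, "--target", target])
        # Then start it back up on its new member.
        commands.append(["lxc", "start", name])
    return commands


# Hypothetical names: c1 and c2 were on the dead node, node2 is online.
for argv in recovery_commands(["c1", "c2"], "node2"):
    print(shlex.join(argv))
```

Printing first gives you a chance to review the list. If it looks right, each command can be run by hand or passed to subprocess.run(argv, check=True).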