LXD clustering: issues with container failover on node failure

LXD clustering lets you get a unified view of multiple LXD nodes, effectively turning them into one big LXD host.

The database is replicated and highly available, so restarting a node will not interrupt the LXD API: you can still list containers, reconfigure them, spawn new ones, and so on.

But anything stored directly on a node that has gone away cannot be reached until that node is back online. The remaining nodes don’t have a copy of the container’s data, so they can’t move it and restart it.

The exception is if you’re using Ceph as your storage backend. In that case, since your storage is accessed over the network and not tied to any one node, you can move a container from one node to another and restart it there even while the source node is offline.
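
As a rough sketch of what that manual failover could look like, the Python helper below builds the `lxc` commands you would run from one of the surviving nodes. The container name `c1` and target node `node2` are made-up placeholders, and the sketch assumes the container lives on a Ceph-backed storage pool and that the cluster already reports the failed node as offline. By default it only prints the commands; nothing here is automatic failover.

```python
import subprocess


def failover_commands(container, target_node):
    """Build the lxc invocations for a manual failover.

    Assumes the container's storage is on Ceph, so the target
    node can reach its data without the source node.
    """
    return [
        # Check which cluster members are online before moving anything.
        ["lxc", "cluster", "list"],
        # Relocate the container to a surviving cluster member.
        ["lxc", "move", container, "--target", target_node],
        # Start it again on its new node.
        ["lxc", "start", container],
    ]


def run_failover(container, target_node, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    for cmd in failover_commands(container, target_node):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "c1" and "node2" are hypothetical names; replace with your own.
    run_failover("c1", "node2")
```

Call `run_failover("c1", "node2", dry_run=False)` on a surviving node once you are happy with the printed commands. Without Ceph, the move step is expected to fail, because the container's data is still on the offline node.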