LXD clustering - instance recovery with Ceph

I have been researching LXD clustering recently and came across this post. In the comments it was stated that, by using a Ceph storage pool, we could restart the containers from a broken node on the functioning nodes within the cluster.

My question is about how the state of the container is maintained. Say, for example, the container was writing to a database when the node crashed. Would restarting the container on another node leave it in a broken state? How are the states of the instances maintained across all nodes in a cluster?

The running state isn’t maintained and you will lose that.
Most filesystems use proper journaling, and most workloads should use filesystem semantics to avoid corruption on an abrupt stop, but you may indeed lose the last few write operations.

Once a node is dead, you can move the now STOPPED/OFFLINE instance to another node with lxc move NAME --target NEW-TARGET and then start it again with lxc start NAME. It will boot up like a normal boot.
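
For reference, the two steps above can be wrapped in a small shell function. This is only a sketch: the function name recover_instance and the LXC override variable are made up for illustration, and the instance and member names you pass in are your own.

```shell
#!/bin/sh
# Sketch only: recover_instance and the LXC variable are illustrative helpers,
# not part of LXD. LXC defaults to the real client; override it for a dry run.
LXC="${LXC:-lxc}"

recover_instance() {
    name="$1"     # instance that was running on the dead member
    target="$2"   # healthy cluster member to move it to

    # Re-home the stopped instance on a healthy member (its data lives in Ceph).
    $LXC move "$name" --target "$target" || return 1

    # Boot it again; this is a normal cold boot, not a resume.
    $LXC start "$name"
}
```

Calling recover_instance NAME NEW-TARGET runs the same two commands in order and stops if the move fails. Setting LXC=echo first prints the commands instead of running them, which is handy for checking the arguments.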

Thank you for the clarification!

Just to be sure, the lxc move command from a dead node only works with the Ceph storage driver, correct?

Also, how is a container’s data persisted across all nodes in the cluster?

Correct, this is only possible when the data is stored in Ceph. If it’s not, the source node must be online for the data to be accessible.

Ceph itself takes care of replicating the data. It can either run on the same nodes as your compute or be a separate cluster. As long as Ceph itself doesn’t go offline because too many nodes or drives are dead, your data is accessible from any LXD node.
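
If you want to check that side yourself, something like the following sketch can be run on a machine with Ceph admin access. The check_ceph function and the CEPH and POOL variables are illustrative, and the pool name "lxd" is only a placeholder assumption; substitute the OSD pool your LXD storage pool actually uses.

```shell
#!/bin/sh
# Sketch only: check_ceph, CEPH and POOL are illustrative names.
# POOL is a placeholder; use the OSD pool backing your LXD storage pool.
CEPH="${CEPH:-ceph}"
POOL="${POOL:-lxd}"

check_ceph() {
    # Overall cluster health (HEALTH_OK, HEALTH_WARN or HEALTH_ERR).
    $CEPH health || return 1

    # Number of replicas Ceph keeps for objects in this pool.
    $CEPH osd pool get "$POOL" size
}
```

The replica count gives a rough idea of how many copies of the data exist, which is what lets a surviving LXD node still read an instance's volume after another node dies.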

Is it possible to create a storage pool using Ceph inside a VM? I have tried initializing a 3-node cluster with 3 LXD Ubuntu 20.04 VMs for testing purposes. However, an error similar to the one reported here occurs whether I try to create the pool during the bootstrap node initialization or after the cluster has been created with lxc storage create ceph ceph.

Yep, Ceph works fine inside VMs. Our test cluster for development is running in 3 LXD VMs.
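
One thing that may be worth checking, although the thread does not confirm it is the cause of the error: on an existing cluster, a storage pool is normally defined once per member with --target and then created cluster-wide with a final command without --target. A rough sketch, where create_ceph_pool, the LXC override variable and the member names are all hypothetical:

```shell
#!/bin/sh
# Sketch only: create_ceph_pool and the LXC variable are illustrative helpers.
# Assumes a working Ceph cluster is already reachable from every member.
LXC="${LXC:-lxc}"

create_ceph_pool() {
    pool="$1"
    shift   # remaining arguments are cluster member names

    # Step 1: register the pool as pending on each member.
    for member in "$@"; do
        $LXC storage create "$pool" ceph --target "$member" || return 1
    done

    # Step 2: instantiate the pool across the whole cluster.
    $LXC storage create "$pool" ceph
}
```

For example, create_ceph_pool ceph node1 node2 node3 would issue three per-member commands followed by the cluster-wide one. Running it with LXC=echo prints the commands without touching the cluster.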

Ok, thank you, I will try to troubleshoot it then.