LXD Fault tolerance

derek423 · January 11, 2023, 4:07pm

I am looking at testing a MicroCloud/Ceph cluster. Will running my containers in a cluster with Ceph storage provide any fault tolerance? That is, if a host falls over, will the containers keep running on another member of the cluster? I don’t think I can rely on CRIU for live migration (these are systemd containers), so I’m not sure whether redundant storage will provide any redundancy for my use case.

Thank you

stgraber · January 11, 2023, 5:31pm

They won’t keep running but you also won’t lose any data.
They’ll go into ERROR state as the machine they were running on is now dead.

From there, you can do lxc move NAME --target OTHER-MACHINE and then lxc start NAME to get them back online.
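For anyone who wants to script that manual step, here is a minimal sketch rather than an official tool. It assumes that `lxc list --format csv -c ns` prints one `name,STATE` row per instance and that affected instances show `ERROR` in the state column. The function names (`errored_instances`, `recover_instance`, `recover_all`) and the member name `node2` are made up for illustration.

```shell
#!/bin/sh
# Sketch of the manual recovery described above.
# Function names are illustrative and not part of LXD.

# Print the names of instances whose state column reads ERROR.
# Assumes `lxc list --format csv -c ns` emits "name,STATE" rows.
errored_instances() {
    lxc list --format csv -c ns | awk -F, '$2 == "ERROR" { print $1 }'
}

# Move one instance to a healthy cluster member, then start it again.
# $1 = instance name, $2 = target cluster member
recover_instance() {
    lxc move "$1" --target "$2" && lxc start "$1"
}

# Recover every errored instance onto the given member.
# $1 = target cluster member
recover_all() {
    for name in $(errored_instances); do
        recover_instance "$name" "$1" || echo "failed to recover $name" >&2
    done
}
```

After sourcing the file, `recover_all node2` would try to bring every errored instance back up on `node2`. Sending everything to a single member is only the simplest choice; picking targets by free capacity is left out here.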

We’re working on automating that part so that you can configure a threshold after which LXD will consider a machine to be “dead enough” to move its workloads elsewhere.

In all cases, the instance will be restarted, since there’s no way to keep the CPU state itself in sync between multiple systems to allow for seamless recovery in this kind of situation.