High availability LXD cluster with shared storage

Hi,

I am exploring the use of LXD for a Linux cluster. I have 5 physical nodes, each with a SATA drive for the boot Linux OS (and LXD host), and each will have an additional 2-3 hard drives. I’d like to make the cluster as highly available as possible, with failover. What is the best way to configure this?

My idea was to create the 5 LXD nodes; each node would then run many more identical containerized (LXD) OS instances, with all the storage pooled (“JBOD”). I’ve seen Ceph mentioned in this respect, but I’m really not sure of the best approach.

Thanks.

Yeah, I think you’re looking at a 5-node LXD cluster; that gives good DB redundancy.

For storage, unless latency is a major concern of yours, I’d set up one Ceph OSD per drive in those systems, create one or more Ceph pools, and give that to LXD for storage.
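As a rough sketch of that (device names, the pool name “lxd” and the PG count are all illustrative; adjust to your hardware and cluster size):

```shell
# On each node: create one OSD per spare data drive
# (/dev/sdb and /dev/sdc are example device names)
ceph-volume lvm create --data /dev/sdb
ceph-volume lvm create --data /dev/sdc

# Once all OSDs are up and in, create a RADOS pool for LXD to consume
# (pool name "lxd" and 128 placement groups are illustrative choices)
ceph osd pool create lxd 128
```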

This should effectively allow you to (slowly) lose up to 3 of the 5 nodes, with the DB roles moving as needed (at least two must stay online), and the same for the data on Ceph, which would have 3 replicas spread across the cluster and be moved as nodes fail.

Note that if you suddenly (at the exact same time) lose the wrong two nodes, you will be offline until they recover. LXD can re-balance DB roles during a clean shutdown (maintenance), but if you somehow kill two out of the three active DB nodes, there’s no more quorum and things will hang until one of them is brought back up.

Similar story with Ceph: you’ll want at least 3 Ceph monitors, and I usually also run mgr/mds on the same nodes. Given a small cluster of 5, the easiest would likely be to run mon/mds/mgr on all 5 too. If the cluster gets larger, you may want to keep that to just a few of the nodes though.

Ceph can expose both block devices (RBD) and filesystems (CephFS). LXD supports both, but instances can only be backed by RBD. CephFS is very useful if you need shared data between instances running on different nodes though, so I usually set both up on my clusters.
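A minimal sketch of setting up both, assuming the RBD pool “lxd” from earlier already exists (the pool names, filesystem name “myfs” and storage pool names here are all illustrative):

```shell
# RBD-backed LXD storage pool for instance root disks
lxc storage create remote ceph ceph.osd.pool_name=lxd

# CephFS needs a data and a metadata pool, plus the filesystem itself
ceph osd pool create cephfs_data 64
ceph osd pool create cephfs_meta 64
ceph fs new myfs cephfs_meta cephfs_data

# Expose the filesystem to LXD for shared custom volumes
lxc storage create shared cephfs source=myfs
```

Custom volumes on the cephfs pool can then be attached to instances on different nodes at the same time, which is what makes it useful for shared data.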

Thanks.

So when you say:

setup one Ceph OSD per drive

Can I do this with the method outlined: https://ubuntu.com/blog/ceph-storage-driver-in-lxd ?

and then:

create one or more Ceph pools

do you mean on each node doing

lxc storage create

as per https://lxd.readthedocs.io/en/latest/clustering/#storage-pools .

Finally, could you elaborate a bit on:

Similar story with Ceph, you’ll want at least 3 Ceph monitors and I usually also run mgr/mds on the same nodes. Given a small cluster of 5, the easiest would likely be to run mon/mds/mgr on all 5 too. If the cluster gets larger you may want to keep that to just a few of the nodes though.

I don’t really understand the need for this, or how it would integrate with LXD.

You first need to set up a working Ceph cluster; that’s completely outside of LXD.
Once that works, the post you’re referring to is a good starting point for using it with LXD.
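On the lxc storage create question specifically: in a LXD cluster a pool is defined in two phases, first once per node with --target, then a final call without --target that makes it active cluster-wide. A sketch, assuming node names node1..node5 and the Ceph pool “lxd” (both illustrative):

```shell
# Phase 1: define the pool on every cluster member
lxc storage create remote ceph --target node1
lxc storage create remote ceph --target node2
# ... repeat for node3, node4, node5

# Phase 2: instantiate it cluster-wide, pointing at the Ceph OSD pool
lxc storage create remote ceph ceph.osd.pool_name=lxd
```

So you run the per-node commands on one machine (they just record per-node config), not a separate standalone create on each node.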

A Ceph cluster needs ceph-osd, ceph-mon, ceph-mgr and ceph-mds set up on the various nodes to function. Once that’s all done, ceph status and ceph osd status both show no errors, all disks are in the cluster, and things look otherwise healthy, you can integrate it with your LXD cluster.
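The health checks mentioned above are just:

```shell
# Overall health, monitor quorum, OSD in/up counts
ceph status

# Per-OSD state and usage
ceph osd status

# If anything is not HEALTH_OK, this explains each warning
ceph health detail
```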

So I suspect your next step is to find a suitable Ceph tutorial and deploy that before attacking the LXD side of things. ceph-deploy does make setting up a Ceph cluster reasonably easy these days.

Thanks a lot!

One thing I’m wondering about is container orchestration: if one of my physical nodes fails (or just some of its hardware, e.g. storage), what happens to the OS containers running on it?

1. Is it possible to manually (or via script) move a container and continue from its last point of operation, or
2. can the container be automatically moved and continued elsewhere? (Although I’ve seen a post from 2019 which indicates this isn’t possible.)

If not, it seems to me that the storage failover (provided by Ceph integration) is the main HA component.

OK I’ve read:

But how does that work in practice?

I would envisage creating lots of Linux containers (compute nodes) on each physical node. Is there a rule of thumb for provisioning resources for each container and how much resource to keep in reserve on each physical node?

Finally, I would need some specific software (on each compute node) to do some calculations. One way I thought I could do this was to create an initial LXD container, install the software I need in it, then clone that container, give it a new name/network settings, and start it as a new machine (and repeat). Or is there a better way to do this?
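The clone-and-repeat workflow described above could look like this (image alias, container names and the installed package are all illustrative):

```shell
# Build a template container once and install the software in it
lxc launch ubuntu:20.04 template
lxc exec template -- apt-get install -y my-solver   # "my-solver" is a placeholder package
lxc stop template

# Clone it for each compute node; copies get a fresh MAC and DHCP address
lxc copy template compute01
lxc start compute01
```

An alternative is lxc publish template --alias compute-base, which turns the template into a reusable image you can lxc launch from on any cluster node.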

I was then thinking of using some kind of job scheduler (e.g. SLURM) to send off jobs to available containers in the cluster, but perhaps it’s best to actually spin up a new container for each new job and do it that way.
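The container-per-job idea could be sketched like this, assuming an image alias “compute-base” published from a template, and with $JOB_ID and the job script path purely illustrative:

```shell
# Launch a short-lived container for one job, run it, then tear it down
lxc launch compute-base "job-$JOB_ID"
lxc exec "job-$JOB_ID" -- /opt/run-job.sh    # hypothetical job entry point
lxc delete --force "job-$JOB_ID"
```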
