Thanks a lot!
One thing I’m wondering about is container orchestration: if one of my physical nodes fails (or just some of its hardware, e.g. storage), what happens to the OS containers running on it? 1. Is it possible to manually (or via a script) move a container and resume it elsewhere from its last point of operation, or 2. can the container be moved and resumed automatically (although I’ve seen a post from 2019 indicating this isn’t possible)? If not, it seems to me that storage failover (provided by the Ceph integration) is the main HA component.
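For the manual/scripted case, I imagine something along these lines (untested sketch only, assuming an LXD cluster where the containers’ root disks live on a shared Ceph pool; node and container names are made up):

```bash
# Planned maintenance: push all instances off node1 onto other members,
# then bring them back afterwards
lxc cluster evacuate node1
lxc cluster restore node1

# Or move a single (stopped) container to another member by hand.
# With Ceph-backed storage only the instance definition should need to move,
# not the data itself.
lxc stop compute01
lxc move compute01 --target node2
lxc start compute01
```

My understanding is that this only covers the planned case; if the node dies uncleanly the cluster member presumably has to be dealt with first, which is part of what I’m asking about.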
OK I’ve read:
But how does that work in practice?
I would envisage creating lots of Linux containers (compute nodes) on each physical node. Is there a rule of thumb for sizing the resources of each container, and for how much to keep in reserve on each physical node?
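I was picturing something like this for pinning per-container resources so the host always keeps some headroom (the figures are just placeholders, not a recommendation):

```bash
# Cap each compute container; leave a couple of cores and some RAM free
# on every physical node for the host OS and the Ceph daemons
lxc config set compute01 limits.cpu 4
lxc config set compute01 limits.memory 8GiB
lxc config set compute01 limits.memory.enforce hard
```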
Finally, I would need some specific software on each compute node to do the calculations. One way I thought I could do this was to create an initial LXC/LXD container, install the software I need in it, then clone that container, give the clone a new name and network settings, and start it as a new machine (and repeat), roughly as sketched below. Or is there a better way to do this?
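In other words, something like this (just a sketch; the image alias, container names and package name are made up):

```bash
# Build a template container once and install the calculation software in it
lxc launch images:ubuntu/22.04 template
lxc exec template -- apt-get update
lxc exec template -- apt-get install -y my-calc-software   # placeholder package
lxc stop template

# Either clone it directly...
lxc copy template compute01
lxc start compute01

# ...or publish it as an image and launch new compute nodes from that
lxc publish template --alias compute-base
lxc launch compute-base compute02
```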
I was then thinking of using some kind of job scheduler (e.g. SLURM) to send jobs off to available containers in the cluster, but perhaps it’s better to spin up a new container for each job and do it that way (roughly like the loop below).
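The per-job version I have in mind would be roughly this (again just a sketch, assuming the compute-base image from above exists; this is not how SLURM itself would drive things):

```bash
#!/bin/bash
# Run one job inside a throwaway container, then clean it up
set -e
JOB_ID="$1"
JOB_CMD="$2"            # e.g. "/opt/calc/run-model input.dat" (placeholder command)
NAME="job-${JOB_ID}"

lxc launch compute-base "${NAME}"
lxc exec "${NAME}" -- bash -c "${JOB_CMD}"
lxc delete --force "${NAME}"
```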