I’m working on moving Ubuntu’s autopkgtest installation over to LXD clustering for its armhf nodes, rather than manually managing individual remotes. The current setup looks like this:
A service running in a “production” cloud runs a number of instances of a program that receives requests and starts LXD instances to run them. Each of these worker processes is configured to send its jobs to a particular remote. The remotes run in a separate “workload” cloud. When a remote breaks, the admins manually delete it, provision a new one, and update the controller to point at the new remote instead of the old one.
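In lxc terms, the current per-node wiring looks roughly like this (the remote name, address, and instance name are made up for illustration; this needs a live LXD daemon to actually run):

```shell
# Register one armhf worker as its own remote (hypothetical name/address)
lxc remote add armhf-worker-01 10.0.0.11 --accept-certificate

# The controller then launches test instances against that specific remote
lxc launch ubuntu:jammy armhf-worker-01:autopkgtest-job
```

The pain point is that the remote name is bound to exactly one machine, so every replacement means re-running the `lxc remote` dance on the controller.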
In the new world, instead of the controller knowing about each backend instance, it would know about an LXD cluster - as a single remote - and recovering from the failure of a node would be a matter of deleting the old member and running a script to deploy a new one into the cluster.
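A sketch of what that recovery script could look like, assuming the current cluster join-token workflow (member names, addresses, and the token placeholder are hypothetical; this runs against a real cluster, not standalone):

```shell
# On any healthy cluster member: drop the dead node from the cluster
lxc cluster remove armhf-node-03 --force

# Issue a join token for the replacement member
lxc cluster add armhf-node-07

# On the freshly provisioned machine: join using that token via preseed
cat <<EOF | lxd init --preseed
cluster:
  enabled: true
  server_address: 10.0.0.17:8443
  cluster_token: <token printed by 'lxc cluster add'>
EOF
```

Crucially, none of this touches the controller - which is exactly why the remaining single point of failure below is so irritating.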
This feels nice, except for the “know about an LXD cluster” part. That’s a single point of failure: I have to tell the system the address of one of the cluster’s nodes, and if that node goes down I need to reconfigure the remote to point at a different member’s IP.
I suppose there are a couple of things I could do to mitigate this:
- Use a DNS name, and round robin amongst all the nodes (with health checking?)
- Set up haproxy in front of the cluster and talk to that instead (would this work?)
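For the haproxy option, I believe a minimal TCP pass-through along these lines would work - LXD authenticates clients with TLS certificates, so the proxy must not terminate TLS (addresses hypothetical):

```
frontend lxd_in
    bind *:8443
    mode tcp
    default_backend lxd_cluster

backend lxd_cluster
    mode tcp
    balance roundrobin
    option tcp-check
    server node1 10.0.0.11:8443 check
    server node2 10.0.0.12:8443 check
    server node3 10.0.0.13:8443 check
```

But note the `server` lines themselves are a hand-maintained copy of the member list, which is the synchronisation problem all over again.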
but they are both kind of annoying, since I have to write glue code and keep this external thing in sync with the live list of cluster members. So the question is whether this is something LXD could support directly. That is - when a remote is part of a cluster, regularly fetch the addresses of the other cluster members, and fail over to one of them if the member we’re talking to stops responding.
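To make the request concrete, here is a tiny sketch of the client-side behaviour I mean: walk the known member addresses and use the first one that is alive. The `probe` function is a stub so the sketch runs anywhere; a real client would do something like a short-timeout HTTPS request against each address, and refresh the address list from the live member it last talked to.

```shell
#!/bin/sh
# probe ADDR: stand-in liveness check; in reality this could be e.g.
#   curl -sk -m 2 "https://$1/" >/dev/null
# Here it just consults the ALIVE variable so the sketch is runnable.
probe() {
    case " $ALIVE " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# pick_member ADDR...: print the first address whose probe succeeds,
# i.e. fail over past dead members; fail if the whole cluster is down.
pick_member() {
    for addr in "$@"; do
        if probe "$addr"; then
            printf '%s\n' "$addr"
            return 0
        fi
    done
    echo "no cluster member reachable" >&2
    return 1
}

# Pretend the first member has died:
ALIVE="10.0.0.12:8443 10.0.0.13:8443"
pick_member 10.0.0.11:8443 10.0.0.12:8443 10.0.0.13:8443
# prints: 10.0.0.12:8443
```

Done inside LXD itself, the remote config would only ever need one bootstrap address, and the member list would stay correct by construction.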