Scaling update?

@Stephane - any update on scaling since "Scale testing LXD clustering"?

I am particularly interested in a shallower setup, meaning more hosts (e.g., 100-1000) and fewer containers per host (1-5).

With fewer containers per host, I would expect:

  • more network traffic on the wire
  • lower spawn rate (longer time to spawn)

Is that your sense?

And, has the container spawn rate improved?

Any other comments would be welcome.

Thanks

We haven’t run the full test in a while, but we have been optimizing a lot of the cluster logic recently. One big piece of fixing the past scalability issues remains: the eventhub role that @tomp has been working on. We should have that in LXD 4.23 (mid-February).

Until then, the largest production cluster I’m running stands at 52 hosts and has been behaving pretty well.

With very large clusters, spawn time shouldn’t really increase as long as the machines themselves aren’t particularly busy. I’d expect the biggest issue to be updates, as LXD requires all machines to be on the exact same version. With hundreds of systems, it may take a while for all of them to detect that they need to pull an update and reload, causing significant API downtime (the instances themselves will be fine though).
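
To get a rough feel for that update window, here is a toy model in Python. It is only a sketch under stated assumptions, not how LXD schedules or measures anything: each host is assumed to notice the update at a uniformly random moment within some detection window and then take a fixed time to reload, and the API is treated as unavailable from the moment the first host finishes until the last one does. The function names and the `detect_window_s` and `reload_s` parameters are hypothetical and chosen for illustration.

```python
import random


def mismatch_window(update_times):
    """Seconds between the first and the last host finishing its update."""
    return max(update_times) - min(update_times)


def simulate(hosts, detect_window_s, reload_s, seed=0):
    """Estimate the version-mismatch window for a cluster of `hosts` machines.

    Assumption: each host notices the update at a uniformly random time in
    [0, detect_window_s] and then needs a constant `reload_s` to restart.
    """
    rng = random.Random(seed)
    finish_times = [rng.uniform(0, detect_window_s) + reload_s for _ in range(hosts)]
    return mismatch_window(finish_times)


if __name__ == "__main__":
    # Hypothetical numbers: a one-hour detection window and a 30 s reload.
    for n in (5, 50, 500):
        print(n, "hosts:", round(simulate(n, 3600, 30)), "s with mixed versions")
```

Under this model the window can never exceed `detect_window_s`, but with more hosts it tends to approach that bound, so shortening the detection window (for example by triggering the refresh on all hosts at once) matters more than the reload time itself.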