Scale testing LXD clustering

stgraber · April 9, 2019, 10:52pm

Introduction

Most LXD clusters we’re seeing right now have been under 10 members.

At that scale, we’re not seeing any real problem and we have seen such clusters host over 10000 containers just fine, keeping network chatter to an acceptable level and not getting too slow with that kind of load.

But to get a better sense of potential bottlenecks in LXD when it comes to growing the cluster, we want to test a much larger cluster, ideally one of up to 100 members with 100 containers on each for a total of 10000 containers.

Setup

For this scale testing, I’m using Google Compute instances, each with 4 vCPUs and 16GB of RAM. I’m attaching a local NVME drive to each which will run ZFS for local storage. This cluster doesn’t have clustered storage enabled.

For networking, I’m using the normal instance subnet for cluster communications, API clients and container networking. The containers are running on a Fan bridge, giving each cluster member capacity for up to 254 containers.

I manually installed the first cluster member, then manually installed the second, got the YAML output from lxd init and then scripted spawning more cluster members.

Part 1 (50 members)

The first try was to start 50 instances and get them all to cluster, which they did, despite having them all join at the exact same time:

+-------+----------------------------+----------+---------+-------------------------------------+
| NAME  |            URL             | DATABASE |  STATE  |               MESSAGE               |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd00 | https://10.138.0.61:8443   | YES      | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd01 | https://10.138.0.62:8443   | YES      | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd02 | https://10.138.0.63:8443   | YES      | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd03 | https://10.138.15.192:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd04 | https://10.138.15.193:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd05 | https://10.138.15.218:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd06 | https://10.138.15.202:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd07 | https://10.138.15.212:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd08 | https://10.138.15.210:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd09 | https://10.138.15.206:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd10 | https://10.138.15.198:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd11 | https://10.138.15.207:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd12 | https://10.138.15.209:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd13 | https://10.138.15.227:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd14 | https://10.138.15.199:8443 | NO       | ONLINE  | fully operational                   |+-------+----------------------------+----------+---------+-------------------------------------+
| lxd15 | https://10.138.15.195:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd16 | https://10.138.15.230:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd17 | https://10.138.15.234:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd18 | https://10.138.15.221:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd19 | https://10.138.15.232:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd20 | https://10.138.15.204:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd21 | https://10.138.15.233:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd22 | https://10.138.15.203:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd23 | https://10.138.15.215:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd24 | https://10.138.15.231:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd25 | https://10.138.15.219:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd26 | https://10.138.15.214:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd27 | https://10.138.15.220:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd28 | https://10.138.15.225:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd29 | https://10.138.15.229:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd30 | https://10.138.15.208:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd31 | https://10.138.15.223:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd32 | https://10.138.15.197:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd33 | https://10.138.15.211:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd34 | https://10.138.15.200:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd35 | https://10.138.15.205:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd36 | https://10.138.15.235:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd37 | https://10.138.15.196:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd38 | https://10.138.15.216:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd39 | https://10.138.15.213:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd40 | https://10.138.15.228:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd41 | https://10.138.15.238:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd42 | https://10.138.15.217:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd43 | https://10.138.15.194:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd44 | https://10.138.15.224:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd45 | https://10.138.15.236:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd46 | https://10.138.15.226:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd47 | https://10.138.15.222:8443 | NO       | ONLINE  | fully operational                   |+-------+----------------------------+----------+---------+-------------------------------------+
| lxd48 | https://10.138.15.201:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+
| lxd49 | https://10.138.15.237:8443 | NO       | ONLINE  | fully operational                   |
+-------+----------------------------+----------+---------+-------------------------------------+

At that point the API was very responsive and very little network traffic could be seen, mostly thanks to us not forwarding debug messages anymore, keeping event chatter to a minimum.

Spawning the container

I then created 100 projects on top of that to avoid lxc list getting too slow due to the number of containers.

After that, I started filling projects with 100 containers each running Alpine Linux.
The choice of Alpine here is simply because I don’t want any CPU or RAM stress and a simple busybox + DHCP is enough for our test.

Instance failures

One thing to note is that I used preemptible instances and so a number of cluster members will die during this test. When this happens, I will need to kick them out of the cluster by force and re-deploy them with my script, they’ll be back up in a couple minutes and will get new containers assigned to them soon after.

I also kept an eye on projects that weren’t at the target 100 containers as that indicates that some of the containers got lost due to lost cluster members, I manually re-created those. This wouldn’t have been a problem if using CEPH as the containers could just have been moved.

During this run, a total of 6 instances were destroyed and re-provisioned.

Observations while spawning

At this point, my goal was to reach 5000 containers on the cluster, reaching the target 100 containers per member.

The only thing I’ve seen during this try which may become a significant issue going to 100 cluster members is a steady increase in network bandwidth usage on the database leader, averaging around 200Mbit/s when reaching the 5000 containers mark. This is getting worse as more containers are added and database queries return more records and something that can certainly be improved by making LXD more conservative and avoid needless queries.

All other instances showed below 3Mb/s of traffic.

The new dqlite 1.0 database backend may also improve or at least change those figures due to a change in the network protocol.

Memory consumption on the database leader also went up quite a bit, up to 600MB of RES while the others were below 150MB.

Load average on the database leader was around 2 while below 0 on all the others. That was the goal by using Alpine containers which don’t do much at all at boot time, the goal is to test LXD, no the virtual hardware.

lxc cluster list and lxc cluster remove --force both performed perfectly, making it easy to detect broken instances, kick them out and replace them. LXD’s placement algorithm then started filling those instances back up as expected.

Observations at idle

After 5 hours spent to spawn all the containers, the cluster was finally running 5000 containers spread on 50 cluster members.

Load on the average cluster member was negligible with some reaching 0.00 while most were around 0.10.

LXD itself would be using around 150MB of RAM in average.
CPU consumption would show at about 5% which seems a bit high and warrants further investigation.
Network on the database leader went back down to just a few Mbit/s with occasional spikes in the tens of Mbit/s, much more reasonable than the 250Mb/s it was seen spiking at during container creation.

Some timings:

lxc cluster list: 0.1s
lxc list --project scale-00 --fast: 2s
lxc list --project scale-00: 2.5s
lxc operation list: 0.3s
lxc exec alp-10-10 --project scale-10 -- echo blah: 0.5s
lxc network list-leases lxdfan0: 100s

That last one is a worst case scenario as DHCP leases must be retrieved from every single cluster member by having them read their filesystem and send data back as well as correlating that data with container information from the database. It’s effectively the only command in my setup which actually lists all 5000 containers, their location and live information for them.

Part 2 (100 members)

The initial plan was to add another 50 instances at that point and spawn an additional 5000 containers, but the experience so far is already showing enough bottlenecks that spending 10 hours or more to get that many more containers is somewhat pointless.

Instead, I decided to just add the additional 50 instances but only spawn a single container per new instance just to keep them a bit busy.

An additional difficulty is that I hit the quota for fast local NVME storage on Google Compute so had to move those other 50 instances to using a loopback ZFS pool instead, not a huge issue given what they’ll be running but needed some changes to the scripting logic and something to keep in mind for the next test with 100 instances.

After a few minutes waiting for the instances to start, I got:

root@lxd00:~# lxc cluster list | grep ONLINE | wc -l
100

I then started creating an additional 50 containers which took almost 30 minutes.
This showed that not only does performance decrease with the number of overall containers (even when placed in projects) but the normal chatter between cluster members also causes slow downs.

With those created, I ran the same tests as before:

lxc cluster list: 0.1s (same)
lxc list --project scale-00 --fast: 2s (same)
lxc list --project scale-00: 3s (+0.5s)
lxc operation list: 0.4s (+0.1s)
lxc exec alp-10-10 --project scale-10 -- echo blah: 0.5s (same)
lxc network list-leases lxdfan0: 180s (+80s)

Looking at CPU and RAM usage, no significant change could be seen compared to the 50 members cluster from earlier, so memory is likely to increase based on container count more than cluster size.

Network wise, the spikes are now between 350 and 400Mbps so database traffic effectively accounts for almost half of the entire network connectivity of the leader.
CPU usage while doing that isn’t terribly high though at below 8% compared to 5% on the other instances.

Conclusion

First thing first, nothing really fell apart, with only a dozen containers or so which failed to create or start due to database locking errors, a re-try succeeded in all cases.

From past experience we know that we can run a cluster of around 10 members with over 10000 containers with reasonable performance and running a cluster of 50 members with 5000 containers didn’t feel slower for most operations once the containers were created.

A user on this cluster would get a shell in a running container in half a second, be able to list a project with 100 containers in a couple of seconds and only very few specific global API requests would ever end up hitting the entire set of cluster members.

The network and CPU usage on the leader is the main worry as we scale the number of cluster members and the bottlenecks appear to be mostly where we expected them to be.

It was also great to see just how well the cluster copes with some cluster members going away, then being forcefully removed from the cluster and added back as blank cluster members just to see LXD replicate images onto them and start placing containers on them.

Creation speed

It took about 5 hours to get to 5000 running containers with the spawn rate slowing down as the database chatter increased causing each query to return more and more data. This time does include a number of breaks to remove cluster members and replace them as well as a couple of mistakes in my scripts causing about 800 containers to land in the wrong projects and needing to be deleted and re-created.

This gives us a very non-impressive 0.27 containers/second.

Kernel improvements

It’s worth noting that I was using security.idmap.isolated=true and a 4.15 kernel so all containers had to be shifted on startup, switching to a 5.0 kernel with shiftfs would likely speed things up a bit.

Switching to a more recent 5.0 kernel would get us shiftfs and improved network interface data retrieval which would significantly cut down on container startup time and improve listing containers but with most of the time spent in container creation, I doubt we’d see much more than a 10% or so performance increase with that.

Database improvements

The much bigger performance boost would come from doing an analyze of our database queries for:

Container creation
Container startup
Idle daemon
Idle daemon with running containers

We have background tasks such as those dealing with container scheduling and device tracking which require pulling a complete list of all containers running on a particular cluster member. Those are the type of queries which can be very expensive and where we may be able to reduce their frequency, their number or reduce the size of the data being pulled from the database.

Cluster heartbeats

With 100 cluster members, those heartbeats being sent every second by all members can’t really be ignored and maybe we should consider decreasing the rate a bit to 5s or so.

Event handling improvements

Lastly, one thing we know we will need to improve is our event forwarding mechanism.
Right now each cluster member connects to all others to receive notifications.
With 100 members, this is a pretty substantial amount of TLS connections going on each coming with its associated goroutine.

Some possible options to improve this:

Have some members get selected as event hubs with individual daemons only connecting to one of those members to send and receive all events, those members would then interconnect to each other and relay as we do today. Which in the 100 members case would get us from 9900 total connections down to just 102 connections and with every hub only maintaining 30 connections or so.
Use the shared underlying network between the cluster members to send events on a multicast address. This would require custom encryption though as we can’t run our normal API over TLS on that.

stgraber · April 10, 2019, 7:21am

Short term we should be looking at:

Rate of heartbeats
Ability to have LXD send all DB queries to the debug log, likely through env variable
Use that ability to catch the worst offenders at idle, creation time and container startup

Then we can repeat the 50 nodes / 5000 containers test, hopefully by that point using the 5.0 HWE kernel too and see just how much faster things are.

My hope is that we can sustain at least 1 container/second rate at that point and can identify the next round of improvements, including measuring the impact of the event forwarding to know how to prioritize that work.

stgraber · April 10, 2019, 7:23am

The switch to dqlite 1.0 will also be something else that triggers a re-try to measure the CPU, RAM and network usage after the switch.

roka · April 11, 2019, 10:40am

How were you able to setup fan networking on google compute instances?

I tried it and it did not work as Google assigns unicast address to instances.
If its possible can you send me a little of your gcp configuration?
At present for my setup networking is setup independent of lxd bridge and setup using ansible scripts.

stgraber · April 11, 2019, 2:29pm

You need to manually provide the underlay subnet, in my case it was 10.138.0.0/16 which all my other instances were in.

roka · April 12, 2019, 2:13am

Will it be possible to share the scripts you used to setup gcs fan networking, at present I do it with my own ansible script with manually managed bridge (not lxd managed bridge).

stgraber · April 12, 2019, 3:00am

That part wasn’t scripted, I manually ran lxd init on the first node, which is where the fan networking was selected and the subnet manually entered.

All the other instances then picked up that setting when joining the cluster so there’s no mention of it in any of the scripts and cloud-init metadata that I used.

roka · April 19, 2019, 8:41am

I tried installing using lxd init with underlay network. It came with an error:

“Error: Failed to create network ‘lxdfan0’: No address found in subnet”

stgraber · April 22, 2019, 3:29am

That’s the error I’d expect if the underlay you provided doesn’t include your host’s address.

roka · April 23, 2019, 3:28am

ok I tried it and it works with giving the underlay. I will try to re-create my cluster using preseed and see if it works.
Just wanted to check with you if lxd init --dump still works. I see this has been removed from lxd init options.
Now the choice is to create preseed manually or by concatenating different commands using lxc info.

stgraber · April 23, 2019, 3:38am

--dump should work fine, though note that it cannot automatically fill cluster joining information as we can’t extract some of that needed data through the API.

If your goal is to join additional nodes through preseeding, the best is to do one manually, then at the end of the interactive lxd init, say yes to printing the preseed, then use that preseed for the others.