Profile slow LXD container creation

smithsonian · November 26, 2019, 2:31pm

I am creating 1000 LXD containers. I have noticed that as the number of containers increases, the creation and start times increase as well.

How do I profile where the bottleneck is during the creation and start phases?

simos · November 26, 2019, 2:50pm

Hi!

I suggest trying with 900 LXD containers. There are some Linux kernel limits that may make it more complicated to diagnose what’s going on. Have a look at the production suggestions at https://github.com/lxc/lxd/blob/master/doc/production-setup.md because when you create so many containers, you may end up hitting soft and hard resource limits on your server.

Note that when you launch a LXD container, LXD will set it up and then start it. When LXD starts a container, the container runs on its own. This means that lxc launch ubuntu:b mycontainer1 may finish in a few seconds, but at the moment you get back your shell prompt, the container has only just started booting up. So as soon as LXD has launched 100 containers and you have your shell prompt back, your host has only just started booting each one of them.
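To see how long a container keeps booting after lxc launch has already returned, one option is to poll until the container looks "ready". The following is a minimal sketch, not an official method: it treats "has a global IPv4 address" as a rough proxy for boot completion, and it assumes the field names (state, network, addresses, family, scope) in the output of lxc list --format json. The helper names are hypothetical.

```python
import json
import subprocess
import time


def has_global_ipv4(container):
    """Return True if any interface reports a global-scope IPv4 address.

    `container` is one entry from `lxc list --format json`. The field names
    used here (state, network, addresses, family, scope) are assumptions.
    """
    state = container.get("state") or {}      # may be null for stopped containers
    network = state.get("network") or {}
    for iface in network.values():
        for addr in iface.get("addresses") or []:
            if addr.get("family") == "inet" and addr.get("scope") == "global":
                return True
    return False


def list_containers():
    """Fetch the container list from the local LXD daemon as parsed JSON."""
    out = subprocess.run(
        ["lxc", "list", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def wait_until_ready(name, timeout=60.0, interval=0.5, lister=list_containers):
    """Poll until `name` has a global IPv4 address.

    Returns the seconds waited, or None if the timeout expires first.
    """
    start = time.perf_counter()
    while time.perf_counter() - start < timeout:
        for container in lister():
            if container.get("name") == name and has_global_ipv4(container):
                return time.perf_counter() - start
        time.sleep(interval)
    return None
```

Calling wait_until_ready("mycontainer1") right after the launch command returns gives an approximate figure for the boot time that lxc launch itself does not report. An image without networking would need a different readiness check.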

What you are experiencing as a bottleneck is the backlog of the many containers booting up.

There is something called shiftfs. See Trying out `shiftfs`. It helps a lot if you enable it, because without it each container takes considerably longer to set up (and therefore the backlog becomes bigger).

Also, depending on the container image that you are using, you may hit the memory limit of the host. If too many containers are running and there is not enough memory, you may crash the host.

smithsonian · November 26, 2019, 2:57pm

Hi Simos, thanks very much for your reply.

I have already applied the production suggestions you linked. I think I got them from some other post of yours. I also decided to use Open vSwitch so as not to hit the limit of 1024 ports per Linux bridge.

My system has 16 cores and 60 GB of RAM, and I am trying to scale it for a project to see how many containers I can run on the host. I limited each container to 256MB, so I don’t think I am running into the memory limit. The containers are starting, except that the startup time keeps getting slower. I also used a tmpfs as the storage pool to try to speed things up.

I will try shiftfs. Thanks.

stgraber · November 26, 2019, 3:02pm

Also, are you using lxc launch or lxc init?

If using lxc launch, the containers starting up in the background will use more and more CPU, slowing things down. lxc init doesn’t start the containers so creation speed should be more constant, with things slowing down when you then lxc start them.
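One way to check this is to time the two phases separately: run every lxc init first and record how long each takes, then do the same for lxc start. The sketch below is only an illustration. The container name prefix and the helper functions are hypothetical, and the image is whatever alias you already use.

```python
import subprocess
import time


def timed_run(argv):
    """Run a command and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, result.returncode


def init_commands(image, prefix, count):
    """Build one `lxc init` argv per container (names are prefix1..prefixN)."""
    return [["lxc", "init", image, f"{prefix}{i}"] for i in range(1, count + 1)]


def start_commands(prefix, count):
    """Build one `lxc start` argv per container created by init_commands."""
    return [["lxc", "start", f"{prefix}{i}"] for i in range(1, count + 1)]


def profile(commands, runner=timed_run):
    """Run each command in order; return (container_name, seconds, return_code)."""
    timings = []
    for argv in commands:
        elapsed, rc = runner(argv)
        timings.append((argv[-1], elapsed, rc))
    return timings
```

For example, profile(init_commands(image, "c", 1000)) followed by profile(start_commands("c", 1000)) yields two lists of per-container timings, so you can see whether the slowdown comes from the creation phase or from the start phase.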

It’s also pretty dependent on your image. Many distributions do a lot of costly work on initial boot, then go idle and don’t use many resources at all. It’s not uncommon to have your load average climb to over 100 when spawning hundreds of containers, then once they’re all done with their initial boot, it goes back down to 0 because they’re not doing anything and there is no more CPU/IO pressure on the system.

smithsonian · November 26, 2019, 3:05pm

Hi Stéphane,

I am using lxc launch, launching what I think are lightweight Alpine images. I will try init as well. Thanks.

My CPU load looks fairly normal so I am not sure if it is CPU. I will also try debugging Open vSwitch - maybe that’s where the bottleneck is?

That’s why I am looking to debug this to find out what causes the slowdown. The first few hundred containers are very quick to start (launch) but after that it takes some time.

smithsonian · November 26, 2019, 3:28pm

Just to share some actual numbers: the first 200 containers took 0.6 seconds to init (lxc init). At 400, it had increased to 1 second. At 500, 1.4 seconds.

These numbers are not bad, I think, so lxc init made a huge difference.
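A quick way to read figures like these is to look at how much extra time each additional container adds between samples. The sketch below assumes the numbers above are per-container init times at the given container counts, which is an interpretation rather than something stated outright. The function name is hypothetical.

```python
def growth_rates(points):
    """Return extra seconds per additional container between consecutive samples.

    `points` is a list of (container_count, seconds_per_init) pairs,
    sorted by container_count.
    """
    rates = []
    for (n0, t0), (n1, t1) in zip(points, points[1:]):
        rates.append((t1 - t0) / (n1 - n0))
    return rates


# Figures reported in the thread, read as per-container init time (an assumption).
samples = [(200, 0.6), (400, 1.0), (500, 1.4)]
rates = growth_rates(samples)
```

Under that reading, each extra container adds roughly 2 ms between 200 and 400 containers and roughly 4 ms between 400 and 500, which suggests the per-container cost is growing faster than linearly. Three samples are too few to be sure, so more data points would be needed to confirm the trend.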