Delay in enforcing limits.cpu

I have an Ubuntu container with, among other things, limits.cpu: “1”. Normally this has the intended effect: logging into the container shows a single CPU. However, I’ve also observed, multiple times, that immediately after launching, the container momentarily displays all 40 host CPUs. I’ve seen this manifest in multiple ways, including:

  1. Immediately opening a shell and running htop inside the container
  2. Inspecting log output from the apps inside the container that start immediately

The limit eventually kicks in, but seemingly not always before apps have started up and “assumed” there are 40 cores available, which can be problematic as it causes them to heavily over-provision themselves relative to the resources that are actually available.
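Something along these lines shows the window (the container name, image and sleep are just examples):

```
# Launch with the limit set and immediately check the visible CPU count.
lxc launch ubuntu:20.04 c1 -c limits.cpu=1
lxc exec c1 -- nproc    # often prints 40 (all host CPUs) right after launch
sleep 5
lxc exec c1 -- nproc    # prints 1 once the limit has been applied
```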

Is this expected behavior?

# Base information
 - Distribution: Ubuntu
 - Distribution version: 20.04.2 LTS (Focal Fossa)
 - Kernel version: Linux fusion 5.8.0-63-generic #71~20.04.1-Ubuntu SMP Thu Jul 15 17:46:08 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
 - LXC version: 4.16
 - LXD version: 4.16
 - Snap revision: 21039

You’re correct; this is indeed how LXD applies the cpuset limits.

Basically, there is no such thing as giving an instance “one core” at the Linux level.
All that the kernel supports is either CPU time limits (CFS quotas or shares) or pinning to specific CPU cores/threads.
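For reference, those two mechanisms correspond to the kernel’s cgroup knobs. The exact paths depend on the host’s cgroup layout; the snippet below assumes a cgroup v2 style hierarchy and a placeholder cgroup path:

```
# CPU time limits (CFS): quota/period and relative weight (shares)
cat /sys/fs/cgroup/<container-cgroup>/cpu.max      # e.g. "100000 100000" = one CPU's worth of time
cat /sys/fs/cgroup/<container-cgroup>/cpu.weight   # relative share versus other cgroups

# Pinning: which CPU cores/threads the cgroup is allowed to run on
cat /sys/fs/cgroup/<container-cgroup>/cpuset.cpus  # e.g. "7" or "0-3"
```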

When you set limits.cpu: 1, this gets converted to a pinning rule, but because we don’t want to pin all instances to the same CPU/thread, LXD operates a pretty simplistic scheduler to handle this.
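As a quick illustration (the container name is hypothetical, and which CPU gets picked is up to LXD’s scheduler), you can see the result of that conversion from inside the container once the pin is in place:

```
lxc config set c1 limits.cpu 1
lxc exec c1 -- nproc                                      # -> 1
lxc exec c1 -- grep Cpus_allowed_list /proc/self/status   # e.g. "Cpus_allowed_list: 17"
```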

So at container startup, no limit is actually applied. Instead, as soon as the container is running (PID 1 spawned), a request is placed with the scheduler to re-balance all containers on the system. Instances starting or stopping, as well as CPUs being physically added or removed, all trigger the scheduler.

The scheduler then goes over every instance on the system and checks whether it has a specific pin provided by the user or a request for a number of cores/threads. It then looks at the total number of CPU cores/threads on the system and re-spreads all instances across those, updating the pinning for each of them.
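You can watch that re-spreading by starting a second instance; the CPU numbers below are just examples, and the first container’s pin may move as part of the re-balance:

```
lxc exec c1 -- grep Cpus_allowed_list /proc/1/status   # e.g. 12
lxc launch ubuntu:20.04 c2 -c limits.cpu=1             # the start triggers a re-balance
lxc exec c1 -- grep Cpus_allowed_list /proc/1/status   # may now read e.g. 3
lxc exec c2 -- grep Cpus_allowed_list /proc/1/status   # e.g. 27
```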

There are two main ways to avoid this behavior:

  • Use specific pinning (effectively schedule things yourself and set limits.cpu to a CPU range rather than a number of CPUs)
  • Don’t use pinning at all and instead expose all CPU cores to every container, restricting the amount of CPU time through limits.cpu.allowance. This is the most efficient in terms of resource usage, as the kernel is in charge of balancing across all CPUs and can take a lot more differences into account. However, some userspace software doesn’t like seeing 40 CPU cores but then only getting about 4 cores’ worth of CPU time 🙂 (config examples for both options below)
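A minimal sketch of both options, assuming a container named c1 (names and values are only examples):

```
# Option 1: explicit pinning, you pick the CPU range yourself
lxc config set c1 limits.cpu 0-3

# Option 2: no pinning, restrict CPU time instead
# (a hard limit of roughly one CPU's worth of time: 100ms every 100ms)
lxc config unset c1 limits.cpu
lxc config set c1 limits.cpu.allowance 100ms/100ms
```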