LXD equivalent of docker --cgroup-parent

Hello,

I have a server running Ubuntu Server 20.04 with 8 cores. Cores 0-1 (system) are dedicated to OS processes and cores 2-7 (user) are reserved (shielded) for user processes via cset. I would like to provision LXC containers via LXD and have them execute only on the “user” cpuset. With Docker, this can be done by passing the option --cgroup-parent=/user (https://docs.docker.com/engine/reference/commandline/dockerd/#miscellaneous-options). What would be the LXD equivalent? I tried setting raw.lxc=lxc.cgroup.dir=/user in the configuration of my container, but that doesn’t work: when I run:

cat /proc/cpuinfo

inside the container, only CPUs 0 and 1 show up, which means the container is running under the system cgroup. Indeed, under /sys/fs/cgroup/cpuset/system/user I do see the monitor and payload cgroups of my container. So I tried manually setting the container group’s cpuset.cpus to 2-7 with raw.lxc=lxc.cgroup.cpuset.cpus=2-7, but that of course didn’t work either: the container just fails to start. The error message in the log is:

> Name: test-container
> Location: none
> Remote: unix://
> Architecture: x86_64
> Created: 2020/12/14 16:26 UTC
> Status: Stopped
> Type: container
> Profiles: default
> 
> Log:
> 
> lxc test-container 20201214192010.525 WARN     cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset/system/lxc.monitor.test-container"
> lxc test-container 20201214192010.526 WARN     cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset/system/lxc.payload.test-container"
> lxc test-container 20201214192010.581 ERROR    cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits_legacy:2873 - Permission denied - Failed to set "cpuset.cpus" to "2-7"
> lxc test-container 20201214192010.581 ERROR    start - start.c:lxc_spawn:1741 - Failed to setup cgroup limits for container "test-container"
> lxc test-container 20201214192010.581 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:860 - Received container state "ABORTING" instead of "RUNNING"
> lxc test-container 20201214192010.583 ERROR    start - start.c:__lxc_start:1999 - Failed to spawn container "test-container"
> lxc test-container 20201214192010.583 WARN     start - start.c:lxc_abort:1013 - No such process - Failed to send SIGKILL via pidfd 31 for process 168790
> lxc 20201214192010.925 WARN     commands - commands.c:lxc_cmd_rsp_recv:126 - Connection reset by peer - Failed to receive response for command "get_state"

I suspect this has to do with the fact that the LXD daemon itself is running under the system cgroup.
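For reference, this is roughly how I set those options (the exact commands are from memory; the container name matches the log above):

# attempt 1: place the container's cgroups under /user
lxc config set test-container raw.lxc "lxc.cgroup.dir=/user"

# attempt 2: pin the cpuset directly (fails with the "Permission denied" error above)
lxc config set test-container raw.lxc "lxc.cgroup.cpuset.cpus=2-7"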

What is the best way to achieve this? The only relevant threads I found are:

- What is the best way to use numactl or taskset and chrt in lxd which cpus are isolated from the host
- How to allocate cores which are in isolcpus list of host

where the only solution seems to be to manually assign the CPUs to the containers.

OS: Ubuntu Server 20.04
LXD version: 4.8

I’m not sure if this is what you’re after, but take a look at the CPU limits options available:

https://linuxcontainers.org/lxd/docs/master/instances#cpu-limits, specifically the limits.cpu setting.
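For example, something along these lines pins an instance to specific cores (the instance name is just a placeholder):

# ask LXD to pin the instance to cores 2-7 (applied through the cpuset cgroup)
lxc config set mycontainer limits.cpu 2-7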

That was the first thing that I tried. When I set:

limits.cpu="2-7"

in the configuration, only cores 0-1 are visible, which I do expect to happen since the containers are created under the system cgroup.

btw, this is my first post on this forum, so let me know if you need more detailed info.

@stgraber @brauner is this possible? Thanks

The limits.cpu setting won’t do what you want. What you want, I assume, is for the container to use the /user cgroup. Try something like the following:

raw.lxc: |-
  lxc.cgroup.relative = 1
  lxc.cgroup.dir = /bla

This would cause separate cgroups to be created for the monitor process ([lxc monitor]) and for the container itself under /sys/fs/cgroup/<controller>/<dir> (here /bla), e.g.:

1. [lxc monitor]
12:blkio:/user.slice/bla/lxc.monitor.f8
11:perf_event:/bla/lxc.monitor.f8
10:hugetlb:/bla/lxc.monitor.f8
9:devices:/user.slice/bla/lxc.monitor.f8
8:cpu,cpuacct:/user.slice/bla/lxc.monitor.f8
7:freezer:/user/root/0/bla/lxc.monitor.f8
6:memory:/user/root/0/bla/lxc.monitor.f8
5:pids:/user.slice/user-1000.slice/session-1.scope/bla/lxc.monitor.f8
4:cpuset:/bla/lxc.monitor.f8
3:rdma:/bla/lxc.monitor.f8
2:net_cls,net_prio:/bla/lxc.monitor.f8
1:name=systemd:/user/root/0/bla/lxc.monitor.f8
0::/user.slice/user-1000.slice/session-1.scope/bla/lxc.monitor.f8
2. container
12:blkio:/user.slice/bla/lxc.payload.f8
11:perf_event:/bla/lxc.payload.f8
10:hugetlb:/bla/lxc.payload.f8
9:devices:/user.slice/bla/lxc.payload.f8
8:cpu,cpuacct:/user.slice/bla/lxc.payload.f8
7:freezer:/user/root/0/bla/lxc.payload.f8
6:memory:/user/root/0/bla/lxc.payload.f8
5:pids:/user.slice/user-1000.slice/session-1.scope/bla/lxc.payload.f8
4:cpuset:/bla/lxc.payload.f8
3:rdma:/bla/lxc.payload.f8
2:net_cls,net_prio:/bla/lxc.payload.f8
1:name=systemd:/user/root/0/bla/lxc.payload.f8/init.scope
0::/user.slice/user-1000.slice/session-1.scope/bla/lxc.payload.f8/init.scope
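If it helps, one way to apply this to a single container is to pass the multi-line value as one quoted argument (the container name f8 is taken from the example above; quoting assumed to survive your shell):

lxc config set f8 raw.lxc "lxc.cgroup.relative = 1
lxc.cgroup.dir = /bla"
lxc restart f8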

Thanks a lot and sorry for the late reply.

That works (sometimes), but not quite how I would like, and not reliably.

I added the options you specified to the profile used by my containers. Here is what the profile looks like:

config:
  raw.lxc: |-
    lxc.cgroup.relative=1
    lxc.cgroup.dir=/user
description: A profile
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: base
used_by: []
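In case it matters, I applied the profile roughly like this (with the YAML above saved as base.yaml):

lxc profile create base
lxc profile edit base < base.yaml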

At the beginning, the containers’ cgroups were created under /system/user, so I deleted the cgroup /system/user and all of its children, rebooted, and re-applied the cset configuration on startup. This is the output of cset set:

giu@server:~$ cset set
cset:
Name CPUs-X MEMs-X Tasks Subs Path


     root        0-7 y       0 y   337    2 /
     user        2-7 n       0 n     0    0 /user
   system        0-1 n       0 n    74    0 /system

Then I spawned a container with the new profile with:

lxc launch ubuntu:focal a-container --profile base

This changed the cset configuration:

cset:
Name CPUs-X MEMs-X Tasks Subs Path


     root        0-7 y       0 y   359    2 /
     user        0-7 n       0 n     0    2 /user
   system        0-1 n       0 n   106    0 /system

And didn’t achieve the desired effect:

giu@server:~$ lxc shell a-container 
root@a-container:~# cat /proc/cpuinfo | grep "core id"
core id		: 0
core id		: 1

So I created another container, this time specifying the option limits.cpu: 2-7. This kinda worked:

root@another-container:~# cat /proc/cpuinfo | grep "core id"
core id : 0
core id : 1
core id : 2
core id : 3
core id : 4
core id : 5
core id : 6
core id : 7

But that’s still not quite what I want, as cores 0-1 shouldn’t be used by the container (I specified limits.cpu = 2-7). So I deleted all the containers and tried to re-apply the cset settings with the following script:

cset set --set system --cpu=0-1
cset set --set user --cpu=0-7
cset proc -m -f root -t system -k

The lxc.pivot cgroup under /user would cause that to fail, so I deleted it with

sudo cgdelete cpuset:/user/lxc.pivot

and re-ran the previous script. The cset config was now correct, but all the new containers’ cgroups would again be created under /system/user.

Any ideas? Thanks in advance for your help.

P.S.: somehow this works as expected on a similar system (same hardware, same OS), but I have no clue what the difference is.

Can you show the output of /proc/<container-init-pid>/cgroup and /proc/<container-monitor-pid>/cgroup, please?
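If it helps, lxc info reports the init PID, and the monitor process renames itself to “[lxc monitor] …”, so something along these lines should find both (container name is just an example):

# init PID of the container (look for the "Pid:" line)
lxc info test-container | grep -i pid

# PID of the monitor process
pgrep -af "lxc monitor"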

Sorry for the late reply; I forgot about this after Christmas.

So, I tried to reproduce the issue once again. After a fresh boot I ran the following commands to create the cpusets and move all the kernel threads to the system cpuset:

sudo cset set --set system --cpu=0-1
sudo cset set --set user --cpu=2-7
sudo cset proc -m -f root -t system --kthread --force

Next I spawned a container with the profile described above: again, instead of being created under /user, the container’s cgroup is created under /system/user, which lxc creates automatically.
The output of /proc/<container-init-pid>/cgroup is:

12:hugetlb:/user/lxc.payload.test
11:net_cls,net_prio:/user/lxc.payload.test
10:freezer:/user/lxc.payload.test
9:rdma:/user/lxc.payload.test
8:cpu,cpuacct:/user/lxc.payload.test
7:memory:/user/lxc.payload.test/init.scope
6:devices:/user/lxc.payload.test/init.scope
5:cpuset:/system/user/lxc.payload.test
4:pids:/user/lxc.payload.test/init.scope
3:perf_event:/user/lxc.payload.test
2:blkio:/user/lxc.payload.test
1:name=systemd:/user/lxc.payload.test/init.scope
0::/user/lxc.payload.test/init.scope

The output of /proc/<container-monitor-pid>/cgroup is:

12:hugetlb:/user/lxc.monitor.test
11:net_cls,net_prio:/user/lxc.monitor.test
10:freezer:/user/lxc.monitor.test
9:rdma:/user/lxc.monitor.test
8:cpu,cpuacct:/user/lxc.monitor.test
7:memory:/user/lxc.monitor.test
6:devices:/user/lxc.monitor.test
5:cpuset:/system/user/lxc.monitor.test
4:pids:/user/lxc.monitor.test
3:perf_event:/user/lxc.monitor.test
2:blkio:/user/lxc.monitor.test
1:name=systemd:/user/lxc.monitor.test
0::/user/lxc.monitor.test

So everything is under /user except for the cpuset controller.

Then, instead of using the commands above to create the cpusets, I tried just

sudo cset set --set user --cpu=2-7

and then spawned the container. This time it worked as expected! So I suppose that moving all the processes and unbound kernel threads to system with:

sudo cset proc -m -f root -t system --kthread --force

or (equivalently):

sudo cset shield --cpu=2-7

messes things up. Unfortunately, for my use case I really need to use cset shield as I need cores 2-7 to be completely isolated.

@brauner I made two test containers with the following config (thanks to @simos for How to add multi-line raw.lxc configuration to LXD – Mi blog lah!):

raw.lxc: |-
  lxc.cgroup.relative = 1
  lxc.cgroup.dir = testing # a leading / does not work, as described above

Then I ran (from cgroup-tools):

cgset -r cpuset.cpus=0-1 testing

Inside the containers I can still see all 4 CPUs though:

$ cgget -r cpuset.cpus -r cpuset.cpus.effective testing/lxc.payload.test

testing/lxc.payload.test:
cpuset.cpus: 0-3
cpuset.cpus.effective: 0-1

How is that possible?
The cgroup-v2 docs state (linux/Documentation/admin-guide/cgroup-v2.rst at master · torvalds/linux · GitHub):

cpuset.cpus
It lists the requested CPUs to be used by tasks within this cgroup. The actual list of CPUs to be granted, however, is subjected to constraints imposed by its parent and can differ from the requested CPUs.
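So, if I read that correctly, the way to see where the constraint comes from should be to compare the parent’s and the child’s values with the same keys as above, e.g.:

# parent cgroup: what I set with cgset
cgget -r cpuset.cpus -r cpuset.cpus.effective testing

# child cgroup: what the container requested vs. what it was actually granted
cgget -r cpuset.cpus -r cpuset.cpus.effective testing/lxc.payload.test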

I am trying to implement a workaround for what I described here: