Privileged cgroup v1 guest on v2 host can't mount cgroup

tike64 · November 30, 2023, 8:31am

Dear all,

I’m trying to run a privileged guest with systemd and v1 cgroups (Ubuntu Xenial) on a host with v2 cgroups (Debian 11, kernel 5.10.0-26-amd64). I’m fairly sure this was possible earlier, on an older kernel, by simply mounting tmpfs on /sys/fs/cgroup in the container. Now it leads to:

Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted

My debugging boiled down to trying unshare. I can mount the v1 cgroup outside the container:

~$ sudo mount -t cgroup -o none,name=systemd none mnt
~$ ls mnt
cgroup.clone_children  cgroup.sane_behavior  release_agent
cgroup.procs           notify_on_release     tasks

But not if I enter a new cgroup namespace:

~$ sudo unshare -C
/home/timo# mount -t cgroup -o none,name=systemd none mnt
mount: /home/timo/mnt: permission denied.

That works on an older kernel (5.4.66-0-lts):

~# unshare -C
~# mount -t cgroup -o none,name=systemd systemd cg
~# ls cg
cgroup.clone_children  cgroup.sane_behavior  release_agent
cgroup.procs           notify_on_release     tasks

Is this a kernel bug or intended behavior, or am I on the wrong track? What can I do to make my containers run again?

Unfortunately I no longer have access to my older workstation where the container worked.

stgraber · November 30, 2023, 3:18pm

Can you try doing that mount on the host, keeping it there and then making a new mount in the namespace?

I know that some kernel versions need a cgroup controller to be set up by the root user on the host prior to its use in a container. In your case, though, the container appears to be privileged, so it shouldn’t really be affected by this.

tike64 · November 30, 2023, 7:45pm

Thanks for replying,

Yes, privileged container (edited my post).

I’m not sure I follow. I did:

~$ sudo mount -t cgroup -o none,name=systemd none cg
~$ ls cg
cgroup.clone_children  cgroup.sane_behavior  release_agent
cgroup.procs           notify_on_release     tasks
~$ sudo unshare -C
/home/timo# ls cg
cgroup.clone_children  cgroup.sane_behavior  release_agent
cgroup.procs	       notify_on_release     tasks
/home/timo# mount -t cgroup -o none,name=systemd none cg
mount: /home/timo/cg: none already mounted on /sys/fs/bpf.
/home/timo# ls cg
cgroup.clone_children  cgroup.sane_behavior  release_agent
cgroup.procs	       notify_on_release     tasks

The mount is also visible in the namespace, but the mount command fails. I don’t know how I would try this with the container, but I would expect systemd to fail as well when it tries to mount the cgroup.

The error message is very confusing: “already mounted on /sys/fs/bpf”.
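To see where a named hierarchy is actually mounted, regardless of what the error message reports, one could scan /proc/mounts for cgroup entries carrying the matching name= option. This is only a minimal sketch: named_cgroup_mounts is a hypothetical helper name, and it assumes the usual /proc/mounts field order (device, mount point, type, options).

```shell
#!/bin/sh
# Sketch: list the mount points of a named cgroup v1 hierarchy.
# Reads /proc/mounts-style lines on stdin:
#   device mountpoint fstype options dump pass
# named_cgroup_mounts is a hypothetical helper, not an existing tool.
named_cgroup_mounts() {
    awk -v want="name=$1" '
        $3 == "cgroup" {
            # Options are comma-separated; look for an exact name=... match.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == want) { print $2; break }
        }'
}
```

It would be used as: named_cgroup_mounts systemd < /proc/mounts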

tike64 · November 30, 2023, 8:05pm

Ok, I found out that if I unshare -Cm and umount the mount point in the namespace, then I can successfully do the mount in the namespace. Or I can mount into a different mount point.
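For reference, the sequence described here can be sketched as a small script. It is an illustration only: the helper names are hypothetical, the mount point and hierarchy name are taken from the transcripts above, and by default it only prints the command instead of running it (running it needs root).

```shell
#!/bin/sh
# Sketch: remount a named v1 hierarchy inside fresh cgroup + mount
# namespaces. Helper names are hypothetical; arguments are assumed to
# be simple words without spaces or quotes.

# Build the command line that runs inside the new namespaces:
# drop the inherited mount first, then mount the hierarchy again.
ns_script() {
    printf 'umount %s && mount -t cgroup -o none,name=%s none %s\n' "$1" "$2" "$1"
}

# With RUN=1 (as root) execute it via unshare -Cm; otherwise print it.
remount_in_ns() {
    if [ "${RUN:-0}" = 1 ]; then
        unshare -Cm sh -c "$(ns_script "$1" "$2")"
    else
        printf "unshare -Cm sh -c '%s'\n" "$(ns_script "$1" "$2")"
    fi
}
```

Calling remount_in_ns cg systemd prints the command; with RUN=1 it is executed, and the umount stays confined to the new mount namespace.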

Now I’m not sure if I know how to apply this in the container setup…

tike64 · November 30, 2023, 8:59pm

Hmm, I write faster than I think…

I mounted the cgroup somewhere on the host (/home/timo/cg) and the container started to work. Thanks a lot for steering me onto the right track!

Now I wonder how I would apply this knowledge in a ‘correct’ or ‘elegant’ way. Should this work out of the box? Is there something missing or wrong in my installation?

stgraber · November 30, 2023, 10:29pm

You’ll probably want some kind of init script that mounts name=systemd somewhere on the system to unblock things.

tike64 · December 1, 2023, 8:41am

I made a pre-start hook:

#!/bin/bash

P=/run/lxc/cgroup/systemd

# Nothing to do if the hierarchy is already mounted.
mountpoint -q "$P" && exit 0

mkdir -p "$P"
mount -t cgroup -o none,name=systemd cgroup "$P"

It is working nicely now, but I’m a little worried about container separation. When systemd in the container adds things to the cgroup tree, e.g. user.slice, they are visible on the host side too. Is this going to cause problems if I have multiple v1 containers? Is there a way to somehow confine the mounts?

tike64 · December 3, 2023, 5:09pm

I’m trying to study and learn as I’m writing.

I hooked the script into lxc.hook.start-host and tried to improve it a bit:

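#!/bin/bash

P=/run/lxc/cgroup

# Mount the shared name=lxc hierarchy once; later runs reuse it.
if ! mountpoint -q "$P"; then
        mkdir -p "$P" && mount -t cgroup -o none,name=lxc cgroup "$P" || exit 1
fi

# $1 is expected to be the container name; -p tolerates a restart.
mkdir -p "$P/$1"
logger "Starting $1 ppid:$PPID lxc_pid:$LXC_PID"
echo "$LXC_PID" >"$P/$1/tasks"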

Does that look like a sane approach? Can I somehow check that the containers are not interfering with each other?
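One way to at least check membership would be to read /proc/PID/cgroup for a container process and see which path it has in the name=lxc hierarchy. The sketch below assumes the hook above has run; path_in_named_hierarchy is a hypothetical helper name, and it only shows where a process sits in that hierarchy, not whether the containers are isolated in any other respect.

```shell
#!/bin/sh
# Sketch: print the cgroup path a process has in a named v1 hierarchy.
# Reads /proc/<pid>/cgroup-style lines on stdin, which have the form
#   hierarchy-ID:controller-list:cgroup-path
# path_in_named_hierarchy is a hypothetical helper, not an existing tool.
path_in_named_hierarchy() {
    awk -v want="name=$1" '{
        # Split on the first two colons only, so a path that itself
        # contains ":" is printed intact.
        i = index($0, ":"); rest = substr($0, i + 1)
        j = index(rest, ":")
        if (substr(rest, 1, j - 1) == want) print substr(rest, j + 1)
    }'
}
```

It would be used as: path_in_named_hierarchy lxc < /proc/PID/cgroup, with PID being the container's init or one of its children. Each container's processes should report that container's own directory, and /run/lxc/cgroup/NAME/tasks should list only that container's PIDs.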