DAC_OVERRIDE capability not working in unprivileged container

Hi all!
I’m working on some changes to the snap-confine program, but for a couple of days I’ve been stuck on a problem that I fail to understand: snap-confine needs to create a directory under /sys/fs/cgroup/freezer/, but the mkdirat syscall fails with EACCES:

cannot create cgroup hierarchy /sys/fs/cgroup/freezer/snap.test-snapd-sh: Permission denied

Other information that might be relevant:

  1. snap-confine is setuid root: when it starts, in my branch I drop to the ordinary user but I retain a few capabilities, including DAC_OVERRIDE.
  2. Even modifying the program to not drop to the ordinary user and to keep running as setuid root (cap_to_text() shows =ep, that is, all caps are effective) does not help; I get the same error.
  3. /sys/fs/cgroup/freezer/ has this (weird, IMHO) ownership:
    # lxd.lxc exec my-ubuntu -- ls -ld /sys/fs/cgroup/freezer/
    drwxrwxr-x 3 nobody root 0 Nov 17 07:09 /sys/fs/cgroup/freezer/
    
  4. No apparmor or seccomp denials are visible in the logs
  5. snap-confine is executed as an ordinary user.
  6. The test code (where you can see how the machine is initialized) is here.
  7. The command that is failing is on line 146:
    lxd.lxc exec my-ubuntu -- su -l ubuntu -c "/snap/bin/test-snapd-sh.sh -c 'echo from-the-inside'"
    
    (snap-confine is executed as part of the /snap/bin/test-snapd-sh.sh command).

The reason why it’s currently working in the snapd master branch is that, before doing that mkdirat call, we change our effective group to root as well. It then matches the group set on /sys/fs/cgroup/freezer/, so we have permission to create child items according to the DAC. But this should not be needed, since we have DAC_OVERRIDE.

So, the question is, why can’t we create a directory under /sys/fs/cgroup/freezer/ even being root or having DAC_OVERRIDE?

Any ideas @stgraber ?

root@shell01:~# ls -lh /sys/fs/cgroup/
total 0
drwxrwxr-x 5 nobody root  0 Nov 15 21:28 blkio
lrwxrwxrwx 1 root   root 11 Nov 14 04:27 cpu -> cpu,cpuacct
drwxrwxr-x 5 nobody root  0 Nov 15 21:28 cpu,cpuacct
lrwxrwxrwx 1 root   root 11 Nov 14 04:27 cpuacct -> cpu,cpuacct
drwxrwxr-x 2 nobody root  0 Nov 14 04:27 cpuset
drwxrwxr-x 5 nobody root  0 Nov 15 21:28 devices
drwxrwxr-x 3 nobody root  0 Nov 14 04:27 freezer
drwxrwxr-x 2 nobody root  0 Nov 14 04:27 hugetlb
drwxrwxr-x 5 nobody root  0 Nov 15 21:28 memory
drwxrwxr-x 2 nobody root  0 Nov 14 04:27 misc
lrwxrwxrwx 1 root   root 16 Nov 14 04:27 net_cls -> net_cls,net_prio
drwxrwxr-x 2 nobody root  0 Nov 14 04:27 net_cls,net_prio
lrwxrwxrwx 1 root   root 16 Nov 14 04:27 net_prio -> net_cls,net_prio
drwxrwxr-x 2 nobody root  0 Nov 14 04:27 perf_event
drwxrwxr-x 5 nobody root  0 Nov 15 21:28 pids
drwxrwxr-x 2 nobody root  0 Nov 14 04:27 rdma
drwxrwxr-x 5 nobody root  0 Nov 14 04:27 systemd
drwxrwxr-x 6 nobody root  0 Nov 16 02:54 unified
root@shell01:~# ls -lh /sys/fs/cgroup/freezer/
total 0
-rw-r--r-- 1 nobody nogroup 0 Nov 18 16:11 cgroup.clone_children
-rw-rw-r-- 1 nobody root    0 Nov 14 04:27 cgroup.procs
-r--r--r-- 1 nobody nogroup 0 Nov 18 16:11 freezer.parent_freezing
-r--r--r-- 1 nobody nogroup 0 Nov 18 16:11 freezer.self_freezing
-rw-r--r-- 1 nobody nogroup 0 Nov 18 09:56 freezer.state
-rw-r--r-- 1 nobody nogroup 0 Nov 18 16:11 notify_on_release
drwxr-xr-x 2 root   root    0 Nov 14 04:27 snap.lxd
-rw-rw-r-- 1 nobody root    0 Nov 14 04:27 tasks
root@shell01:~# mkdir /sys/fs/cgroup/freezer/blah
root@shell01:~# ls -lh /sys/fs/cgroup/freezer/
total 0
drwxr-xr-x 2 root   root    0 Nov 18 16:11 blah
-rw-r--r-- 1 nobody nogroup 0 Nov 18 16:11 cgroup.clone_children
-rw-rw-r-- 1 nobody root    0 Nov 14 04:27 cgroup.procs
-r--r--r-- 1 nobody nogroup 0 Nov 18 16:11 freezer.parent_freezing
-r--r--r-- 1 nobody nogroup 0 Nov 18 16:11 freezer.self_freezing
-rw-r--r-- 1 nobody nogroup 0 Nov 18 09:56 freezer.state
-rw-r--r-- 1 nobody nogroup 0 Nov 18 16:11 notify_on_release
drwxr-xr-x 2 root   root    0 Nov 14 04:27 snap.lxd
-rw-rw-r-- 1 nobody root    0 Nov 14 04:27 tasks
root@shell01:~# 

Seems like I can create a subdirectory of the freezer controller just fine here?

Yes, but that’s because you are root. In my case I’m running a setuid program. Please try this:

apt install -yq gcc
cat > my-mkdir.c <<EOF
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    int rc = mkdir("/sys/fs/cgroup/freezer/my-dir", 0755);
    if (rc < 0) {
        perror("mkdir failed");
        return EXIT_FAILURE;
    }
    puts("directory created");
    return EXIT_SUCCESS;
}
EOF
gcc -o my-mkdir my-mkdir.c
chmod u+s my-mkdir
mv my-mkdir /home/ubuntu/
su -l ubuntu -c ./my-mkdir

It will fail with permission denied. Note that the effective user when running this program will be root, and as such it will have CAP_DAC_OVERRIDE set (I can also make another test where I drop the root effective user as well and just retain the CAP_DAC_OVERRIDE capability, if you’d rather see that). But the directory creation fails because the group does not match; yet the same operation succeeds when run on the host machine.

Hi @mardy !

Please try adding printf("%d %d %d\n", getpid(), geteuid(), getuid()); at the beginning of the main function (this also needs #include <unistd.h>) and check that geteuid() returns 0. I played with that reproducer on my system and noticed that, from a user namespace without a root mapping, the effective UID will not be 0. This may be the reason for the behavior you are seeing.

Hi Aleksandr! Here I do get

root@my-ubuntu:~# su -l ubuntu -c ./my-mkdir
7900 0 1000
mkdir failed: Permission denied

So, the effective ID is root, yet the directory creation fails.

Hi, Alberto!

Are you executing this command from the host or from inside the container?

I’m executing it from the container. If I execute it from the host, it works.

Ah, okay. Could you show your container configuration? lxc config show CT_name -e

Here it is:

# lxc config show my-ubuntu -e
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20221115.1)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20221115.1"
  image.type: squashfs
  image.version: "20.04"
  volatile.base_image: c8d5644eef2a1cc28b2070579c5d26f453ce6bea6d978d1b0c7bd7f1af69fbfd
  volatile.cloud-init.instance-id: b5c9ffcc-492f-4e8f-9e5b-290145960bfa
  volatile.eth0.host_name: veth3dea9f36
  volatile.eth0.hwaddr: 00:16:3e:e0:07:ce
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 85388290-ceea-4af4-933a-fb83756899f6
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

And here is how the container is created: snapd/task.yaml at 17bcb0311e95869c8de432fc5143b58d4fcdd104 · snapcore/snapd · GitHub