Debian 12 / Kernel 6.1.0-11: Behavior of security.idmap.isolated

On Debian 12.1 (6.1.0-11-amd64) running LXD/LXC, setting security.idmap.isolated=true on an unprivileged container seems to fail to update the owner/group of the container’s files.

Here is an example:

# lxc launch images:debian/12 debian
(...)

# lxc config get debian volatile.idmap.base
296608

# lxc stop debian
Error: The instance is already stopped

# lxc config set debian security.idmap.isolated true

# lxc config get debian security.idmap.isolated
true

# lxc start debian

Now if I list the files on the container volume, I see they’re all owned by the host’s root user:

# ls -la /mnt/NVME1/lxd/containers/debian/rootfs/
total 24
drwxr-xr-x 1 root   root  154 Sep  5 06:28 .
d--x------ 1 296608 root   78 Sep  5 15:59 ..
lrwxrwxrwx 1 root   root    7 Sep  5 06:25 bin -> usr/bin
drwxr-xr-x 1 root   root    0 Jul 14 17:00 boot
drwxr-xr-x 1 root   root    0 Sep  5 06:28 dev
drwxr-xr-x 1 root   root 1570 Sep  5 06:28 etc

I tried multiple versions of LXD/LXC. This happens with 5.0.2 from apt as well as with 4.0 and 5.17 (latest) from snap.

Interestingly enough, I have another machine running Debian 10 (4.19.0-25-amd64) with an older LXD 4 from snap, and on that one things work as expected:

# ls -la /mnt/NVME1/lxd/containers/debian/rootfs/
total 0
drwxr-xr-x 1 1065536 1065536  138 Oct 29  2020 .
d--x------ 1 1065536 root      78 Oct 14  2020 ..
drwxr-xr-x 1 1065536 1065536 1328 Jul 24 19:07 bin
drwxr-xr-x 1 1065536 1065536    0 Sep 19  2020 boot
drwxr-xr-x 1 1065536 1065536    0 Oct 14  2020 dev
drwxr-xr-x 1 1065536 1065536 1716 Jul 24 19:08 etc

As you can see, on this system all the files are owned by 1065536:1065536.


Update:

I tried to probe around the maps with lxc config show debian on both machines and saw this:

Machine running Debian 10:

security.idmap.isolated: "true"
(...)
volatile.idmap.base: "1065536"
volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1065536,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1065536,"Nsid":0,"Maprange":65536}]'
volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1065536,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1065536,"Nsid":0,"Maprange":65536}]'
volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1065536,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1065536,"Nsid":0,"Maprange":65536}]'

Machine running Debian 12:

security.idmap.isolated: "true"
(...)
volatile.idmap.base: "231072"
volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":231072,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":231072,"Nsid":0,"Maprange":65536}]'
volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":231072,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":231072,"Nsid":0,"Maprange":65536}]'
volatile.last_state.idmap: '[]'

Update:

I also tried a fresh install of Debian 11 (5.10.0-25-amd64) and it works as expected:

root@vm-debian-11-cli:~# ls -la /mnt/NVME1/lxd/containers/debian/rootfs/
total 24
drwxr-xr-x 1 1065536 1065536  154 Sep  6 06:28 .
d--x------ 1 1065536 root      78 Sep  6 15:31 ..
lrwxrwxrwx 1 1065536 1065536    7 Sep  6 06:25 bin -> usr/bin
drwxr-xr-x 1 1065536 1065536    0 Jul 14 17:00 boot
drwxr-xr-x 1 1065536 1065536    0 Sep  6 06:28 dev
drwxr-xr-x 1 1065536 1065536 1570 Sep  6 06:28 etc

Why didn’t it populate volatile.last_state.idmap (it’s still '[]')? Since this works with both Debian 10 and 11, it apparently can be related to the new kernel and/or its configuration.


The only logs I get on the Debian 12 machine are:

-- Boot 337145edcc8f491e80559f44887f3e5e --
Sep 06 15:46:30 vm-debian-12-cli systemd[1]: Starting lxd.service - LXD Container Hypervisor...
Sep 06 15:46:30 vm-debian-12-cli lxd[796]: time="2023-09-06T15:46:30+01:00" level=warning msg=" - Couldn't find the CGroup hugetlb controller, hugepage limits will be ignored"
Sep 06 15:46:30 vm-debian-12-cli lxd[796]: time="2023-09-06T15:46:30+01:00" level=warning msg=" - Couldn't find the CGroup network priority controller, network priority will be ignored"
Sep 06 15:46:30 vm-debian-12-cli lxd[796]: time="2023-09-06T15:46:30+01:00" level=warning msg="Instance type not operational" driver=qemu err="QEMU command not available for CPU architecture" typ>
Sep 06 15:46:32 vm-debian-12-cli systemd[1]: Started lxd.service - LXD Container Hypervisor.

How can I fix it? Thank you.

So what’s the problem exactly?

Prior to VFS idmap being available, we needed to work around file ownership by having LXD manually rewrite the owner of every single file on disk. That’s what you’re showing here on an older kernel.

On newer kernels, this is no longer needed as we can have the kernel keep the permissions on-disk unshifted and just shift in-kernel so the ownership looks correct inside of the container.

What you’re showing above looks like a perfectly working setup on a kernel that does support VFS idmap.
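
For example (a sketch reusing the container and paths from above; output abbreviated): the data stays root-owned on disk, yet also shows up as root-owned from inside the container, because the mapping lives on the mount rather than in the files.

# ls -ld /mnt/NVME1/lxd/containers/debian/rootfs/etc
drwxr-xr-x 1 root root 1570 Sep  5 06:28 /mnt/NVME1/lxd/containers/debian/rootfs/etc
# lxc exec debian -- ls -ld /etc
drwxr-xr-x 1 root root 1570 Sep  5 06:28 /etc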

To make sure everything is correct, on such a kernel with security.idmap.isolated=true you should see:

  • /var/lib/lxd/storage-pools/POOL/containers/NAME/rootfs is unshifted (most files belong to root:root)
  • / and sub-directories ownership as seen from inside the container also shows up as mostly owned by root:root
  • cat /proc/self/uid_map from inside the container shows a map of 65536 uids/gids whose base is different (isolated) from your other containers (see the quick check below)
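
To check that last point (a quick sketch; "debian" is the container from above, "other" is a placeholder name and the base values are illustrative), compare the maps of two running containers from the host:

# lxc exec debian -- cat /proc/self/uid_map
         0     231072      65536
# lxc exec other -- cat /proc/self/uid_map
         0     296608      65536

As long as the host-side bases differ, the two containers' uid/gid ranges don't overlap.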

As I understand it, this operation only happens once, when the container is started. Am I correct? If so, isn’t VFS idmap more resource intensive while the container is running than simply running the ownership fix once?

Apparently, from what I saw on the other thread, the root mount point is indeed idmapped:

root@debian:~# cat /proc/self/uid_map
         0     231072      65536

root@debian:~# cat /proc/self/gid_map
         0     231072      65536

root@debian:~# cat /proc/self/mountinfo
490 460 0:24 /@rootfs/mnt/NVME1/lxd/containers/debian/rootfs / rw,relatime,idmapped shared:251 master:1 - btrfs /dev/sda1 rw,space_cache=v2,user_subvol_rm_allowed,subvolid=259,subvol=/@rootfs/mnt/NVME1/lxd/containers/debian

On the “host” machine:

root@vm-debian-12-cli:~# lxc info | grep 'shift\|idmap'
- container_protection_shift
- container_disk_shift
- storage_shifted
    idmapped_mounts: "true"
    shiftfs: "false"
    idmapped_mounts_v2: "true"

Either way, shouldn’t setting those make a container operate as if there weren’t any VFS idmap features on the kernel?

lxc config set debian security.idmap.isolated true
lxc config set debian security.protection.shift true

Thank you.

Okay, so your system is operating perfectly normally and with the lowest overhead possible right now, nothing to be worried about.

The old pre-start shifting method was very slow and very risky as a crash or failure to shift a particular bit of metadata (ACL, xattr, …) could allow for a security issue with the container. It was also horrible for CoW filesystems as it effectively made it look like every single file in the container had been modified, potentially duplicating GBs of data.
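
Roughly, that old approach amounted to walking the whole tree and rewriting every owner by a fixed offset, something like this shell sketch (illustrative only, not LXD's actual implementation; 1065536 is the example base from earlier in the thread):

# Illustrative only: rewrite every uid/gid by a fixed offset.
# Every inode gets touched, which is slow and defeats CoW sharing,
# and ACLs/xattrs holding uids/gids would need the same treatment.
find /mnt/NVME1/lxd/containers/debian/rootfs -print0 |
while IFS= read -r -d '' f; do
    uid=$(stat -c %u "$f"); gid=$(stat -c %g "$f")
    chown -h "$((uid + 1065536)):$((gid + 1065536))" "$f"
done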

shiftfs (which was an Ubuntu-specific hack) and now the proper VFS idmap shifting simply have the kernel apply the reverse uidmap/gidmap on any filesystem operation to a mount that’s marked as idmapped. It’s an extremely trivial operation to perform, allows for dynamic changes to the container maps (very useful for isolated), allows for sharing data between containers and properly supports everything that can hold a uid/gid (ioctl, xattr, acl, …), thus doing away with the risk of having missed something.

Okay, thank you for the detailed explanation. I’ll stick with VFS idmap shifting; since neither performance nor security is affected, I don’t have anything against it.

I took some time to explore the documentation one more time and this caught my attention:

Containers with security.idmap.isolated will have a unique ID range computed for them among the other containers with security.idmap.isolated set (if none is available, setting this key will simply fail).

If data sharing between containers isn’t needed, you can enable security.idmap.isolated (see Instance configuration), which will use non-overlapping uid/gid maps for each container, preventing potential DoS attacks on other containers.
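
For instance (a sketch; the container names and exact base values are illustrative), two containers launched with the key set should end up with non-overlapping bases:

# lxc launch images:debian/12 c1 -c security.idmap.isolated=true
# lxc launch images:debian/12 c2 -c security.idmap.isolated=true
# lxc config get c1 volatile.idmap.base
1065536
# lxc config get c2 volatile.idmap.base
1131072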

Considering I have a profile/container with those:

root@vm-debian-12-cli:~# lxc profile show default
config:
  limits.cpu: "2"
  limits.memory: 3GB
  security.idmap.isolated: "true"

root@vm-debian-12-cli:~# lxc config get debian security.idmap.isolated
true

It doesn’t seem to be working in my case, because once I create a user inside the container there’s an overlap with a host user:

root@debian:~# adduser tcb13container --no-create-home --disabled-password --gecos GECOS
Adding user `tcb13container' ...
Adding new group `tcb13container' (1000) ...
Adding new user `tcb13container' (1000) with group `tcb13container (1000)' ...
Not creating home directory `/home/tcb13container'.
Adding new user `tcb13container' to supplemental / extra groups `users' ...
Adding user `tcb13container' to group `users' ...
root@debian:~# touch test-tcb13container
root@debian:~# chown tcb13container: test-tcb13container
root@debian:~# ls -la
total 12
(...)
-rw-r--r-- 1 tcb13container tcb13container   0 Sep  7 17:01 test-tcb13container

It created the user tcb13container with the GID 1000. Now outside the container I see this:

root@vm-debian-12-cli:~# ls -la /mnt/NVME1/lxd/containers/debian/rootfs/root/
total 12
(...)
-rw-r--r-- 1 tcb13 tcb13   0 Sep  7 18:01 test-tcb13container

root@vm-debian-12-cli:~# id -g tcb13
1000

The file outside is owned by the host’s tcb13 user, which has GID 1000 on the host.

Is this the shifting taking place as well? Before this shifting, I believe users created inside the container would end up with much higher numbers than the volatile.idmap.base value set for the container.

Can it be configured to only use higher GIDs / avoid clashing with the host’s users? What happens security-wise if a user in the container manages to escape it? Could it execute things as the overlapping host user?

Thank you once again.

That is not what security.idmap.isolated does.

What security.idmap.isolated does is make it so any kernel resources owned by uid 1000 in container1 is mapped to a different kernel uid than kernel resources owned by uid 1000 in container2.

If you run a process as uid 1000 in container1 and another process as uid 1000 in container2 and then go look at ps fauxww from the host, you’ll see that they’re running as different real users.
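
For example (a sketch; the container names, PIDs and host-side uids are illustrative, each uid being that container's volatile.idmap.base plus 1000):

# lxc exec c1 -- setpriv --reuid=1000 --regid=1000 --clear-groups sleep 300 &
# lxc exec c2 -- setpriv --reuid=1000 --regid=1000 --clear-groups sleep 300 &
# ps -eo uid,pid,comm | grep sleep
1066536     4242 sleep
1132072     4243 sleep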

Yes, I’ve noticed that.

But what about file ownership? Isn’t it dangerous to have a user inside the container able to write files that are owned by a host user?

Thank you for the patience and the detailed info.

It’s only dangerous if the path could be accessed by a user on the host other than the root user, which we avoid by making the path impossible to traverse.
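
You can see that with the paths from earlier in this thread (a sketch; tcb13 is the unprivileged host user mentioned above, and the directory’s owner uid is the container’s idmap base):

# ls -ld /mnt/NVME1/lxd/containers/debian
d--x------ 1 231072 root 78 Sep  5 15:59 /mnt/NVME1/lxd/containers/debian
# su - tcb13 -c 'ls /mnt/NVME1/lxd/containers/debian/rootfs'
ls: cannot access '/mnt/NVME1/lxd/containers/debian/rootfs': Permission denied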

This setup was used in a CTF challenge. If you have root in a container, an ordinary user on the hypervisor host, the dir storage backend, and a shared folder accessible by that ordinary host user (created with “lxc config device add mycontainer backup disk source=/backup path=/backup”), you will have root on the host in a few seconds and two suid binaries, so this is worthy of a CVE.

Not CVE worthy, you’re effectively getting exactly what you asked for in this scenario.

Volumes that are managed by Incus itself are stored in a filesystem tree which is not traversable by unprivileged users on the host, specifically to avoid this kind of thing.

Alternatively you could have also made the path on the host be mounted as nosuid or noexec to avoid potential execution of suid binaries written by a container that you’re sharing this with.
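
For example (a sketch; /backup is the shared path from the scenario above, and the options could equally be set via /etc/fstab):

# mount --bind /backup /backup
# mount -o remount,bind,nosuid,noexec /backup
# lxc config device add mycontainer backup disk source=/backup path=/backup

With nosuid on that mount, a suid binary dropped into /backup by the container root can’t be used to elevate privileges on the host through that path.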

Well, maybe not CVE worthy, but it’s worth mentioning that this is the obvious default scenario of mounting a directory from the host into a container, thus introducing a possible vulnerability. And we all know that unsafe defaults end up in systems sooner or later. Maybe making these mounts “nosuid” by default (or something similar) should be considered.