ID mapping doesn't work for files: container uses the same UIDs/GIDs as the host

Hello,
when I create a new container, all files in that container use the exact same UIDs and GIDs as the host, even when I enable idmap isolation:

# incus storage list
+---------+--------+--------------------------------------+-------------+---------+---------+
|  NAME   | DRIVER |                SOURCE                | DESCRIPTION | USED BY |  STATE  |
+---------+--------+--------------------------------------+-------------+---------+---------+
| default | dir    | /var/lib/incus/storage-pools/default |             | 10      | CREATED |
+---------+--------+--------------------------------------+-------------+---------+---------+

# cat /etc/subuid
incus:65536:1000000000
root:65536:1000000000

# cat /etc/subgid
incus:65536:1000000000
root:65536:1000000000

# incus init images:debian/12 test-new
Creating test-new
Retrieving image: Unpack: 100% (1.12GB/s)

# incus config set test-new security.idmap.isolated=true

# incus start test-new

# incus exec test-new bash
root@test-new:~# touch foo
root@test-new:~# ls -l foo
-rw-r--r-- 1 root   root 0 Jun 11 22:16 foo
root@test-new:~# 
exit

# ls -l /var/lib/incus/storage-pools/default/containers/test-new/rootfs/root/
total 0
-rw-r--r-- 1 root   root 0 Jun 11 22:16 foo

# incus config show test-new
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Debian bookworm amd64 (20240611_05:24)
  image.os: Debian
  image.release: bookworm
  image.serial: "20240611_05:24"
  image.type: squashfs
  image.variant: default
  security.idmap.isolated: "true"
  volatile.base_image: d02788aeece968be5715583bc59c3d2931d69010b68dbdce531924d1721febe8
  volatile.cloud-init.instance-id: 6f4cf812-d741-4f53-87a5-98ab20328b8b
  volatile.eth0.host_name: vethca4089a2
  volatile.eth0.hwaddr: 00:16:3e:f1:bc:a7
  volatile.idmap.base: "262144"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":262144,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":262144,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":262144,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":262144,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 4a7357d0-d9b0-45c6-a7b7-216bbf6546ce
  volatile.uuid.generation: 4a7357d0-d9b0-45c6-a7b7-216bbf6546ce
devices: {}
ephemeral: false
profiles:
- default
stateful: false
description: ""

#

As you can see, the file foo belongs to root inside the container, but it is also owned by root on the host filesystem.
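
The idmap does appear to be applied to processes, just not to the files on disk. Given the volatile.idmap.base of 262144 shown above, I'd expect the uid map inside the container to look roughly like this (sketch, not verbatim output):

# incus exec test-new -- cat /proc/self/uid_map
         0     262144      65536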

Even stranger: for already existing containers, everything works as expected:

# incus exec nas bash
root@nas:~# touch foo
root@nas:~# ls -l foo
-rw-r--r-- 1 root root 0 Jun 12 07:31 foo
root@nas:~# 
exit

# ls -l /var/lib/incus/storage-pools/default/containers/nas/rootfs/root/foo  
-rw-r--r-- 1 131072 131072 0 Jun 12 09:31 /var/lib/incus/storage-pools/default/containers/nas/rootfs/root/foo

Here, the file foo doesn’t belong to root on the host fs, as expected.

This behaviour started on LXD after upgrading from Debian 11 to 12 and is still present now on Incus.

What am I doing wrong?

I think this is a bug in the dir driver, because if an unprivileged container can create files that are privileged on the host, isn’t that a security issue?

I’ve gotten a bit further in debugging why this happens:
The “Remapping container filesystem” code in driver_lxc.go is never executed. However, I was not able to figure out how to modify the config so that this code gets executed. It has nothing to do with my setup; I also tried it in a fresh Debian VM with the same results.
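
A way to check for this (a sketch; I’m assuming the code path emits a “Remapping container filesystem” message at debug level):

# incus monitor --type=logging --loglevel=debug --pretty &
# incus start test-new

If that code path ran, the message should appear in the output; here it never does.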

@stgraber Do you have an idea what I would need to change to get the filesystem remapping code executed? I wanted to try the security.shifted config option, but I cannot set it on any volumes:

# incus storage volume set default container/test-new security.shifted=true   
Error: Invalid option for volume "test-new" option "security.shifted"
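
As a cross-check, the option is accepted on a custom volume (hypothetical volume name), so it seems to exist for custom volumes only:

# incus storage volume create default testvol
# incus storage volume set default testvol security.shifted=true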

Many thanks in advance!

That’s perfectly normal. The “Remapping container filesystem” logic is a workaround for when kernel filesystems don’t support VFS idmapped mounts; it’s slow, and when it fails, it can leave the filesystem metadata in an unrecoverable state.

On modern systems with supported filesystems, you’re going to see things behave as you described.
The on-disk data is completely unshifted, and the shift happens within the kernel through the new mount API and the VFS idmap feature.
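
You can check that your kernel and Incus detect support for this through the server environment (approximate output):

# incus info | grep idmapped
    idmapped_mounts: "true"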

This eliminates all the races, guesswork and occasional mistakes of the manual shifting logic, and it also makes it possible to alter uid/gid maps without any other change being required.

We keep this safe by making it impossible for non-root users on the host to reach any of those files.
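
In practice that means an unprivileged host user can’t even traverse into the container’s storage path, so something like this (run as a regular user; exact directory modes may vary by setup) fails:

$ ls /var/lib/incus/storage-pools/default/containers/test-new/rootfs
ls: cannot open directory '/var/lib/incus/storage-pools/default/containers/test-new/rootfs': Permission denied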

We’re also investigating additional safety nets to put around this, whether BPF LSM rules or a trick that keeps all data noexec in the host namespace and clears the noexec flag only within the containers.
But those options are really meant as a second-layer safety net; we consider the current setup to be perfectly safe. It’s also how data has been stored for all shiftfs users under LXD for years.

Thank you very much for your explanation!

The concept behind no longer shifting the files seems secure, but if there is a bug somewhere and someone managed to break out of a container they are root in, they might get complete control of the host by altering files that belong to root on the host…

If I want to accept the risk of remapping the container fs, is there still a config option that would re-enable it?

Hmm, that’s incorrect; the actual user uid/gid is still shifted.

Your root user in the container breaking out will land on the host as uid 1000000 / gid 1000000, which will give them as much access to the host as the nobody user.

Actually, because the filesystem isn’t shifted, such a user breaking out will have less access than would have been the case before: now they can’t even access their own container’s filesystem, given that they are 1000000 / 1000000 and the filesystem they were on is owned by 0 / 0.

That effectively prevents them from accessing any of the files they could have written in the container prior to breaking out; instead they have to make do with only what a nobody user can access on the host.
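
You can also see the shift from the host side: any process running as root inside the container shows up under the shifted uid. With your config that base is 262144 rather than the 1000000 in my example, so roughly:

# incus exec test-new -- sleep 300 &
# ps -o uid=,cmd= -p "$(pgrep -fx 'sleep 300')"
 262144 sleep 300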

So if there were, e.g., a path traversal vulnerability that allowed a root user inside a container to write outside the container fs, the files would be created not as root on the host, but as uid 1000000 / gid 1000000?

That’s correct.

Does this now mean that with Ubuntu 24.04 I no longer need “raw.idmap” to provide write access from inside the container to the Incus host when I map a folder from inside the container to a folder on the Incus host?

Yep, that’s right: if you set shift=true on that disk entry, you won’t need the raw.idmap trick.
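
For example, with a hypothetical host folder /srv/share and a container named mycontainer:

# incus config device add mycontainer share disk source=/srv/share path=/mnt/share shift=true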

I removed the raw.idmap entry, stopped the container, removed the disk device, and added it back with shift=true. When I tried to restart, the container was stuck in:
[screenshot]

Does that mean that I should have also removed the subgid/subuid file entries?

The container never started.

It finally came back:

[screenshot]