Linux 6.1 (LTS) confirmed, but no fix for mknod yet

As we can see, overlayfs is not getting mounted on top of idmapped btrfs, which is also correct (for old kernel versions).

Okay, then how does your setup work at all? (-: I’ve read your old reports about collabora-online. Are you using Docker to deploy it? Could you check which Docker storage driver you are using? I suspect that your Docker uses the btrfs storage driver instead of overlayfs. That would explain how Docker with idmapped mounts works for you at all on such an old kernel version.

@amikhalitsyn I am indeed using Docker.

docker info
...
 Storage Driver: btrfs
...
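For anyone checking the same thing, here is a minimal sketch that pulls the storage driver (and, for overlay-style drivers, the backing filesystem) out of `docker info`. It assumes the plain-text layout shown above; the function names are made up for illustration. On recent Docker versions `docker info --format '{{.Driver}}'` should print the driver directly as well.

```shell
#!/bin/sh
# Sketch: read `docker info` text output on stdin and extract fields.
# Assumes the " Storage Driver: ..." / "  Backing Filesystem: ..." layout;
# storage_driver and backing_filesystem are hypothetical helper names.

storage_driver() {
    # Print the value after "Storage Driver:" and stop at the first match.
    awk -F': ' '/^[ \t]*Storage Driver:/ { print $2; exit }'
}

backing_filesystem() {
    # Print the value after "Backing Filesystem:"; prints nothing if absent.
    awk -F': ' '/^[ \t]*Backing Filesystem:/ { print $2; exit }'
}

# Usage on a machine with Docker installed:
#   docker info 2>/dev/null | storage_driver
```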

That explains everything 🙂

Okay, so we have a problem with mknod interception not only on overlayfs, but on btrfs too.


I’ve checked Docker with the btrfs storage driver plus mknod interception on 5.19 and 6.2. It works perfectly well.

So you need to describe your production configuration in detail and provide us with precise steps to reproduce the problem.

My current test setup was:

lxc launch ubuntu:22.04 idmap-test1 --storage btrfspool1
lxc config set idmap-test1 security.nesting=true
lxc config set idmap-test1 security.syscalls.intercept.mknod=true
lxc exec idmap-test1 bash
# inside the LXD container:
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
# edit /etc/docker/daemon.json
# {
#  "storage-driver": "btrfs"
# }
service docker restart
docker run -it --rm busybox
# inside the busybox container:
mount | grep idmap
mknod /root/null c 1 3
rm -f /root/null

So it’s an LXC container on btrfs storage, with a Docker container inside (using the btrfs storage driver). Interception works flawlessly.
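The daemon.json edit in these steps can also be scripted. A minimal sketch, assuming you run it as root inside the LXD container; the function name and the optional directory argument (handy for testing) are hypothetical, and it overwrites any existing daemon.json.

```shell
#!/bin/sh
# Sketch: pin Docker's storage driver by (over)writing daemon.json.
# set_storage_driver is a hypothetical name; Docker itself reads
# /etc/docker/daemon.json by default, which is the default directory here.

set_storage_driver() {
    driver=$1
    dir=${2:-/etc/docker}
    mkdir -p "$dir" || return 1
    # printf is used instead of `echo -e`, which plain /bin/sh may not support.
    printf '{\n\t"storage-driver": "%s"\n}\n' "$driver" > "$dir/daemon.json"
}

# Usage inside the LXD container, then restart Docker:
#   set_storage_driver btrfs && service docker restart
```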

Config
$ lxc config show idmap-test1 -e
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 22.04 LTS amd64 (release) (20230302)
  image.label: release
  image.os: ubuntu
  image.release: jammy
  image.serial: "20230302"
  image.type: squashfs
  image.version: "22.04"
  security.nesting: "true"
  security.syscalls.intercept.mknod: "true"
  volatile.base_image: 72565f3fbae414d317b90569b6d7aa308c482fdf562aaf0c2eaa6e50fa39747b
  volatile.cloud-init.instance-id: 5366658d-21ee-48b1-9013-b1c517411981
  volatile.eth0.host_name: veth60a8d5f1
  volatile.eth0.hwaddr: 00:16:3e:fb:02:23
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 49beb5b4-1f92-42fd-b2b3-5face2f3503d
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: btrfspool1
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
$ lxc storage show btrfspool1
config:
  size: 6GiB
  source: /var/snap/lxd/common/lxd/disks/btrfspool1.img
description: ""
name: btrfspool1
driver: btrfs
used_by:
- /1.0/images/72565f3fbae414d317b90569b6d7aa308c482fdf562aaf0c2eaa6e50fa39747b
- /1.0/instances/idmap-test1
status: Created
locations:
- none
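To read the idmap in the config above: each entry maps namespace IDs Nsid through Nsid+Maprange-1 to host IDs starting at Hostid, so container root (uid 0) is host uid 1000000 here. A small sketch of that arithmetic; the function name is hypothetical and it does not query LXD.

```shell
#!/bin/sh
# Sketch: translate a container (namespace) ID to a host ID using one
# volatile.idmap.current entry (Hostid, Nsid, Maprange).
# map_to_host is a hypothetical helper; it only does the arithmetic.

map_to_host() {
    id=$1 hostid=$2 nsid=$3 range=$4
    # The entry covers namespace IDs nsid .. nsid+range-1.
    if [ "$id" -lt "$nsid" ] || [ "$id" -ge $((nsid + range)) ]; then
        echo "unmapped" >&2
        return 1
    fi
    echo $((hostid + id - nsid))
}

# Usage with the values from idmap-test1:
#   map_to_host 0 1000000 0 1000000000
```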

Please try to reproduce the original issue, then simplify the reproducer to find the root cause.

@amikhalitsyn I can confirm that setting the storage driver to btrfs does indeed work! I also tried overlay, overlay2 and fuse-overlayfs, and none of those worked, so btrfs seems to be the only one that works.

But you’ve written:

I reported an issue with security.syscalls.intercept.mknod misbehaving/not functioning as intended with anything beyond 5.15 LTS some time ago and now I’ve tested it again with the new 6.1 LTS and it still doesn’t work, so I thought I might just mention it again

As far as I understand, something broke for you after the 5.15 LTS kernel, correct? But now you are writing that btrfs worked before and is still working now (with fresh kernels), correct? Then what’s the problem? Where is the kernel regression?

@amikhalitsyn
It turns out that after 5.19, unless one manually forces Docker to use btrfs with

/etc/docker/daemon.json
_____________________________________
{
        "storage-driver": "btrfs"
}

Docker defaults to overlay2, and then mknod interception does not seem to work. I did not know this until you mentioned it.
I am not sure whether this is a kernel regression, a Docker issue or something else; I am only certain that it occurs after 5.19. Before 5.19, Docker seems to select the btrfs storage driver instead of overlay2. I also tested this on ext4, where Docker opts for the vfs driver, and mknod works there too.

uname -a
#Linux archlinux 6.1.15-1-lts #1 SMP PREEMPT_DYNAMIC Fri, 03 Mar 2023 12:22:08 +0000 x86_64 GNU/Linux
truncate -s 10GiB btrfspool.img
losetup -f btrfspool.img
lxc storage create btrfspool btrfs source=/dev/loop0
lxc init images:archlinux docker-btrfs --storage=btrfspool
lxc config set docker-btrfs security.{nesting=true,syscalls.intercept.mknod=true}
lxc start docker-btrfs
lxc exec docker-btrfs -- su -l
pacman -S vim docker
mkdir -p /etc/docker
echo -e '{\n\t"storage-driver":"btrfs"\n}' > /etc/docker/daemon.json
systemctl enable --now docker.service
docker run -it --rm busybox
mknod /root/null c 1 3
exit
sed -i 's/btrfs/overlay2/' /etc/docker/daemon.json
systemctl restart docker.service
docker run -it --rm busybox
mknod /root/null c 1 3
# mknod: /root/null: Operation not permitted

Probably it’s because before 5.19 overlayfs failed to mount on top of an idmapped mount. If the container rootfs mount was idmapped, Docker used btrfs (or vfs) as a fallback storage driver. And yes, that explains why you get the vfs driver on ext4 but the btrfs driver on btrfs.
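The fallback described above can be written down as a toy decision function. This is a simplified model of what was observed in this thread (overlay2 when overlayfs mounts on the idmapped rootfs, otherwise btrfs on btrfs and vfs on ext4), not Docker’s actual driver-selection code; the name and arguments are hypothetical.

```shell
#!/bin/sh
# Toy model of the observed behaviour, NOT Docker's real selection logic.
# guess_driver is hypothetical: $1 = backing filesystem of the rootfs,
# $2 = "yes" if overlayfs can be mounted on the idmapped rootfs.

guess_driver() {
    backing_fs=$1
    overlay_ok=$2
    if [ "$overlay_ok" = yes ]; then
        echo overlay2      # 5.19+: overlayfs works on idmapped mounts
    elif [ "$backing_fs" = btrfs ]; then
        echo btrfs         # pre-5.19 fallback on a btrfs rootfs
    else
        echo vfs           # pre-5.19 fallback elsewhere, e.g. ext4
    fi
}
```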

# mknod: /root/null: Operation not permitted

That’s weird, because mknod interception on overlayfs shouldn’t fail with an error; it just falls back to bind-mounting the device node from the host. That is bad, but not as bad as an EPERM ("Operation not permitted") error. Did you really run this command listing, and can you confirm that the EPERM is reproducible?

@amikhalitsyn affirmative.

uname -a
Linux 44b0b877ba2f 6.1.15-1-lts #1 SMP PREEMPT_DYNAMIC Fri, 03 Mar 2023 12:22:08 +0000 x86_64 GNU/Linux
mount | grep idmap
/dev/disk/by-uuid/d8183cdb-d608-4130-888e-87b91f7e0d68 on / type btrfs (rw,relatime,idmapped,space_cache=v2,user_subvol_rm_allowed,subvolid=295,subvol=/containers/docker-overlayfs)
docker info 
...
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 2
 Server Version: 23.0.1
 Storage Driver: overlay2
  Backing Filesystem: btrfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: false
  userxattr: true
...
docker run -it --rm busybox
mknod /root/null c 1 3
mknod: /root/null: Operation not permitted
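To confirm the rootfs is idmapped without eyeballing `mount | grep idmap`, here is a sketch that parses `mount` output for a given target and looks for the `idmapped` option. Hypothetical helper; it assumes the `SRC on TARGET type FS (OPTS)` format shown above and does not handle paths containing spaces.

```shell
#!/bin/sh
# Sketch: succeed if the mount at $1 carries the "idmapped" option.
# Reads `mount` output on stdin; is_idmapped is a hypothetical name.

is_idmapped() {
    awk -v mp="$1" '
        $2 == "on" && $3 == mp && $4 == "type" {
            # Options are the last field, e.g. "(rw,relatime,idmapped,...)".
            opts = $NF
            gsub(/[()]/, "", opts)
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "idmapped") found = 1
        }
        END { if (found) exit 0; exit 1 }
    '
}

# Usage inside the LXD container:
#   mount | is_idmapped / && echo "rootfs is idmapped"
```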

Interesting. But what happens if you run our old test:

cd /root
mount | grep idmap
mkdir {work,upper,lower,ovl}
mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work ovl
mknod /root/ovl/null c 1 3
stat /root/ovl/null
mount | grep null

inside the same container where you’ve done the Docker experiment?
I just want to sort out and classify all the problems, so we can analyze them internally and decide on importance/priorities.

@amikhalitsyn Inside the Docker container (docker run -it --rm busybox) or in the LXD container that nests Docker?
Inside the LXD container (works fine):

/dev/disk/by-uuid/d8183cdb-d608-4130-888e-87b91f7e0d68 on / type btrfs (rw,relatime,idmapped,space_cache=v2,user_subvol_rm_allowed,subvolid=295,subvol=/containers/docker-overlayfs)
  File: /root/ovl/null
  Size: 0               Blocks: 0          IO Block: 4096   character special file
Device: 8,2     Inode: 767236      Links: 0     Device type: 1,3
Access: (0666/crw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-03-08 15:06:45.844343223 +0000
Modify: 2023-03-08 15:06:45.844343223 +0000
Change: 2023-03-08 15:06:45.864343975 +0000
 Birth: 2023-03-08 15:06:45.844343223 +0000
dev on /dev/null type devtmpfs (rw,nosuid,relatime,size=8169348k,nr_inodes=2042337,mode=755,inode64)
/dev/sda2 on /root/ovl/null type ext4 (rw,relatime)

You got it right: inside the LXC container.

Yep, and as you can see from the result, mknod works, but (!) it creates a bind mount at /root/ovl/null instead of a device node (compare with your previous experiments on btrfs).
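To tell the two outcomes apart without reading the whole mount table, here is a sketch that first checks whether a path is a mount point (the bind-mount fallback) and only then whether it is a character device (a real node created by mknod). The order matters because a bind-mounted device node is also a character device. The helper name is hypothetical; it reads /proc/self/mountinfo, expects an absolute path, and does not handle mount points containing spaces.

```shell
#!/bin/sh
# Sketch: classify a path as bind-mount, char-device, other or missing.
# node_kind is hypothetical; field 5 of /proc/self/mountinfo is the
# mount point (escaped paths, e.g. with spaces, will not match).

node_kind() {
    path=$1
    # Check for a mount first: a bind-mounted /dev/null is also a char device.
    if awk -v p="$path" '$5 == p { found = 1 } END { if (found) exit 0; exit 1 }' \
        /proc/self/mountinfo 2>/dev/null; then
        echo "bind-mount"
    elif [ -c "$path" ]; then
        echo "char-device"
    elif [ -e "$path" ]; then
        echo "other"
    else
        echo "missing"
    fi
}

# Usage inside the LXD container after the overlay test:
#   node_kind /root/ovl/null
```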


Has there been any movement on this? It’s been a pretty large issue within my org, preventing software from working correctly. I am using an unprivileged LXC container running Docker, and the overlay2/overlayfs drivers are still not working. Reformatting all of my servers to use btrfs is not an option.

Docker version 24.0.2, build cb74dfc
lxc version 5.0.2
Linux 5.15.108-1-pve #1 SMP PVE 5.15.108-1 (2023-06-17T09:41Z) x86_64 GNU/Linux
