Linux 6.1 (LTS) confirmed, but no fix for mknod yet

Great, same behavior for me: mknod interception works, but through the fallback mechanism.

@k4my4b couldn’t you perform the same test as you did before, but on the system where mknod interception works well for you? I mean the 5.15 kernel (or 5.17?). As far as I can see, this problem with interception is present on the 5.15 kernel too.

@amikhalitsyn mount -t overlay fails this time.

cat /etc/os-release
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://bugs.archlinux.org/"
PRIVACY_POLICY_URL="https://terms.archlinux.org/docs/privacy-policy/"
LOGO=archlinux-logo
uname -a
Linux arch-ct 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000 x86_64 GNU/Linux
mount | grep idmap
/dev/sda2 on / type ext4 (rw,relatime,idmapped)
mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work ovl
mount: /root/ovl: wrong fs type, bad option, bad superblock on overlay, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.
stat /root/ovl/null
  File: /root/ovl/null
  Size: 0         Blocks: 0          IO Block: 4096   character special file
Device: 8,2     Inode: 1291311     Links: 1     Device type: 1,3
Access: (0644/crw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-03-04 12:26:07.374439142 +0000
Modify: 2023-03-04 12:26:07.374439142 +0000
Change: 2023-03-04 12:26:07.374439142 +0000
 Birth: 2023-03-04 12:26:07.374439142 +0000
mount | grep null
dev on /dev/null type devtmpfs (rw,nosuid,relatime,size=8170840k,nr_inodes=2042710,mode=755,inode64)

You can check dmesg for errors after the mount fails.

@amikhalitsyn

[  164.809290] overlayfs: idmapped layers are currently not supported
Linux archlinux 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000 x86_64 GNU/Linux

Yep, that’s correct behavior for old kernel versions. On your production environment with an old kernel you are probably not using idmapped mounts. But the question is: how does all of this work for you now? You started this topic with the statement that something got broken on newer kernel versions, but AFAIU you already have a working setup on older versions. What I want is to understand your setup on those “old” versions: your idmapping setup, how mknod interception works for you, and so on. Could you describe all of this in detail? If I have a minimal reproducer that works on your old kernel and doesn’t work on newer versions, then I’ll be able to fix it; otherwise it may take too much time and effort…
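Concretely, something like the following from the working production host would already be a good start (these are just the standard commands used elsewhere in this thread; substitute your own container name):

uname -a
cat /etc/os-release
cat /proc/cmdline
cat /etc/sub{uid,gid}
lxc config show -e <container-name>
mount | grep idmap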

@amikhalitsyn
I’m running the exact same kernel, 5.15.94-1-lts, on my production machine. The main difference is that my production machine uses btrfs instead of ext4.
Here’s everything from my production machine:

uname -a
Linux lxd 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000 x86_64 GNU/Linux
cat /etc/os-release
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://bugs.archlinux.org/"
PRIVACY_POLICY_URL="https://terms.archlinux.org/docs/privacy-policy/"
LOGO=archlinux-logo
cat /proc/cmdline
 lsm=landlock,lockdown,yama,integrity,apparmor,bpf  root=PARTUUID=cb40f3e4-6d53-4804-af33-ce12c85517a4 rootflags=subvol=@ rootfstype=btrfs rw  ipv6.disable_ipv6=1  intel_pstate=no_hwp  intel_iommu=on iommu=pt  loglevel=3 rd.systemd.show_status=auto rd.udev.log_level=3
cat /etc/sub{uid,gid}
root:1000000:1000000000
root:1000000:1000000000
cat /etc/sysctl.d/*
fs.aio-max-nr = 524288
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576
kernel.dmesg_restrict = 1
kernel.keys.maxbytes = 2000000
kernel.keys.maxkeys = 2000
net.core.netdev_max_backlog = 182757
net.ipv4.neigh.default.gc_thresh3 = 8192
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.neigh.default.gc_thresh3 = 8192
kernel.unprivileged_userns_clone=1
vm.max_map_count = 262144
cat /etc/security/limits.conf
# /etc/security/limits.conf
#
#This file sets the resource limits for the users logged in via PAM.
#It does not affect resource limits of the system services.
#
#Also note that configuration files in /etc/security/limits.d directory,
#which are read in alphabetical order, override the settings in this
#file in case the domain is the same or more specific.
#That means, for example, that setting a limit for wildcard domain here
#can be overridden with a wildcard setting in a config file in the
#subdirectory, but a user specific setting here can be overridden only
#with a user specific setting in the subdirectory.
#
#Each line describes a limit for a user in the form:
#
#<domain>        <type>  <item>  <value>
#
#Where:
#<domain> can be:
#        - a user name
#        - a group name, with @group syntax
#        - the wildcard *, for default entry
#        - the wildcard %, can be also used with %group syntax,
#                 for maxlogin limit
#
#<type> can have the two values:
#        - "soft" for enforcing the soft limits
#        - "hard" for enforcing hard limits
#
#<item> can be one of the following:
#        - core - limits the core file size (KB)
#        - data - max data size (KB)
#        - fsize - maximum filesize (KB)
#        - memlock - max locked-in-memory address space (KB)
#        - nofile - max number of open file descriptors
#        - rss - max resident set size (KB)
#        - stack - max stack size (KB)
#        - cpu - max CPU time (MIN)
#        - nproc - max number of processes
#        - as - address space limit (KB)
#        - maxlogins - max number of logins for this user
#        - maxsyslogins - max number of logins on the system
#        - priority - the priority to run user process with
#        - locks - max number of file locks the user can hold
#        - sigpending - max number of pending signals
#        - msgqueue - max memory used by POSIX message queues (bytes)
#        - nice - max nice priority allowed to raise to values: [-20, 19]
#        - rtprio - max realtime priority
#
#<domain>      <type>  <item>         <value>
#

#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4

# LXD recommendation
# Maximum number of open files and Maximum locked-in-memory address space (KB)
*       soft    nofile  1048576
*       hard    nofile  1048576
root    soft    nofile  1048576
root    hard    nofile  1048576
*       soft    memlock unlimited
*       hard    memlock unlimited
root    soft    memlock unlimited
root    hard    memlock unlimited

# Arch wiki recommendation
# You should disallow everyone except for root from having processes of
# minimal niceness (-20), so that root can fix an unresponsive system.
*       hard    nice    -19
root    hard    nice    -20

# End of file
lxc config show -e arch-ct
architecture: x86_64
config:
  boot.autostart: "true"
  image.architecture: amd64
  image.description: Archlinux current amd64 (20230304_04:18)
  image.os: Archlinux
  image.release: current
  image.requirements.secureboot: "false"
  image.serial: "20230304_04:18"
  image.type: squashfs
  image.variant: default
  security.idmap.isolated: "true"
  security.idmap.size: "2000000"
  security.nesting: "true"
  security.privileged: "false"
  security.secureboot: "false"
  security.syscalls.intercept.mknod: "true"
  volatile.base_image: f41991a6c61c46505053fe0adc8948ca6fe3a2a3b9414178905c1ef0a58b630c
  volatile.cloud-init.instance-id: b0d5d126-876a-487e-b897-5dd717747587
  volatile.eth0.host_name: veth7185bb12
  volatile.eth0.hwaddr: 00:16:3e:84:34:07
  volatile.idmap.base: "29065536"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":29065536,"Nsid":0,"Maprange":2000000},{"Isuid":false,"Isgid":true,"Hostid":29065536,"Nsid":0,"Maprange":2000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":29065536,"Nsid":0,"Maprange":2000000},{"Isuid":false,"Isgid":true,"Hostid":29065536,"Nsid":0,"Maprange":2000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.uuid: db47167b-bf39-44fd-a807-ed609dd1d612
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
mount | grep idmap
/dev/sda2 on / type btrfs (rw,relatime,idmapped,ssd,space_cache=v2,user_subvol_rm_allowed,subvolid=41488,subvol=/@/var/lib/lxd/storage-pools/default/containers/arch-ct)
mkdir {work,upper,lower,ovl}
mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work ovl
mount: /root/ovl: wrong fs type, bad option, bad superblock on overlay, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.
[1443732.798184] overlayfs: idmapped layers are currently not supported

As we can see, overlayfs does not get mounted on top of the idmapped btrfs mount, which is also correct (for old kernel versions).

Okay, then how does your setup work at all? (-: I’ve read your old reports about collabora-online. Are you using Docker to deploy it? Could you check which Docker storage driver you are using? My guess is that your Docker uses the btrfs storage driver instead of overlayfs. That would explain how Docker with idmapped mounts works for you at all on such an old kernel version.
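(A quick way to check, using nothing beyond the standard docker info output:)

docker info | grep 'Storage Driver'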

@amikhalitsyn I am indeed using Docker:

docker info
...
 Storage Driver: btrfs
...

That explains everything :slight_smile:

Okay, so we have a problem with mknod interception not only on overlayfs, but on btrfs too.

I’ve checked the case of the btrfs storage driver in Docker + mknod interception on 5.19 and 6.2. It works perfectly well.

So you need to describe your production configuration in detail and provide us with precise steps to reproduce the problem.

My current test setup was:

lxc launch ubuntu:22.04 idmap-test1 --storage btrfspool1
lxc config set idmap-test1 security.nesting=true
lxc config set idmap-test1 security.syscalls.intercept.mknod=true
lxc exec idmap-test1 bash
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
# edit /etc/docker/daemon.json
# {
#  "storage-driver": "btrfs"
# }
service docker restart
docker run -it --rm busybox
mount | grep idmap
mknod /root/null c 1 3
rm -f /root/null

So, it’s an LXC container on btrfs storage, with a Docker container inside (using the btrfs storage driver). Interception works flawlessly.

Config
$ lxc config show idmap-test1 -e
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 22.04 LTS amd64 (release) (20230302)
  image.label: release
  image.os: ubuntu
  image.release: jammy
  image.serial: "20230302"
  image.type: squashfs
  image.version: "22.04"
  security.nesting: "true"
  security.syscalls.intercept.mknod: "true"
  volatile.base_image: 72565f3fbae414d317b90569b6d7aa308c482fdf562aaf0c2eaa6e50fa39747b
  volatile.cloud-init.instance-id: 5366658d-21ee-48b1-9013-b1c517411981
  volatile.eth0.host_name: veth60a8d5f1
  volatile.eth0.hwaddr: 00:16:3e:fb:02:23
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 49beb5b4-1f92-42fd-b2b3-5face2f3503d
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: btrfspool1
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
$ lxc storage show btrfspool1
config:
  size: 6GiB
  source: /var/snap/lxd/common/lxd/disks/btrfspool1.img
description: ""
name: btrfspool1
driver: btrfs
used_by:
- /1.0/images/72565f3fbae414d317b90569b6d7aa308c482fdf562aaf0c2eaa6e50fa39747b
- /1.0/instances/idmap-test1
status: Created
locations:
- none

Please try to reproduce the original issue and then simplify the reproducer to find the root cause.

@amikhalitsyn I can confirm that setting the storage driver to btrfs does indeed work! I also tried overlay, overlay2 and fuse-overlayfs, and none of those worked, so it seems btrfs is the only one that works.

but you’ve written:

I reported an issue with security.syscalls.intercept.mknod misbehaving/not functioning as intended with anything beyond 5.15 LTS some time ago and now I’ve tested it again with the new 6.1 LTS and it still doesn’t work, so I thought I might just mention it again

As far as I understand, something was broken for you after the 5.15 LTS kernel, correct? But now you are writing that btrfs worked before and is working now (with fresh kernels), correct? Then what’s the problem? Where is the kernel regression?

@amikhalitsyn
It turns out that, unless one manually forces Docker to use btrfs via

/etc/docker/daemon.json:
{
        "storage-driver": "btrfs"
}

Docker defaults to overlay2 after 5.19, and mknod interception does not seem to work then. I did not know this until you mentioned it just now.
I am not sure whether this is a kernel regression, a Docker issue, or something else. I am only certain that it occurs after 5.19; prior to that, Docker selects the btrfs storage driver instead of overlay2. I also tested this with ext4, in which case Docker opts for the vfs driver, and mknod works there too.

uname -a
#Linux archlinux 6.1.15-1-lts #1 SMP PREEMPT_DYNAMIC Fri, 03 Mar 2023 12:22:08 +0000 x86_64 GNU/Linux
truncate -s 10GiB btrfspool.img
losetup -f btrfspool.img
lxc storage create btrfspool btrfs source=/dev/loop0
lxc init images:archlinux docker-btrfs --storage=btrfspool
lxc config set docker-btrfs security.{nesting=true,syscalls.intercept.mknod=true}
lxc start docker-btrfs
lxc exec docker-btrfs -- su -l
pacman -S vim docker
mkdir -p /etc/docker
echo -e '{\n\t"storage-driver":"btrfs"\n}' > /etc/docker/daemon.json
systemctl enable --now docker.service
docker run -it --rm busybox
mknod /root/null c 1 3
exit
sed -i 's/btrfs/overlay2/' /etc/docker/daemon.json
systemctl restart docker.service
docker run -it --rm busybox
mknod /root/null c 1 3
# mknod: /root/null: Operation not permitted

Probably it’s because before 5.19 overlayfs failed to mount on top of an idmapped mount, and if the container rootfs mount was idmapped, Docker used btrfs (or vfs) as a fallback storage driver. And yes, that explains why you get the vfs driver on ext4 but the btrfs driver on btrfs.

# mknod: /root/null: Operation not permitted

That’s weird, because mknod interception on overlayfs shouldn’t result in an EPERM (“Operation not permitted”) error; it just falls back to bind-mounting a device node from the host. That is bad, but not as bad as EPERM. Was this command listing really executed by you, and can you confirm that the EPERM is reproducible?
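(For what it’s worth, a quick sketch of how to tell the two cases apart, reusing the paths from the earlier overlayfs test; %F and %t,%T are GNU stat format specifiers for the file type and the major,minor numbers:)

stat -c '%F %t,%T' /root/ovl/null      # a real node prints "character special file 1,3"
grep ovl/null /proc/self/mountinfo     # non-empty output means it is a bind-mounted host device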

@amikhalitsyn affirmative.

uname -a
Linux 44b0b877ba2f 6.1.15-1-lts #1 SMP PREEMPT_DYNAMIC Fri, 03 Mar 2023 12:22:08 +0000 x86_64 GNU/Linux
mount | grep idmap
/dev/disk/by-uuid/d8183cdb-d608-4130-888e-87b91f7e0d68 on / type btrfs (rw,relatime,idmapped,space_cache=v2,user_subvol_rm_allowed,subvolid=295,subvol=/containers/docker-overlayfs)
docker info 
...
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 2
 Server Version: 23.0.1
 Storage Driver: overlay2
  Backing Filesystem: btrfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: false
  userxattr: true
...
docker run -it --rm busybox
mknod /root/null c 1 3
mknod: /root/null: Operation not permitted

Interesting, but if you do our old test:

mount | grep idmap
mkdir {work,upper,lower,ovl}
mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work ovl
mknod /root/ovl/null c 1 3
stat /root/ovl/null
mount | grep null

inside the same container where you did this experiment with Docker, what happens?
I just want to sort out and classify all the problems, so we can analyze this internally and decide on importance/priorities.

@amikhalitsyn inside the Docker container (docker run -it --rm busybox) or the LXD container nesting Docker?
Inside the LXD container (works fine):

/dev/disk/by-uuid/d8183cdb-d608-4130-888e-87b91f7e0d68 on / type btrfs (rw,relatime,idmapped,space_cache=v2,user_subvol_rm_allowed,subvolid=295,subvol=/containers/docker-overlayfs)
  File: /root/ovl/null
  Size: 0               Blocks: 0          IO Block: 4096   character special file
Device: 8,2     Inode: 767236      Links: 0     Device type: 1,3
Access: (0666/crw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-03-08 15:06:45.844343223 +0000
Modify: 2023-03-08 15:06:45.844343223 +0000
Change: 2023-03-08 15:06:45.864343975 +0000
 Birth: 2023-03-08 15:06:45.844343223 +0000
dev on /dev/null type devtmpfs (rw,nosuid,relatime,size=8169348k,nr_inodes=2042337,mode=755,inode64)
/dev/sda2 on /root/ovl/null type ext4 (rw,relatime)

You’ve got it right, inside the LXC container.

Yep, and as you can see from the result, mknod is working, but (!) it creates a bind mount in place of /root/ovl/null rather than an actual device node (compare with your previous experiments on btrfs).

Has there been any movement on this? It’s been a pretty large issue within my org, preventing software from working correctly. I am using an unprivileged LXC container running Docker, and the overlay2/overlayfs drivers are still not working. Reformatting all of my servers to use btrfs is not an option.

Docker version 24.0.2, build cb74dfc
lxc version 5.0.2
Linux 5.15.108-1-pve #1 SMP PVE 5.15.108-1 (2023-06-17T09:41Z) x86_64 GNU/Linux
