Great, I see the same behavior. mknod interception works, but only through the fallback mechanism.
@k4my4b could you perform the same test as before, but on the system where mknod interception works well for you? I mean the 5.15 kernel (or 5.17?). As far as I can see, this interception problem is present on the 5.15 kernel too.
@amikhalitsyn mount -t overlay fails this time.
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://bugs.archlinux.org/"
PRIVACY_POLICY_URL="https://terms.archlinux.org/docs/privacy-policy/"
LOGO=archlinux-logo
Linux arch-ct 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000 x86_64 GNU/Linux
/dev/sda2 on / type ext4 (rw,relatime,idmapped)
mount: /root/ovl: wrong fs type, bad option, bad superblock on overlay, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
File: /root/ovl/null
Size: 0 Blocks: 0 IO Block: 4096 character special file
Device: 8,2 Inode: 1291311 Links: 1 Device type: 1,3
Access: (0644/crw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-03-04 12:26:07.374439142 +0000
Modify: 2023-03-04 12:26:07.374439142 +0000
Change: 2023-03-04 12:26:07.374439142 +0000
Birth: 2023-03-04 12:26:07.374439142 +0000
dev on /dev/null type devtmpfs (rw,nosuid,relatime,size=8170840k,nr_inodes=2042710,mode=755,inode64)
You can check dmesg for errors after the mount fails.
[ 164.809290] overlayfs: idmapped layers are currently not supported
Linux archlinux 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000 x86_64 GNU/Linux
Yep, that’s the correct behavior for old kernel versions. Probably you are not using idmapped mounts in your production environment with the old kernel. But the question is, how does all of this work for you now? You started this topic by saying that something broke on newer kernel versions, but AFAIU you already have a working setup on older versions. What I want is to understand your setup on these “old” versions: your idmapping setup, how mknod interception works for you, and so on. Could you describe all of this in detail? If I have a minimal reproducer that works on your old kernel and doesn’t work on newer versions, then I’ll be able to fix it; otherwise it may take too much time and effort…
@amikhalitsyn
I’m running the same exact kernel, 5.15.94-1-lts, on my production machine. The only main difference is that on my production machine I’m using BTRFS as opposed to EXT4.
Here’s everything from my production machine:
uname -a
Linux lxd 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000 x86_64 GNU/Linux
cat /etc/os-release
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://bugs.archlinux.org/"
PRIVACY_POLICY_URL="https://terms.archlinux.org/docs/privacy-policy/"
LOGO=archlinux-logo
cat /proc/cmdline
lsm=landlock,lockdown,yama,integrity,apparmor,bpf root=PARTUUID=cb40f3e4-6d53-4804-af33-ce12c85517a4 rootflags=subvol=@ rootfstype=btrfs rw ipv6.disable_ipv6=1 intel_pstate=no_hwp intel_iommu=on iommu=pt loglevel=3 rd.systemd.show_status=auto rd.udev.log_level=3
cat /etc/sub{uid,gid}
root:1000000:1000000000
root:1000000:1000000000
cat /etc/sysctl.d/*
fs.aio-max-nr = 524288
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576
kernel.dmesg_restrict = 1
kernel.keys.maxbytes = 2000000
kernel.keys.maxkeys = 2000
net.core.netdev_max_backlog = 182757
net.ipv4.neigh.default.gc_thresh3 = 8192
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.neigh.default.gc_thresh3 = 8192
kernel.unprivileged_userns_clone=1
vm.max_map_count = 262144
cat /etc/security/limits.conf
# /etc/security/limits.conf
#
#This file sets the resource limits for the users logged in via PAM.
#It does not affect resource limits of the system services.
#
#Also note that configuration files in /etc/security/limits.d directory,
#which are read in alphabetical order, override the settings in this
#file in case the domain is the same or more specific.
#That means, for example, that setting a limit for wildcard domain here
#can be overridden with a wildcard setting in a config file in the
#subdirectory, but a user specific setting here can be overridden only
#with a user specific setting in the subdirectory.
#
#Each line describes a limit for a user in the form:
#
#<domain> <type> <item> <value>
#
#Where:
#<domain> can be:
# - a user name
# - a group name, with @group syntax
# - the wildcard *, for default entry
# - the wildcard %, can be also used with %group syntax,
# for maxlogin limit
#
#<type> can have the two values:
# - "soft" for enforcing the soft limits
# - "hard" for enforcing hard limits
#
#<item> can be one of the following:
# - core - limits the core file size (KB)
# - data - max data size (KB)
# - fsize - maximum filesize (KB)
# - memlock - max locked-in-memory address space (KB)
# - nofile - max number of open file descriptors
# - rss - max resident set size (KB)
# - stack - max stack size (KB)
# - cpu - max CPU time (MIN)
# - nproc - max number of processes
# - as - address space limit (KB)
# - maxlogins - max number of logins for this user
# - maxsyslogins - max number of logins on the system
# - priority - the priority to run user process with
# - locks - max number of file locks the user can hold
# - sigpending - max number of pending signals
# - msgqueue - max memory used by POSIX message queues (bytes)
# - nice - max nice priority allowed to raise to values: [-20, 19]
# - rtprio - max realtime priority
#
#<domain> <type> <item> <value>
#
#* soft core 0
#* hard rss 10000
#@student hard nproc 20
#@faculty soft nproc 20
#@faculty hard nproc 50
#ftp hard nproc 0
#@student - maxlogins 4
# LXD recommendation
# Maximum number of open files and Maximum locked-in-memory address space (KB)
* soft nofile 1048576
* hard nofile 1048576
root soft nofile 1048576
root hard nofile 1048576
* soft memlock unlimited
* hard memlock unlimited
root soft memlock unlimited
root hard memlock unlimited
# Arch wiki recommendation
# You should disallow everyone except for root from having processes of
# minimal niceness (-20), so that root can fix an unresponsive system.
* hard nice -19
root hard nice -20
# End of file
lxc config show -e arch-ct
architecture: x86_64
config:
boot.autostart: "true"
image.architecture: amd64
image.description: Archlinux current amd64 (20230304_04:18)
image.os: Archlinux
image.release: current
image.requirements.secureboot: "false"
image.serial: "20230304_04:18"
image.type: squashfs
image.variant: default
security.idmap.isolated: "true"
security.idmap.size: "2000000"
security.nesting: "true"
security.privileged: "false"
security.secureboot: "false"
security.syscalls.intercept.mknod: "true"
volatile.base_image: f41991a6c61c46505053fe0adc8948ca6fe3a2a3b9414178905c1ef0a58b630c
volatile.cloud-init.instance-id: b0d5d126-876a-487e-b897-5dd717747587
volatile.eth0.host_name: veth7185bb12
volatile.eth0.hwaddr: 00:16:3e:84:34:07
volatile.idmap.base: "29065536"
volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":29065536,"Nsid":0,"Maprange":2000000},{"Isuid":false,"Isgid":true,"Hostid":29065536,"Nsid":0,"Maprange":2000000}]'
volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":29065536,"Nsid":0,"Maprange":2000000},{"Isuid":false,"Isgid":true,"Hostid":29065536,"Nsid":0,"Maprange":2000000}]'
volatile.last_state.idmap: '[]'
volatile.last_state.power: RUNNING
volatile.uuid: db47167b-bf39-44fd-a807-ed609dd1d612
devices:
eth0:
name: eth0
nictype: bridged
parent: br0
type: nic
root:
path: /
pool: default
type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
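For readers decoding the volatile.idmap.current value above: each entry maps a contiguous range of container IDs starting at Nsid onto host IDs starting at Hostid. A minimal sketch of that arithmetic, assuming the usual user-namespace offset mapping; map_to_host_id is a hypothetical helper name, not an LXD command:

```shell
#!/bin/sh
# Hypothetical helper (not part of LXD): translate a container UID/GID to the
# host ID for one idmap entry.
# $1: ID inside the container, $2: Nsid, $3: Hostid, $4: Maprange
map_to_host_id() {
    id=$1 nsid=$2 hostid=$3 range=$4
    if [ "$id" -ge "$nsid" ] && [ "$id" -lt $((nsid + range)) ]; then
        echo $((hostid + id - nsid))
    else
        # The ID falls outside the mapped range.
        echo unmapped
        return 1
    fi
}

# Values taken from the arch-ct config shown above.
map_to_host_id 0 0 29065536 2000000
map_to_host_id 1000 0 29065536 2000000
```

With these values, root inside the container corresponds to host UID 29065536, and any ID at or above 2000000 has no mapping.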
mount | grep idmap
/dev/sda2 on / type btrfs (rw,relatime,idmapped,ssd,space_cache=v2,user_subvol_rm_allowed,subvolid=41488,subvol=/@/var/lib/lxd/storage-pools/default/containers/arch-ct)
mkdir {work,upper,lower,ovl}
mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work ovl
mount: /root/ovl: wrong fs type, bad option, bad superblock on overlay, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
[1443732.798184] overlayfs: idmapped layers are currently not supported
As we can see, overlayfs does not mount on top of idmapped btrfs, which is also correct (for old kernel versions).
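To tell this specific failure apart from other overlay mount errors, one can look for the kernel message quoted above. A minimal sketch, assuming the message text stays exactly as shown in this thread; overlay_idmap_rejected is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and succeeds if an
# overlay mount was rejected because a layer sits on an idmapped mount.
overlay_idmap_rejected() {
    grep -q 'overlayfs: idmapped layers are currently not supported'
}

# Example using the log line quoted in this thread.
if printf '%s\n' '[ 164.809290] overlayfs: idmapped layers are currently not supported' \
        | overlay_idmap_rejected; then
    echo "overlayfs refused idmapped layers"
else
    echo "no idmapped-layer rejection found"
fi
```

On a live system the input would come from dmesg, e.g. dmesg | overlay_idmap_rejected. With kernel.dmesg_restrict = 1, as in the sysctl settings above, reading dmesg requires root.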
Okay, then how does your setup work at all? (-: I’ve read your old reports about collabora-online. Are you using Docker to deploy it? Could you check which Docker storage driver you are using? I assume your Docker uses the btrfs storage driver instead of overlayfs. That would explain how Docker with idmapped mounts works for you at all on such an old kernel version.
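One way to answer the storage-driver question is to read the "Storage Driver" line of docker info. A small sketch, assuming the plain-text layout of docker info; storage_driver_from_info is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first "Storage Driver:" line
# found in `docker info` text read from stdin.
storage_driver_from_info() {
    sed -n 's/^[[:space:]]*Storage Driver:[[:space:]]*//p' | head -n 1
}

# Example with a shortened, illustrative `docker info` excerpt.
printf '%s\n' \
    'Server:' \
    ' Server Version: 23.0.1' \
    ' Storage Driver: btrfs' \
    | storage_driver_from_info
```

On a live system this would be used as docker info | storage_driver_from_info; docker info --format '{{.Driver}}' should print the same value directly.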
that explains everything
Okay, so we have a problem with mknod interception not only on overlayfs, but on btrfs too.
I’ve checked the case of btrfs storage driver in Docker + mknod interception on 5.19 and 6.2. It works perfectly well.
So you need to describe your production configuration in detail and provide us with precise steps to reproduce the problem.
My current test setup was:
lxc launch ubuntu:22.04 idmap-test1 --storage btrfspool1
lxc config set idmap-test1 security.nesting=true
lxc config set idmap-test1 security.syscalls.intercept.mknod=true
lxc exec idmap-test1 bash
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
# edit /etc/docker/daemon.json
# {
# "storage-driver": "btrfs"
# }
service docker restart
docker run -it --rm busybox
mount | grep idmap
mknod /root/null c 1 3
rm -f /root/null
So, it’s an LXC container on btrfs storage, with a Docker container inside (using the btrfs storage driver). Interception works flawlessly.
Config
$ lxc config show idmap-test1 -e
architecture: x86_64
config:
image.architecture: amd64
image.description: ubuntu 22.04 LTS amd64 (release) (20230302)
image.label: release
image.os: ubuntu
image.release: jammy
image.serial: "20230302"
image.type: squashfs
image.version: "22.04"
security.nesting: "true"
security.syscalls.intercept.mknod: "true"
volatile.base_image: 72565f3fbae414d317b90569b6d7aa308c482fdf562aaf0c2eaa6e50fa39747b
volatile.cloud-init.instance-id: 5366658d-21ee-48b1-9013-b1c517411981
volatile.eth0.host_name: veth60a8d5f1
volatile.eth0.hwaddr: 00:16:3e:fb:02:23
volatile.idmap.base: "0"
volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
volatile.last_state.idmap: '[]'
volatile.last_state.power: RUNNING
volatile.uuid: 49beb5b4-1f92-42fd-b2b3-5face2f3503d
devices:
eth0:
name: eth0
network: lxdbr0
type: nic
root:
path: /
pool: btrfspool1
type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
$ lxc storage show btrfspool1
config:
size: 6GiB
source: /var/snap/lxd/common/lxd/disks/btrfspool1.img
description: ""
name: btrfspool1
driver: btrfs
used_by:
- /1.0/images/72565f3fbae414d317b90569b6d7aa308c482fdf562aaf0c2eaa6e50fa39747b
- /1.0/instances/idmap-test1
status: Created
locations:
- none
Please try to reproduce the original issue and then simplify the reproducer to find the root cause.
@amikhalitsyn I can confirm setting the storage driver to btrfs works indeed! I did also try overlay, overlay2 and fuse-overlayfs and none of those worked so it seems btrfs is the only one working.
But you’ve written:
“I reported an issue with security.syscalls.intercept.mknod misbehaving/not functioning as intended with anything beyond 5.15 LTS some time ago and now I’ve tested it again with the new 6.1 LTS and it still doesn’t work, so I thought I might just mention it again”
As far as I understand, something broke for you after the 5.15 LTS kernel, correct? But now you write that btrfs worked before and works now (with fresh kernels), correct? Then what’s the problem? Where is the kernel regression?
@amikhalitsyn
It turns out that, unless one manually forces Docker to use btrfs, Docker defaults to overlay2 after 5.19, and mknod interception does not seem to work then. I did not know this until you mentioned it. Forcing btrfs looks like this:
/etc/docker/daemon.json
_____________________________________
{
"storage-driver": "btrfs"
}
I am not sure whether this is a kernel regression, a Docker issue, or something else. I am only certain that it occurs after 5.19; prior to that, Docker seems to select the btrfs storage driver instead of overlay2. However, I also tested this with ext4, where Docker opts for the vfs driver, and mknod works there too.
uname -a
#Linux archlinux 6.1.15-1-lts #1 SMP PREEMPT_DYNAMIC Fri, 03 Mar 2023 12:22:08 +0000 x86_64 GNU/Linux
truncate -s 10GiB btrfspool.img
losetup -f btrfspool.img
lxc storage create btrfspool btrfs source=/dev/loop0
lxc init images:archlinux docker-btrfs --storage=btrfspool
lxc config set docker-btrfs security.{nesting=true,syscalls.intercept.mknod=true}
lxc start docker-btrfs
lxc exec docker-btrfs -- su -l
pacman -S vim docker
mkdir -p /etc/docker
echo -e '{\n\t"storage-driver":"btrfs"\n}' > /etc/docker/daemon.json
systemctl enable --now docker.service
docker run -it --rm busybox
mknod /root/null c 1 3
exit
sed -i 's/btrfs/overlay2/' /etc/docker/daemon.json
systemctl restart docker.service
docker run -it --rm busybox
mknod /root/null c 1 3
# mknod: /root/null: Operation not permitted
That’s probably because before 5.19 overlayfs failed to mount on top of an idmapped mount. If the container rootfs mount was idmapped, Docker used btrfs (or vfs) as a fallback storage driver. And yes, that explains why you get the vfs driver on ext4 but the btrfs driver on btrfs.
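The driver choices reported so far can be summarised as a small decision table. The sketch below only restates the observations from this thread (idmapped container rootfs, no storage driver forced in daemon.json); it is not Docker’s actual selection code, and observed_docker_driver is a hypothetical name:

```shell
#!/bin/sh
# Restates the observations from this thread; NOT Docker's real logic.
# $1: kernel major version, $2: kernel minor version,
# $3: filesystem backing the idmapped container rootfs
observed_docker_driver() {
    major=$1 minor=$2 backing=$3
    if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 19 ]; }; then
        echo overlay2   # overlayfs mounts on idmapped layers, so it is picked
    elif [ "$backing" = btrfs ]; then
        echo btrfs      # fallback reported on btrfs
    else
        echo vfs        # fallback reported on ext4 (assumed for anything else)
    fi
}

observed_docker_driver 5 15 btrfs
observed_docker_driver 5 15 ext4
observed_docker_driver 6 1 btrfs
```

Per the reports above, mknod interception works in the btrfs and vfs cases and misbehaves only once overlay2 is selected.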
# mknod: /root/null: Operation not permitted
That’s weird, because mknod interception on overlayfs doesn’t lead to an -EPERM error; it just goes to the fallback method and uses a bind mount of a device node from the host. That is bad, but not as bad as EPERM. Did you really execute this command listing yourself, and can you confirm that the EPERM is reproducible?
@amikhalitsyn affirmative.
uname -a
Linux 44b0b877ba2f 6.1.15-1-lts #1 SMP PREEMPT_DYNAMIC Fri, 03 Mar 2023 12:22:08 +0000 x86_64 GNU/Linux
mount | grep idmap
/dev/disk/by-uuid/d8183cdb-d608-4130-888e-87b91f7e0d68 on / type btrfs (rw,relatime,idmapped,space_cache=v2,user_subvol_rm_allowed,subvolid=295,subvol=/containers/docker-overlayfs)
docker info
...
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 2
Server Version: 23.0.1
Storage Driver: overlay2
Backing Filesystem: btrfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: false
userxattr: true
...
docker run -it --rm busybox
mknod /root/null c 1 3
mknod: /root/null: Operation not permitted
Interesting, but if you run our old test:
mount | grep idmap
mkdir {work,upper,lower,ovl}
mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work ovl
mknod /root/ovl/null c 1 3
stat /root/ovl/null
mount | grep null
inside the same container where you ran this experiment with Docker, what happens?
I just want to sort out and classify all the problems, so we can analyze this internally and decide on importance/priorities.
@amikhalitsyn inside the Docker container (docker run -it --rm busybox) or the LXD container nesting Docker?
Inside LXD container (works fine):
/dev/disk/by-uuid/d8183cdb-d608-4130-888e-87b91f7e0d68 on / type btrfs (rw,relatime,idmapped,space_cache=v2,user_subvol_rm_allowed,subvolid=295,subvol=/containers/docker-overlayfs)
File: /root/ovl/null
Size: 0 Blocks: 0 IO Block: 4096 character special file
Device: 8,2 Inode: 767236 Links: 0 Device type: 1,3
Access: (0666/crw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-03-08 15:06:45.844343223 +0000
Modify: 2023-03-08 15:06:45.844343223 +0000
Change: 2023-03-08 15:06:45.864343975 +0000
Birth: 2023-03-08 15:06:45.844343223 +0000
dev on /dev/null type devtmpfs (rw,nosuid,relatime,size=8169348k,nr_inodes=2042337,mode=755,inode64)
/dev/sda2 on /root/ovl/null type ext4 (rw,relatime)
You’ve got it right, inside the LXC container.
Yep, and as you can see from the result, mknod works, but (!) it creates a bind mount in place of /root/ovl/null rather than a device node (compare with your previous experiments on btrfs).
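The difference shows up in the mount table: a real device node has no mount entry of its own, while the fallback leaves a line such as "/dev/sda2 on /root/ovl/null type ext4". A minimal sketch of that check, assuming the "DEVICE on PATH type FS" layout of mount output and paths without spaces; is_mounted_in_listing is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the path in $1 appears as a mount target in
# `mount`-style output read from stdin.
is_mounted_in_listing() {
    awk -v target="$1" '
        $2 == "on" && $3 == target { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Example with the line from the experiment above.
if printf '%s\n' '/dev/sda2 on /root/ovl/null type ext4 (rw,relatime)' \
        | is_mounted_in_listing /root/ovl/null; then
    echo "bind mount (fallback path)"
else
    echo "no mount entry: plain device node"
fi
```

On a live system the listing would come from mount itself, e.g. mount | is_mounted_in_listing /root/ovl/null, run after the mknod call.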
Has there been any movement on this? It’s been a pretty large issue within my org, preventing software from working correctly. I am using an unprivileged LXC container running Docker, and the overlay2/overlayfs drivers are still not working. Reformatting all of my servers to use btrfs is not an option.
Docker version 24.0.2, build cb74dfc
lxc version 5.0.2
Linux 5.15.108-1-pve #1 SMP PVE 5.15.108-1 (2023-06-17T09:41Z) x86_64 GNU/Linux