RDMA/Infiniband in unprivileged container (HugeTLB Pages issue?)

Hello all,

Context

I am trying to run a program that uses Nvidia Rivermax as a regular (non-root) user inside an unprivileged Incus container.
The program deals with fast networked cameras.

I have passed through the NIC and related Infiniband nodes:

  devchar:
    path: /dev/char
    readonly: "true"
    source: /dev/char
    type: disk
  ibnic:
    mtu: "9000"
    nictype: physical
    parent: ens1f1np1
    type: nic
  uverbs1:
    path: /dev/infiniband/uverbs1
    type: unix-char

In a privileged container, everything works well (both as a non-root user and as root).

Problem

In an unprivileged container, however, the program segfaults when I try to allocate a framebuffer. I can see a few failing system calls, the first one seems to be:

shmget(IPC_PRIVATE, 16777216, SHM_HUGETLB|0600) = -1 EPERM (Operation not permitted)

Therefore, I think it may be related to hugepages not being available in the unprivileged container.

Attempted solution

I then tried to set some limits and capabilities:

config:
  limits.kernel.memlock: unlimited
incus config set acquisition security.syscalls.intercept.mount true
incus config set acquisition security.syscalls.intercept.mount.allowed hugetlbfs
incus config set acquisition security.syscalls.intercept.mount.shift true

I also tried to set capabilities (It definitely needs CAP_NET_RAW, even outside the container):

incus config set acquisition raw.lxc=lxc.cap.keep="ipc_lock net_raw"

But, although these get set successfully (Incus doesn’t complain), the container refuses to start with these in place. No error at all, nothing in the log. It just remains “stopped” after issuing “start”.

I’m out of ideas at the moment. Any help or pointers are greatly appreciated.

Best,
Mathijs

@amikhalitsyn did you ever look at how shmget functions and whether there’s a safe way to use that in unprivileged containers?

Of course, shmget is an IPC-namespaced thing, and shmget itself is allowed in unprivileged containers and always has been.

I believe that the problem is not shmget itself, but the SHM_HUGETLB flag. If we look at the can_do_hugetlb_shm function (linux/fs/hugetlbfs/inode.c at b320789d6883cc00ac78ce83bccbfe7ed58afcf0 in torvalds/linux on GitHub), we see the check capable(CAP_IPC_LOCK) || in_group_p(shm_group).

It can be allowed in an unprivileged container by mapping an appropriate GID into the container and setting the vm.hugetlb_shm_group sysctl to this GID (by default it is group ID 0).

We can do some research around hugetlbfs to understand why it wasn’t allowed in the user namespace in the first place. Maybe we can relax this permission requirement.

@ostheer, what you can try is to set security.idmap.base to 0. This will make your UID/GID mappings 1:1 with the host. (@stgraber am I right here?) Then everything should start to work for you. This will make your setup a little less secure, but still 10000 times better than using privileged containers ;)

security.idmap.base may work but I think this depends on security.idmap.isolated being enabled.

Instead, raw.idmap could be used to map the relevant UID, GID or both straight through, for whatever GID the sysctl is set to.

Thank you both for the replies! I will try setting the idmap tonight.

On the host, vm.hugetlb_shm_group indeed is 0.
Therefore, I would expect that only the root group would have permission.

However, in privileged mode (or on the host, outside the container for that matter), the regular (non-root) user can successfully use the program (despite not having GID 0).

I feel like it would be cleanest to not set the idmap shift to 0, but to instead add another GID to vm.hugetlb_shm_group. But the fact that my non-0 user is able to shmget without changing vm.hugetlb_shm_group is confusing to me.

Hi,

“But the fact that my non-0 user is able to shmget without changing vm.hugetlb_shm_group is confusing to me.”

You are right. We have to understand this, because from what I see in the kernel code, only a user with the CAP_IPC_LOCK capability (on the host!) or a member of the group set in vm.hugetlb_shm_group can create a new SHM_HUGETLB mapping.

Could you tell us how you run your program (when doing it on the host)? And which program it is?

Could it be that this program’s binary has the SUID bit set?

I just tried in a container with these settings:

security.idmap.base 0
security.idmap.isolated true
security.privileged false

And the program still does not work. Not even as root (which then really has ID 0).

I also verified that on the host (outside container), the program runs fine as a normal user (ID 1000), as long as I grant CAP_NET_RAW (which is also required in the priv. container).
The SUID bit is not set.

The program is a self-compiled C++ binary, using the Emergent Vision Technologies “eSDK” to capture raw image data from network cameras with low overhead.
The eSDK uses DMA through Nvidia Rivermax to bypass the networking stack for header splitting.

The program is run simply like this:

% sudo setcap -r ~/cam_test
% sudo setcap cap_net_raw=ep ~/cam_test
% ~/cam_test
Camera 0 configured
Camera 1 configured
...
Opening Stream and allocating buffers
Size 4512, 4512
Triggering cameras
Captured all 6 frames with lineTime 2081

If I omit CAP_NET_RAW, I get the same error on the host as what I get in an unprivileged container (with or without cap):

shmget(IPC_PRIVATE, 16777216, SHM_HUGETLB|0600) = -1 EPERM (Operation not permitted)

This seems to confirm that the (G)ID is not the critical factor at the moment.
Instead, could it be that the capabilities (NET_RAW and/or IPC_LOCK) are not available in the unprivileged container?

When I try to keep those capabilities using the raw.lxc property, the container refuses to start (see original post).

Even this hypothesis seems to go against @amikhalitsyn’s suggestion though; he says that being in the vm.hugetlb_shm_group should be sufficient (i.e. when GID==0, IPC_LOCK cap is not required).

Let’s go step by step.

  1. Have you tried sudo -g#0 ~/cam_test without setting any capabilities on the file itself?

  2. “Even this hypothesis seems to go against @amikhalitsyn’s suggestion though; he says that being in the vm.hugetlb_shm_group should be sufficient (i.e. when GID==0, IPC_LOCK cap is not required).”

I’ve written the following program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	int shmid;
	char buf[200];

	/* print our creds: uid/gid/groups and the Cap* lines of this process */
	system("id");
	snprintf(buf, sizeof(buf), "grep -i cap /proc/%d/status", getpid());
	system(buf);

	/* try to get a huge-page-backed SysV shared memory segment */
	shmid = shmget(IPC_PRIVATE, 1024*1024, SHM_HUGETLB | 0600);
	if (shmid == -1) {
		perror("shmget");
		exit(1);
	}

	/* release the memory segment again */
	if (shmctl(shmid, IPC_RMID, NULL) == -1) {
		perror("shmctl");
		exit(1);
	}

	return 0;
}

If I compile it with gcc testcap.c -o testcap and then run it as sudo -g#0 ./testcap, it works like a charm (and, of course, it fails if I run it as plain ./testcap). So having GID == 0 does solve the issue with shmget(SHM_HUGETLB).

  3. Capabilities are user-namespaced. You can have the CAP_SYS_ADMIN capability in an unprivileged container and at the same time be an absolutely regular unprivileged user on the host. That’s why these containers are called “unprivileged”. In the Linux kernel code we have two kinds of capability checks. One looks like capable(CAP_SYS_TIME), for example, and the second looks like ns_capable(file->f_cred->user_ns, CAP_NET_ADMIN). The difference between the two is that the first ensures that the user has the CAP_SYS_TIME capability on the host, while the second ensures that the user has the CAP_NET_ADMIN capability in the file->f_cred->user_ns user namespace. In the case of the SHM_HUGETLB check we are discussing, the kernel checks capable(CAP_IPC_LOCK) || in_group_p(shm_group). We should read it as “the user has the CAP_IPC_LOCK capability on the host or is a member of shm_group on the host”. If you are running your application in an unprivileged container, you have no way to give it CAP_IPC_LOCK on the host, because that would defeat the point of an unprivileged container. But you can still make your user inside that container a member of shm_group on the host (by manipulating the GID mappings of the container’s user namespace).

  4. [Hypothesis] Now about CAP_NET_RAW and the weirdness around it. Basically, there is no logical relation between shmget(SHM_HUGETLB) and CAP_NET_RAW. But what may be happening in your case is that the eSDK has several ways of doing inter-process communication: one might use SysV IPC (shmget() and friends), and another might use network sockets and do something very tricky with them that requires CAP_NET_RAW. So, most likely, when you give your application CAP_NET_RAW, it just takes the network-socket path and never calls shmget(SHM_HUGETLB), while if you remove this capability, your application falls back to shmget(SHM_HUGETLB) and fails on that too, because it lacks CAP_IPC_LOCK and is not a member of shm_group. You can prove or disprove this by stracing your program like this: sudo strace -f -e shmget ~/cam_test. This will show whether your program even tries to call shmget when it has sufficient privileges.

I can confirm that your test program works the way you describe: Without any caps, in the privileged container, it works with GID 0, and does not work without it.

Thank you for the info about how capabilities work in Linux, that’s very helpful.

About the problem at hand, I’ve done a few more tests, and found the following:

  • The program (cam_test) does not work without CAP_NET_RAW when run with GID 0.
  • The program does work without any caps in a root shell

Now I made a copy of strace, which I granted CAP_NET_RAW, named strace2.

  • Running ./strace2 -e shmget cam_test gives a surprising result: The shmget calls fail (EPERM), but the program still works (buffers allocated, frames captured from camera)
  • Running sudo -g#0 ./strace2 -e shmget cam_test also works, however this time, the shmget calls succeed.

Finally, running with GID 0, but without CAP_NET_RAW on strace2, I see:

  • The program does not work. Failing system calls are all ioctl(16, RDMA_VERBS_IOCTL, ...). A few times with EINVAL and once with EPERM. A few such calls do succeed.

So, to me it seems that the IPC_LOCK/shmget was a red herring. It looks like HugeTLB is attempted, but some sort of fallback is used if not available. I contacted Emergent, and they said the shmget call comes from within the Nvidia/Mellanox/Rivermax library.
The real reason my program does not work in an unprivileged container is simply the missing NET_RAW capability.
I have yet to test this, but I expect the shmget calls to succeed in both the privileged and the unprivileged container if I match up the IDs w.r.t. vm.hugetlb_shm_group. This will not make cam_test work, however, because that still requires NET_RAW.
About your hypothesis: I think it’s not exactly right, but was definitely on the right path about a fallback mechanism being at play.
Correct me if this sounds wrong.

My remaining questions:

  1. I’m not sure if it’s possible to answer this question, as it may be entirely application-specific, but is it possible that the failing ioctl call is indeed caused by missing NET_RAW capability?
  2. Assuming yes to point 1, is there any way to still make all of this work in an unprivileged container? Assuming all we need is CAP_NET_RAW.
  3. In my original post I mentioned trying to set raw.lxc=lxc.cap.keep="net_raw". With this parameter set, however, the container does not start. Is it expected not to start, i.e. is keeping caps like that unsupported? If it should have worked, do you expect it would even solve the problem?

root shell == as the root user? If you run a program as the root user, it has all the capabilities even if you don’t set them on the file. For instance, you can run cat /proc/self/status | grep -i capeff from a root shell and you’ll see something like 000001ffffffffff, which means that all caps are set. If you run the same as an unprivileged user, you’ll see 0000000000000000.

Great experiment! This is what we need.

yes, precisely.

It is possible to answer, but I need to see all the arguments to this ioctl() syscall. My wild guess is that this commit (“RDMA/core: Add support to set privileged QKEY parameter”, torvalds/linux@465d6b4 on GitHub) might be related. So you could try running rdma system set privileged-qkey on (from the commit message) on your system and check whether this helps (you need a Linux kernel >= 6.7).

I analyzed the kernel code and can see only a few cases where CAP_NET_RAW is checked in the initial user namespace; in most cases this capability is checked in the container’s user namespace (and these cases are good for us). Most of these “bad” cases are related to the infiniband driver, the mlx5 driver, appletalk, ax25, bluetooth, mctcp, nfc and ieee802154. From this list, I guess only the first two are relevant to your case.

No, this won’t work, because you can’t have an unprivileged container (i.e. one using a user namespace) and at the same time keep this capability from the initial user namespace.

You need to understand that in an unprivileged container you do have all capabilities, but they are tied to the user namespace of the container. For example, you can have CAP_NET_RAW in the initial user namespace, where it is part of the “root user superpower”, or you can have CAP_NET_RAW in the container’s user namespace, which allows you to do a lot of stuff protected with ns_capable(ns, CAP_NET_RAW) but will fail a capable(CAP_NET_RAW) check. By default, when you launch an unprivileged container, the root user inside it has all capabilities (including CAP_NET_RAW, CAP_IPC_LOCK, etc.), but these only apply to resources attached to the container’s user namespace.

I know. It’s all very complicated. (-:

Interesting. Thanks again for the help and explanation!
This definitely helps clear up the situation.

Thanks also for pointing me to that QKEY info.
I tried setting it (sudo /opt/mellanox/iproute2/sbin/rdma system set privileged-qkey on, the regular/on-path rdma binary does not recognize the parameter), but unfortunately, this did not change anything.

In privileged vs. unprivileged, the straces look identical until the EPERM occurs (for unpriv.):

Shared beginning/initialisation of program:

...
socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 17
...
openat(AT_FDCWD, "/dev/infiniband/uverbs1", O_RDWR|O_CLOEXEC) = 14
...
setsockopt(17, SOL_SOCKET, SO_BINDTODEVICE, "eth1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0
bind(17, {sa_family=AF_INET, sin_port=htons(37218), sin_addr=inet_addr("192.168.88.1")}, 16) = 0
getsockname(17, {sa_family=AF_INET, sin_port=htons(37218), sin_addr=inet_addr("192.168.88.1")}, [16]) = 0
ioctl(14, RDMA_VERBS_IOCTL, 0x7ffc78b10d80) = 0
ioctl(14, RDMA_VERBS_IOCTL, 0x7ffc78b10c30) = 0
ioctl(14, RDMA_VERBS_IOCTL, 0x7ffc78b109c0) = 0
...

Trace continues with errors when unprivileged (and gets killed)

...
ioctl(14, RDMA_VERBS_IOCTL, 0x7ffc78b10970) = -1 EPERM (Operation not permitted)
ioctl(14, RDMA_VERBS_IOCTL, 0x7ffc78b10a10) = 0
close(17)                               = 0
ioctl(14, RDMA_VERBS_IOCTL, 0x7ffc78b11450) = 0
ioctl(14, RDMA_VERBS_IOCTL, 0x7ffc78b11450) = 0
ioctl(14, RDMA_VERBS_IOCTL, 0x7ffc78b116e0) = -1 EINVAL (Invalid argument)
ioctl(14, RDMA_VERBS_IOCTL, 0x7ffc78b116f0) = -1 EINVAL (Invalid argument)
...
write(1, "EVT_ERROR_ENODEV: No such device"..., 33EVT_ERROR_ENODEV: No such device
...
+++ killed by SIGSEGV +++

Trace continues without errors when privileged (actual camera traffic starts)

...
ioctl(14, RDMA_VERBS_IOCTL, 0x7fff25c09b20) = 0
sendto(11, "B\1\0\202\0\10\08\0\0\260\4\0\0\0\1", 16, 0, {sa_family=AF_INET, sin_port=htons(3956), sin_addr=inet_addr("192.168.88.44")}, 16) = 16
pselect6(12, [11], NULL, NULL, {tv_sec=5, tv_nsec=0}, NULL) = 1 (in [11], left {tv_sec=4, tv_nsec=999853234})
recvfrom(11, "\0\0\0\203\0\4\08\0\0\0\1", 12, 0, {sa_family=AF_INET, sin_port=htons(3956), sin_addr=inet_addr("192.168.88.44")}, [16]) = 12
mmap(NULL, 30552064, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x79fd83000000
mlock(0x79fd83000200, 30551040)         = 0
write(1, "Triggering cameras\n", 19Triggering cameras
)    = 19
sendto(3, "B\1\0\202\0\10\09\0\0\260\10\0\0\0\1", 16, 0, {sa_family=AF_INET, sin_port=htons(3956), sin_addr=inet_addr("192.168.88.39")}, 16) = 16
pselect6(4, [3], NULL, NULL, {tv_sec=5, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=4, tv_nsec=999894404})
recvfrom(3, "\0\0\0\203\0\4\09\0\0\0\1", 12, 0, {sa_family=AF_INET, sin_port=htons(3956), sin_addr=inet_addr("192.168.88.39")}, [16]) = 12
...

I continue to look for a solution to the ioctl EPERM in the unprivileged container.