Increase SOMAXCONN in unprivileged containers?

osshelp · April 3, 2019, 9:32pm

Hello, all.

Is there a way to increase net.core.somaxconn in unprivileged containers?

Based on my short research, this key was “namespaced” a very long time ago, so it should be available for tuning. This is easy to confirm with “ip netns …” (change somaxconn on the host, create a new ns, and change the value inside it as you like).
But when you’re using an unprivileged container, you also have a separate user namespace. Using a userns breaks this for LXD-managed containers and even for Docker (with userns enabled). And things get even worse when you read “man listen”:

If the backlog argument is greater than the value in /proc/sys/net/core/somaxconn, then it is silently truncated to that value; the default value in this file is 128. In kernels before 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128.

So you end up with a socket whose backlog is capped at the default of 128. People using Kubernetes found a workaround with privileged init containers; see, for example, the article “Kernel tuning in Kubernetes” on Bogdan Albei's blog. It looks like a huge headache, but it fixes the problem.

I’m not sure I was digging in the right direction, but I ended up with these:
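
A minimal sketch of the truncation rule from the man page excerpt, not kernel code. The function names `current_somaxconn` and `effective_backlog` are made up for illustration, and the fallback of 128 when the file is not visible is an assumption based on the default quoted above.

```shell
#!/bin/sh
# Sketch of the rule from listen(2): a requested backlog larger than
# net.core.somaxconn is silently capped at that value.

# current_somaxconn: print the limit for the current network namespace.
# Assumption: if the file is not visible (as in an unprivileged container),
# fall back to 128, the default quoted in the man page excerpt.
current_somaxconn() {
    file=/proc/sys/net/core/somaxconn
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo 128
    fi
}

# effective_backlog REQUESTED [LIMIT]: print min(REQUESTED, LIMIT).
# LIMIT defaults to the value reported by current_somaxconn.
effective_backlog() {
    requested=$1
    limit=${2:-$(current_somaxconn)}
    if [ "$requested" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$requested"
    fi
}

effective_backlog 1024 128   # prints 128
effective_backlog 64 128     # prints 64
```

Called with a single argument, `effective_backlog 1024` uses whatever limit is visible in the namespace where it runs.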

github.com/torvalds/linux/blob/63bdf4284c38a48af21745ceb148a087b190cd21/net/core/sysctl_net_core.c#L593

	tbl = netns_core_table;
	if (!net_eq(net, &init_net)) {
		tbl = kmemdup(tbl, sizeof(netns_core_table), GFP_KERNEL);
		if (tbl == NULL)
			goto err_dup;

		tbl[0].data = &net->core.sysctl_somaxconn;

		/* Don't export any sysctls to unprivileged users */
		if (net->user_ns != &init_user_ns) {
			tbl[0].procname = NULL;
		}
	}

	net->core.sysctl_hdr = register_net_sysctl(net, "net/core", tbl);
	if (net->core.sysctl_hdr == NULL)
		goto err_reg;

	return 0;

github.com/torvalds/linux

net: Don't export sysctls to unprivileged users

committed 01:30AM - 19 Nov 12 UTC

ebiederm

+98 -4

In preparation for supporting the creation of network namespaces by unprivileged users, modify all of the per net sysctl exports and refuse to allow them to unprivileged users. This makes it safe for unprivileged users in general to access per net sysctls, and allows sysctls to be exported to unprivileged users on an individual basis as they are deemed safe.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

It looks like it was “broken” a very long time ago and, based on the commit message, it was done intentionally (i.e. it’s not a bug). So the question remains: is there a way to get both security and performance, i.e. to tune somaxconn without creating a privileged container?

stgraber · April 4, 2019, 3:28am

It may be possible for LXD itself to modify such sysctls as real root in the container; the key would then need to be exposed through a limits. config key, and code would need to be added for LXD to perform that change.

stgraber · April 4, 2019, 3:28am

@brauner thoughts? Feels like we could have an LXD fork command which attaches to the netns, unshares a mntns, mounts proc and messes with sysctls; that should work for network sysctls.

osshelp · April 4, 2019, 3:37am

It would be great to have this. For example, Docker already has this feature in its own way:

docker run --sysctl net.ipv4.ip_forward=1 someimage

Took it from here:

osshelp · April 4, 2019, 3:55am

To clarify: the example above doesn’t work when the “userns-remap” option is enabled in dockerd (it enables the same behavior as LXD’s unprivileged containers, i.e. userns plus subuid/subgid). So the example from my previous comment is just a demonstration of Docker’s CLI option and the corresponding section in the “compose” config. I’m not sure it helps in any way, but I hope it does.

brauner · April 4, 2019, 4:26pm

You mean global sysctls for the whole system? That sounds like a bad idea imho.

osshelp · April 4, 2019, 4:47pm

@stgraber probably meant “namespaced” sysctl keys. I can’t quickly find a full list, but here is a small example of how the available sysctl keys differ between the default and a custom network namespace.

Default network namespace:

root@hostname:~# sysctl net 2>/dev/null | wc -l
719

root@hostname:~# sysctl net.core.wmem_default
net.core.wmem_default = 212992

root@hostname:~# sysctl net.core.somaxconn
net.core.somaxconn = 128
root@hostname:~# sysctl net.core.somaxconn=129
net.core.somaxconn = 129
root@hostname:~# sysctl net.core.somaxconn
net.core.somaxconn = 129

Custom namespace:

root@hostname:~# ip netns add test
root@hostname:~# ip netns exec test bash
root@hostname:~# sysctl net 2>/dev/null | wc -l
402
root@hostname:~# sysctl net.core.wmem_default
sysctl: cannot stat /proc/sys/net/core/wmem_default: No such file or directory

root@hostname:~# sysctl net.core.somaxconn
net.core.somaxconn = 128
root@hostname:~# sysctl net.core.somaxconn=130
net.core.somaxconn = 130
root@hostname:~# sysctl net.core.somaxconn
net.core.somaxconn = 130

And default again:

root@hostname:~# sysctl net.core.somaxconn
net.core.somaxconn = 129

I hope the difference is clearly visible: inside the custom namespace you can see and tune only the “namespaced” keys, and changes there don’t affect the host system (correct me if I’m mistaken here).

To repeat my initial question: is there a way to tune net.core.somaxconn inside an unprivileged LXC container (in a custom network namespace)?
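
A small sketch for checking which keys are visible from inside a given namespace, as in the transcripts above. The helper names `sysctl_path` and `sysctl_visible` are hypothetical, and the simple dot-to-slash mapping is an assumption that does not handle keys whose components contain dots themselves (for example some interface names).

```shell
#!/bin/sh
# sysctl_path KEY [ROOT]: map a dotted sysctl key to its file under ROOT
# (ROOT defaults to /proc/sys).
sysctl_path() {
    printf '%s/%s\n' "${2:-/proc/sys}" "$(printf '%s' "$1" | tr . /)"
}

# sysctl_visible KEY [ROOT]: succeed if the key's file exists, i.e. the key
# is exported in the namespace this shell is running in.
sysctl_visible() {
    [ -e "$(sysctl_path "$1" "$2")" ]
}

# Report visibility of the two keys compared in the transcripts.
for key in net.core.somaxconn net.core.wmem_default; do
    if sysctl_visible "$key"; then
        echo "$key: visible"
    else
        echo "$key: not visible"
    fi
done
```

Running it on the host, inside “ip netns exec test bash”, and inside an unprivileged container should make the differences described in this thread easy to compare side by side.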

brauner · April 4, 2019, 5:29pm

Every sysctl that is properly namespaced can be accessed in LXC through the

lxc.sysctl.{kernel parameter name}

key. So in your case you’d need to set:

lxc.sysctl.net.core.somaxconn = <value>

The question is whether somaxconn is namespaced or not. I don’t see it show up in an unprivileged container under /proc/sys/net/core/. So I doubt it is. If it is namespaced but people forgot to also namespace the sysctl then this is a kernel bug and should be fixed there.

osshelp · April 4, 2019, 6:31pm

Quoting brauner: “The question is whether somaxconn is namespaced or not. I don’t see it show up in an unprivileged container under /proc/sys/net/core/. So I doubt it is. If it is namespaced but people forgot to also namespace the sysctl then this is a kernel bug and should be fixed there.”

As I showed above, you can see it in a separate network namespace (also as a non-root user). The main problem is the use of a user namespace. As soon as you’re in an unprivileged LXC container, or an equivalent Docker container with userns-remap enabled, you can’t see this key. And as I wrote before, I tried to find out where and how this happens in the kernel, but got lost deep in the sources.

This is exactly the reason why I initially asked “Is there a way …?”. Because I’m not so familiar with container’s bootstrap process (how and when namespaces/mounts/interfaces are created).

Right now we have only two options: choose security (i.e. unprivileged) or performance (privileged plus tuning). After reading your article about the risks of privileged containers (or even super-privileged ones in Docker’s case), our doubts are even stronger. At the same time, we can’t sacrifice performance and end up with containerized applications that can’t even handle medium load. To be honest, it looks like a dead end to me. To be clear, I’m not asking anyone to solve our problems. I’m just asking for advice about a possible workaround in this difficult situation.

Thanks.

brauner · April 4, 2019, 10:46pm

So just to clear things up a little. We are dealing with two scenarios:

1. non-initial network namespace owned by the initial user namespace
2. non-initial network namespace owned by a non-initial user namespace

For 1. you can change somaxconn because you need to be privileged enough in the initial user namespace. For 2. you can’t change somaxconn because that would mean giving a non-initial user namespace the ability to affect a system-wide setting. This is a big no-no. So the real solution here, to me, seems to be to bump the somaxconn setting globally to a large enough value before starting any container, so that you don’t run into this issue.

osshelp · April 4, 2019, 11:12pm

Yep, we tried what you describe as well, and it didn’t work, even when we bumped the value before creating the container. We get the “hardcoded” 128 every time. Here is a small example:

root@hostname:~# sysctl net.core.somaxconn
net.core.somaxconn = 130

root@hostname:~# lxc init ubuntu:16.04 test
Creating test

root@hostname:~# lxc config set test security.privileged true
root@hostname:~# lxc config show test | grep privileged
  security.privileged: "true"

root@hostname:~# lxc start test
root@hostname:~# lxc exec test bash
root@test:~# sysctl net.core.somaxconn
net.core.somaxconn = 128

root@hostname:~# sysctl net.core.somaxconn
net.core.somaxconn = 130

So it’s: bump to 130 => create container => enter => see 128 instead of 130. That’s probably because of this:

github.com/torvalds/linux/blob/1a9df9e29c2afecf6e3089442d429b377279ca3c/include/linux/socket.h#L265

#define PF_ALG		AF_ALG
#define PF_NFC		AF_NFC
#define PF_VSOCK	AF_VSOCK
#define PF_KCM		AF_KCM
#define PF_QIPCRTR	AF_QIPCRTR
#define PF_SMC		AF_SMC
#define PF_XDP		AF_XDP
#define PF_MAX		AF_MAX


/* Maximum queue length specifiable by listen.  */
#define SOMAXCONN	128


/* Flags we can use with send/ and recv.
  Added those for 1003.1g not all are supported yet
*/


#define MSG_OOB		1
#define MSG_PEEK	2
#define MSG_DONTROUTE	4
#define MSG_TRYHARD     4       /* Synonym for MSG_DONTROUTE for DECnet */
#define MSG_CTRUNC	8

github.com/torvalds/linux/blob/63bdf4284c38a48af21745ceb148a087b190cd21/net/core/net_namespace.c#L343

	ops = saved_ops;
	list_for_each_entry_continue_reverse(ops, &pernet_list, list)
		ops_free_list(ops, &net_exit_list);


	rcu_barrier();
	goto out;
}


static int __net_init net_defaults_init_net(struct net *net)
{
	net->core.sysctl_somaxconn = SOMAXCONN;
	return 0;
}


static struct pernet_operations net_defaults_ops = {
	.init = net_defaults_init_net,
};


static __init int net_defaults_init(void)
{
	if (register_pernet_subsys(&net_defaults_ops))

PS: I made the container privileged just to show the “default” value in the container’s namespace. As discussed above, you can’t even see this key in an unprivileged one.

brauner · April 4, 2019, 11:39pm

Hm, you can try:

lxc config set <container-name> raw.lxc 'lxc.sysctl.net.core.somaxconn = 1000'

and restart a privileged container.

osshelp · April 5, 2019, 12:00am

And as expected it worked:

root@hostname:~# lxc init ubuntu:16.04 test
Creating test
root@hostname:~# lxc config set test security.privileged true
root@hostname:~# lxc config set test raw.lxc 'lxc.sysctl.net.core.somaxconn = 1000'
root@hostname:~# lxc start test
root@hostname:~# lxc exec test bash
root@test:~# sysctl net.core.somaxconn
net.core.somaxconn = 1000

But that’s not the problem. Anyone can achieve the same thing just by setting the same key via /etc/sysctl.conf on any distro. The initial problem remains: is it possible to do the same for an unprivileged container? I still can’t understand the difference between unprivileged users in the default namespace and the same users in a custom namespace. Here is what I’m talking about:

root@hostname:~# ip netns add test
root@hostname:~# ip netns exec test bash
root@hostname:~# sysctl net.core.somaxconn
net.core.somaxconn = 128
root@hostname:~# sudo -u nobody /bin/bash
nobody@hostname:~$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
nobody@hostname:~$ sysctl net.core.somaxconn
net.core.somaxconn = 128
nobody@hostname:~$ sysctl net.core.somaxconn=129
sysctl: permission denied on key 'net.core.somaxconn'

Yes, we can’t change the value as an unprivileged user. But at the same time we can still see it (even as a non-root user). So how is that different from lxd as an unprivileged user and its processes?

root@hostname:~# grep lxd /etc/subuid
lxd:165536:65536
root@hostname:~# ps auxf | grep -A1 '[t]est'
root      5799  0.0  0.0 205144  7160 ?        Ss   02:56   0:00 [lxc monitor] /var/lib/lxd/containers test
165536    5815  0.5  0.0  37472  5580 ?        Ss   02:56   0:00  \_ /sbin/init

When we’re in a container, we’re in different user and network namespaces (plus more). But why can’t we even see the sysctl key, which always exists by default in any additional network namespace?

osshelp · April 5, 2019, 12:43am

Hmm … Looks like here is a difference:

root@hostname:~# ls -lA /proc/$$/ns
total 0
lrwxrwxrwx 1 root root 0 Apr  5 03:08 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 net -> net:[4026532009]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 pid_for_children -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 uts -> uts:[4026531838]

root@hostname:~# ip netns exec test bash
root@hostname:~# ls -lA /proc/$$/ns
total 0
lrwxrwxrwx 1 root root 0 Apr  5 03:08 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 mnt -> mnt:[4026533082]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 net -> net:[4026532889]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 pid_for_children -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Apr  5 03:08 uts -> uts:[4026531838]

root@hostname:~# ps auxf | grep -A1 '[t]est'
root      5799  0.0  0.0 205144  7160 ?        Ss   02:56   0:00 [lxc monitor] /var/lib/lxd/containers test
165536    5815  0.0  0.0  37472  5580 ?        Ss   02:56   0:00  \_ /sbin/init
root@hostname:~# ls -lA /proc/5815/ns
total 0
lrwxrwxrwx 1 165536 165536 0 Apr  5 02:56 cgroup -> cgroup:[4026533078]
lrwxrwxrwx 1 165536 165536 0 Apr  5 02:56 ipc -> ipc:[4026532996]
lrwxrwxrwx 1 165536 165536 0 Apr  5 02:56 mnt -> mnt:[4026532994]
lrwxrwxrwx 1 165536 165536 0 Apr  5 02:56 net -> net:[4026532999]
lrwxrwxrwx 1 165536 165536 0 Apr  5 02:56 pid -> pid:[4026532997]
lrwxrwxrwx 1 165536 165536 0 Apr  5 03:05 pid_for_children -> pid:[4026532997]
lrwxrwxrwx 1 165536 165536 0 Apr  5 02:56 user -> user:[4026532993]
lrwxrwxrwx 1 165536 165536 0 Apr  5 02:56 uts -> uts:[4026532995]

These are ordered as “default”, “only net namespace changed” and “all namespaces changed”. It looks like as soon as both your net and user namespaces change, you instantly lose access to /proc/sys/net/core (it’s empty, but viewable).
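
The comparison of the three listings can be sketched as a small script. The helper names `ns_inode` and `classify_ns` are hypothetical, and the inode numbers are copied from the listings above. Note one assumption: this compares the user namespace of the process itself, which is only a stand-in for the owner of its network namespace.

```shell
#!/bin/sh
# ns_inode LINK: keep only the digits of a link target such as
# "net:[4026532009]", i.e. the namespace inode number.
ns_inode() {
    printf '%s\n' "$1" | tr -cd '0-9\n'
}

# classify_ns REF_NET REF_USER NET USER: compare a process's net and user
# namespace inodes against those of a reference shell.
classify_ns() {
    if [ "$3" = "$1" ]; then
        echo "same net namespace"
    elif [ "$4" = "$2" ]; then
        echo "only net namespace differs"
    else
        echo "net and user namespaces differ"
    fi
}

# Reference values: the first ("default") listing.
ref_net=$(ns_inode "net:[4026532009]")
ref_user=$(ns_inode "user:[4026531837]")

# Shell started with "ip netns exec test bash" (second listing).
classify_ns "$ref_net" "$ref_user" 4026532889 4026531837   # prints: only net namespace differs

# Container init, PID 5815 (third listing).
classify_ns "$ref_net" "$ref_user" 4026532999 4026532993   # prints: net and user namespaces differ
```

On a live system the link targets could be read with `readlink /proc/PID/ns/net` and `readlink /proc/PID/ns/user` instead of being typed in by hand.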

brauner · April 5, 2019, 12:44am

That’s what I tried to explain before. Probably not clear enough, which is my bad. The kernel has a concept of ownership for various namespaces, including network namespaces. When you ask the question “Am I allowed to perform an operation on this network namespace?”, the kernel will check who owns the network namespace in question. In the case of a privileged container the answer will be that the owner is the initial user namespace, which is where global root lives. In the case of an unprivileged container the answer is that the owner is a non-initial user namespace, which is not nearly as privileged as the initial user namespace. So the kernel will not allow you to even get near settings that are controlled by the initial user namespace.

osshelp · April 5, 2019, 12:53am

I’m probably missing something about why we can’t see net.core.somaxconn while we can still see and even change many other sysctl keys:

root@test:~# sysctl -e net 2>/dev/null | head -n 10
net.ipv4.conf.all.accept_local = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.arp_accept = 0
net.ipv4.conf.all.arp_announce = 0
net.ipv4.conf.all.arp_filter = 0
net.ipv4.conf.all.arp_ignore = 0
net.ipv4.conf.all.arp_notify = 0
net.ipv4.conf.all.bootp_relay = 0
net.ipv4.conf.all.disable_policy = 0

root@test:~# sysctl net.ipv4.conf.all.accept_local=1
net.ipv4.conf.all.accept_local = 1

That works even as an unprivileged user, i.e. fake root. So how is somaxconn different? It’s a namespaced key, and even if it is changed, it won’t affect the initial network namespace (i.e. the host system).

PS: Maybe it’s the wrong place to ask such questions, but I’m still trying to wrap my head around all this.

brauner · April 5, 2019, 1:27am

I don’t know what the exact reasoning behind this is but the kernel definitely doesn’t let you access any net/core sysctls at all.

gpatel-fr · April 6, 2019, 3:50pm

Well, you found it already; it’s just straight hardcoded in net/core/net_namespace.c:

static int __net_init net_defaults_init_net(struct net *net)
{
        net->core.sysctl_somaxconn = SOMAXCONN;
        return 0;
}

That’s the default value that is used, and after that the sysctl comes and changes it. But not for containers, because of this. I tried to remove it, and afterwards the root user of an unprivileged container could change the value for the container (and only the container: the host value is unchanged and other containers are not affected either).
In fact everything is almost perfect like this; the only part that could be better is an option to leave the ‘file’ /proc/sys/net/core/somaxconn owned by the ‘real’ root, so it could be set up by the lxd manager (with the raw.lxc key) and only used by the container software. I tried that and it definitely works.

It seems definitely wrong to hide the key. It should be owned by the global root instead.

As for your problem, my guess is that if you need a way, you can always compile your own kernel and set the default value SOMAXCONN higher. It will not really hurt standard software, just eat resources as the default queues may be larger (if the application software doesn’t set the value itself to a saner value). A maintenance headache, yes, but it beats a non-working system.