Setcap -v in unprivileged container

HGuillemet · October 14, 2020, 12:59pm

Hello,

On an unprivileged Gentoo container created with LXD on kernel 5.4.66, stored on ZFS:

# touch test
# setcap cap_sys_admin=pe test
# setcap -v cap_sys_admin=pe test
nsowner[got=1000000, want=0],test differs in []

Is this normal behavior?

On Gentoo, this kind of setcap verification runs when installing pam with the filecaps USE flag, and the installation fails.

stgraber · October 14, 2020, 1:06pm

I suspect setcap is what’s broken here.

Assuming that root in your unprivileged container maps to 1000000 outside, it is absolutely correct for nsowner to be 1000000 on that file. It being 0 would actually be a security issue.

Can you test that file capabilities actually function in your container?

HGuillemet · October 14, 2020, 1:49pm

# cp /bin/ping /tmp
# su - x -c "/tmp/ping google.com"
/tmp/ping: socket: Address family not supported by protocol
# setcap cap_net_raw+ep /tmp/ping
# su - x -c "/tmp/ping google.com"
PING google.com (142.250.74.206) 56(84) bytes of data.

So it seems to work.

HGuillemet · October 14, 2020, 1:54pm

And to complete the test:

# setcap -v cap_net_raw+ep /tmp/ping
nsowner[got=1000000, want=0],/tmp/ping differs in []

Is there anything else I can check?

stgraber · October 14, 2020, 9:12pm

OK, so that looks good, and it does indicate that the issue may be with the check that setcap is doing.

HGuillemet · October 14, 2020, 10:30pm

I’ll file a bug with libcap.

HGuillemet · October 26, 2020, 4:24pm

The libcap developers don’t have a solution yet: https://bugzilla.kernel.org/show_bug.cgi?id=209689

I also saw this older report, which seems to conclude that something in the kernel config can explain this.

Any ideas or advice?

stgraber · October 26, 2020, 4:48pm

Recent kernels allow capabilities to be used inside user namespaces. When that happens, the uid that root (0) inside the user namespace maps to is stored as part of the v3 capability format in the xattr.

That’s the nsowner you see in your output, which is 1000000 in your case, indicating that uid 0 in your container is real user 1000000 outside of it.

So the kernel is working as expected there.
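
To make the nsowner field concrete, here is a minimal Python sketch that decodes a raw security.capability xattr value. The layout (a little-endian magic/flags word, two pairs of permitted/inheritable 32-bit words, then a root ID in v3 only) follows the vfs_cap_data and vfs_ns_cap_data structs in linux/capability.h. The function name parse_cap_xattr and the dictionary keys are made up for illustration; this is not how libcap itself is written.

```python
import struct

# Constants as defined in linux/capability.h.
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001


def parse_cap_xattr(raw):
    """Decode a raw v2 or v3 security.capability value (hypothetical helper)."""
    if len(raw) < 20:
        raise ValueError("value too short for a v2/v3 capability xattr")
    magic_etc = struct.unpack_from("<I", raw, 0)[0]
    revision = magic_etc & VFS_CAP_REVISION_MASK
    if revision == VFS_CAP_REVISION_2 and len(raw) == 20:
        version, nsowner = 2, None
    elif revision == VFS_CAP_REVISION_3 and len(raw) == 24:
        version = 3
        # v3 appends the uid of the owning namespace's root.
        nsowner = struct.unpack_from("<I", raw, 20)[0]
    else:
        raise ValueError("unsupported revision or unexpected size")
    # Two {permitted, inheritable} pairs: low 32 bits, then high 32 bits.
    perm_lo, inh_lo, perm_hi, inh_hi = struct.unpack_from("<4I", raw, 4)
    return {
        "version": version,
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": perm_lo | (perm_hi << 32),
        "inheritable": inh_lo | (inh_hi << 32),
        "nsowner": nsowner,
    }
```

On Linux the raw value can be read with os.getxattr(path, "security.capability"). For a file given cap_net_raw=pe by root in a container mapped to 1000000, a v3 value would decode to version 3 with nsowner 1000000.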

HGuillemet · October 26, 2020, 5:07pm

OK, thanks for the explanation.
But how can two people using the same libcap version and the same filesystem get different results?

Also, why do containers have access to the real user ID? Isn’t it the goal of lxcfs to hide such information?

stgraber · October 26, 2020, 5:22pm

The setcap/getcap kernel calls behave as expected. In this case it looks like libcap is going one step further and validating the on-disk xattr, which does indeed include that ID.

The kernel doesn’t mangle xattrs when they are read from within the container. So long as you’re allowed to read the xattr, you see its raw, unmodified value.

I suspect libcap will need to learn that: if it sees an nsowner that’s not 0, it should check whether the nsowner matches root in the current user namespace (which it can do by parsing /proc/self/uid_map).
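
A rough Python sketch of that uid_map check, under stated assumptions: each line of /proc/self/uid_map has three fields (first ID inside the namespace, first ID outside, range length), and the "outside" column is relative to the parent namespace, so in nested namespaces it may not be the value the kernel stores. The function names are hypothetical and this is not code from libcap.

```python
def parse_uid_map(text):
    """Parse uid_map content into (inside_start, outside_start, count) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, count = (int(f) for f in fields)
        entries.append((inside, outside, count))
    return entries


def outside_id_of_root(entries):
    """Return the outside uid that uid 0 maps to, or None if root is unmapped."""
    for inside, outside, count in entries:
        if inside <= 0 < inside + count:
            return outside + (0 - inside)
    return None


def nsowner_matches_current_root(nsowner, uid_map_text):
    """True if nsowner equals the outside uid of root in this namespace."""
    root = outside_id_of_root(parse_uid_map(uid_map_text))
    return root is not None and root == nsowner
```

Inside the container this would be called with the content of /proc/self/uid_map, for example nsowner_matches_current_root(1000000, open("/proc/self/uid_map").read()).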

HGuillemet · October 26, 2020, 5:33pm

Serge Hallyn says that, on his machine, the host finds v3 capabilities on a container file, while the container sees v2 capabilities only. That may well be why it works as expected for him.

What could explain this difference?

HGuillemet · October 26, 2020, 5:46pm

From capabilities(7):
“Correspondingly, when a version 3 security.capability attribute is retrieved (getxattr(2)) by a process that resides inside a user namespace that was created by the root user ID (or a descendant of that user namespace), the returned attribute is (automatically) simplified to appear as a version 2 attribute (i.e., the returned value is the size of a version 2 attribute and does not include the root user ID). These automatic translations mean that no changes are required to user-space tools (e.g., setcap(1) and getcap(1)) in order for those tools to be used to create and retrieve version 3 security.capability attributes.”

Obviously this does not work for me.
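
For illustration only, the simplification the man page describes can be modelled in a few lines of Python: keep the flags and the capability words, switch the revision to 2, and drop the trailing root user ID. This is a model of the documented result, not the kernel's code, and the function name simplify_v3_to_v2 is made up.

```python
import struct

# Constants as defined in linux/capability.h.
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000


def simplify_v3_to_v2(raw):
    """Return the v2-shaped value for a 24-byte v3 capability xattr."""
    if len(raw) != 24:
        raise ValueError("a v3 capability xattr is 24 bytes long")
    magic_etc = struct.unpack_from("<I", raw, 0)[0]
    if magic_etc & VFS_CAP_REVISION_MASK != VFS_CAP_REVISION_3:
        raise ValueError("not a v3 capability xattr")
    flags = magic_etc & ~VFS_CAP_REVISION_MASK  # e.g. the effective bit
    # Same capability words (bytes 4..19); the 4-byte root ID is dropped.
    return struct.pack("<I", VFS_CAP_REVISION_2 | flags) + raw[4:20]
```

A tool that receives the 20-byte form never sees the root ID, which would explain a setcap -v that reports OK; receiving the 24-byte form inside the container matches the nsowner mismatch shown earlier in this thread.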

stgraber · October 26, 2020, 5:55pm

Yeah, the kernel handles the normal getcap case. There are ways to get to the low-level v3 cap struct, but from stracing getcap, it seems to be doing the right thing here at least.

root@shell01:~# setcap cap_net_raw=pe a
root@shell01:~# setcap -v cap_net_raw=pe a
a: OK
root@shell01:~# uname -a
Linux shell01 5.4.0-40-generic #44~18.04.1-Ubuntu SMP Wed Jun 24 23:13:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

What version of libcap are you running?

HGuillemet · October 26, 2020, 5:56pm

I tried both 2.43 and 2.44.

stgraber · October 26, 2020, 6:00pm

OK, here I’m on 2.25 and 2.32. I’ve got a Gentoo container running emerge now to get it installed, so I can see whether the newer versions are getting confused somehow or whether it’s a kernel issue.

HGuillemet · October 26, 2020, 6:01pm

Have a look at the linked bugzilla report: Serge Hallyn posted a C program that reports which cap version we see, and in my case I see v3 within the container. If I understand the man page correctly, I should see v2.

stgraber · October 26, 2020, 6:10pm

gentoo ~ # touch a
gentoo ~ # setcap cap_net_raw=pe a
gentoo ~ # setcap -v cap_net_raw=pe a
a: OK

That’s here on Gentoo with libcap 2.43 and an Ubuntu 5.8 kernel.

@hallyn, this seems likely to be a kernel issue then?

HGuillemet · October 27, 2020, 11:18am

Some news:
CONFIG_SECURITY was unset in my kernel config. With it set, containers see v2 caps and setcap -v works as expected.

If I remember correctly, lxc-checkconfig didn’t spot this missing config.
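
For anyone wanting to check this on their own kernel, here is a small Python sketch that looks for an option in kernel config text. It assumes the usual .config format ("CONFIG_FOO=y" or "# CONFIG_FOO is not set"); the function name config_enabled is made up, and where the config text comes from depends on the system.

```python
def config_enabled(config_text, option):
    """Return True if `option` is set to y or m in kernel config text."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return False
        # Match "OPTION=" exactly so CONFIG_SECURITY does not match
        # CONFIG_SECURITY_SELINUX and similar longer names.
        if line.startswith(option + "="):
            return line.split("=", 1)[1] in ("y", "m")
    return False  # option not mentioned at all
```

The text could come from /proc/config.gz (readable with gzip.open, and only present when the kernel was built with CONFIG_IKCONFIG_PROC) or from a config file shipped by the distribution, then be passed as config_enabled(text, "CONFIG_SECURITY").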

HGuillemet · December 14, 2020, 10:33pm

A fix has been added to kernel 5.10. This topic can be closed.

HGuillemet · January 2, 2021, 9:27pm

For the record, the fix has also been included in 5.4.86, 4.19.164 and 4.14.213.
CONFIG_SECURITY shouldn’t be necessary anymore for these versions.