Unprivileged containers and capabilities

Hi,

Since LXD 3.0 no capabilities are dropped for unprivileged containers by default.

This was changed in commit ddab67a:

container_lxc: keep full capability set
Unprivileged container don’t need to drop any capabilities. The kernel will
enforce security for us.

I guess that it’s not “needed to drop any capabilities” in unprivileged containers because the unprivileged user running the container doesn’t actually have any capabilities on the host. Am I right?

My issue with this is that CAP_SYS_TIME is present in these containers and in all the applications running in them, telling them that they can modify the time on this system, which is never true for an unprivileged container (until timens gets into the kernel, I guess?).
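For reference, here is a minimal sketch of how an application might see the capability as present. It assumes the usual Linux layout of /proc/self/status (a hexadecimal CapEff: line) and that CAP_SYS_TIME is bit 25, as in linux/capability.h; the function names are only illustrative.

```python
CAP_SYS_TIME = 25  # bit number from <linux/capability.h>


def cap_in_mask(hex_mask: str, cap: int) -> bool:
    """Return True if capability bit `cap` is set in a hex mask such as CapEff."""
    return bool(int(hex_mask, 16) & (1 << cap))


def has_effective_cap(cap: int, status_path: str = "/proc/self/status") -> bool:
    """Read the effective capability set of the current process from procfs."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return cap_in_mask(line.split()[1], cap)
    raise RuntimeError("no CapEff line found in " + status_path)


if __name__ == "__main__":
    # In an unprivileged container this can print True even though the
    # kernel will refuse any attempt to actually change the clock.
    print(has_effective_cap(CAP_SYS_TIME))
```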

In my case this causes problems in an Ansible role that bases its decision on whether or not to install NTP services on this capability. In my view, capabilities should be the source of truth for this kind of decision.

If I can’t trust the CAP_SYS_TIME in this aspect, what else can I use for decisions like this?

Thanks!

Yeah, the fact that you can have capabilities yet still not be allowed to do something is certainly confusing for a lot of userspace applications. User namespaces are certainly one case where that happens, though the same can be true with SELinux, seccomp or AppArmor.

In most cases, the best way to figure out that kind of potential restriction is to try to perform the action against the API. For CAP_SYS_TIME, that could mean attempting a time adjustment with a 0 offset and seeing whether it succeeds. This kind of solution, while not always practical to implement, does have the benefit of being rock solid.
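A rough sketch of such a probe, under a few assumptions: it interprets "a 0 offset" as an adjtimex(2) call with ADJ_SETOFFSET and a zero timeval, it lays out struct timex by hand for 64-bit Linux (the layout differs on other ABIs), and the function names are made up for illustration.

```python
import ctypes

ADJ_SETOFFSET = 0x0100  # from <linux/timex.h>: add timex.time to the clock


class Timeval(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]


class Timex(ctypes.Structure):
    # Hand-written mirror of struct timex; assumes a 64-bit Linux ABI.
    _fields_ = [
        ("modes", ctypes.c_uint), ("offset", ctypes.c_long),
        ("freq", ctypes.c_long), ("maxerror", ctypes.c_long),
        ("esterror", ctypes.c_long), ("status", ctypes.c_int),
        ("constant", ctypes.c_long), ("precision", ctypes.c_long),
        ("tolerance", ctypes.c_long), ("time", Timeval),
        ("tick", ctypes.c_long), ("ppsfreq", ctypes.c_long),
        ("jitter", ctypes.c_long), ("shift", ctypes.c_int),
        ("stabil", ctypes.c_long), ("jitcnt", ctypes.c_long),
        ("calcnt", ctypes.c_long), ("errcnt", ctypes.c_long),
        ("stbcnt", ctypes.c_long), ("tai", ctypes.c_int),
        ("_reserved", ctypes.c_int * 11),
    ]


def probe_time_adjustment() -> int:
    """Ask the kernel to shift the clock by zero; return 0 or the errno."""
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process
    tx = Timex()                  # zero-initialised, so time is {0, 0}
    tx.modes = ADJ_SETOFFSET
    if libc.adjtimex(ctypes.byref(tx)) == -1:
        return ctypes.get_errno()  # e.g. EPERM in an unprivileged container
    return 0


def can_adjust_time() -> bool:
    """True only if the kernel accepted the zero-offset adjustment."""
    return probe_time_adjustment() == 0


if __name__ == "__main__":
    print("time can be adjusted:", can_adjust_time())
```

Any failure is treated as "cannot adjust", which keeps the decision tied to what the kernel actually allows rather than to the advertised capability set.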

Should you run such code against a kernel which then gains timens support, no change to the application would be required to then do the right thing in such an environment.

A perhaps more practical approach in your case is a simple in-container check, either as a condition in the relevant systemd units or by directly calling something like systemd-detect-virt, and then just skipping NTP in such cases.
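As an illustration of the second option, a small wrapper around systemd-detect-virt could look like the sketch below. It assumes the tool is on PATH and relies on its documented behaviour of exiting with 0 when --container detects a container; the function name and the None fallback are my own choices.

```python
import shutil
import subprocess
from typing import Optional


def running_in_container() -> Optional[bool]:
    """Return True/False from systemd-detect-virt, or None if it is missing."""
    exe = shutil.which("systemd-detect-virt")
    if exe is None:
        return None  # cannot tell; the caller decides on a default
    # --container limits detection to containers, --quiet suppresses output;
    # the exit status is 0 when a container was detected.
    result = subprocess.run([exe, "--container", "--quiet"], check=False)
    return result.returncode == 0


if __name__ == "__main__":
    if running_in_container():
        print("container detected: skipping NTP setup")
    else:
        print("no container detected (or unknown): NTP setup may proceed")
```

For the systemd unit route, a condition such as ConditionVirtualization=!container in the NTP service's unit would be one way to express the same check without any extra code.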