Shipping systemd patches for NixOS

I recently took a stab at making NixOS work in unprivileged LXD containers by applying some systemd patches from Ubuntu’s systemd package.

Currently I have these patches applied:

  • Revert “namespace: be more careful when handling namespacing failures gracefully”
  • units: block CAP_SYS_MODULE units in containers too

Are those all the patches, or are more required?

There’s still a warning about systemd being unable to attach BPF egress, but it’s a softfail, so the service still seems to start.

Additionally, another maintainer asked whether there are PRs to get the patches upstream. If so, could I have links to those PRs?

Hmm, we’re not really involved with Ubuntu’s systemd and weren’t aware of any critical issue (just logging noise) in the variety of distros we build images for.

Indeed a bunch of things will fail in unprivileged containers but those do end up being softfail in most cases (ebpf, devices cgroup, some capabilities, some socket types, …).

We recently started using a global systemd override file to work around a bunch of such issues: https://github.com/lxc/distrobuilder/commit/6023d706270036eafd0cc384868f9723be812f66

Maybe something like that could be of use for you too.

Starting an Arch container:

lxc launch images:archlinux arch-test
lxc exec arch-test bash

Then systemctl status systemd-networkd.service gives an error about namespaces.

It’s the same on NixOS without the patch, so it’s clearly not a softfail. Ubuntu works though, because of its patches, specifically the “Revert ‘namespace:…’” one, which turns that exact error into a softfail.

× systemd-networkd.service - Network Service
     Loaded: loaded (/usr/lib/systemd/system/systemd-networkd.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/systemd-networkd.service.d
             └─lxc.conf
     Active: failed (Result: exit-code) since Tue 2021-04-27 19:28:50 UTC; 48s ago
TriggeredBy: × systemd-networkd.socket
       Docs: man:systemd-networkd.service(8)
    Process: 43 ExecStart=/usr/lib/systemd/systemd-networkd (code=exited, status=226/NAMESPACE)
   Main PID: 43 (code=exited, status=226/NAMESPACE)

Apr 27 19:28:50 arch-test systemd[1]: systemd-networkd.service: Scheduled restart job, restart counter is at 5.
Apr 27 19:28:50 arch-test systemd[1]: Stopped Network Service.
Apr 27 19:28:50 arch-test systemd[1]: systemd-networkd.service: Start request repeated too quickly.
Apr 27 19:28:50 arch-test systemd[1]: systemd-networkd.service: Failed with result 'exit-code'.
Apr 27 19:28:50 arch-test systemd[1]: Failed to start Network Service.

I’m not 100% sure whether it’s the way NixOS packages LXD (which looks fairly standard to me, but we could be missing a crucial workaround for this in our package) or whether it really is the systemd patch that Ubuntu applies to its systemd.

Judging from the fact that Ubuntu 21.04 still carries the patch, it doesn’t look like it got merged upstream.

Did you try the override file in the commit I linked above?

It seems to ship with it already

Hmm, indeed. What’s the systemctl cat systemd-networkd.service output?

https://termbin.com/eunn

It’s an unmodified images:archlinux running unprivileged, so you should be able to reproduce it locally. Does it work for you with Ubuntu’s LXD, or is there an error too? (Just so I know whether it’s the NixOS package or a general error.)
I’m using lxd-4.13.

Tested on Ubuntu 20.04 in VirtualBox, the error is the same.

The problem is /etc/systemd/system/systemd-networkd.service.d/lxc.conf. Delete it and things should behave. I’ll send a PR.

Deleted that. systemd-networkd works now, but I noticed systemd-resolved is still broken, which breaks DNS.

systemctl cat systemd-resolved | nc termbin.com 9999
https://termbin.com/u3da

systemctl status systemd-resolved

× systemd-resolved.service - Network Name Resolution
     Loaded: loaded (/usr/lib/systemd/system/systemd-resolved.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/service.d
             └─lxc.conf
     Active: failed (Result: exit-code) since Tue 2021-04-27 22:19:47 UTC; 1min 38s ago
       Docs: man:systemd-resolved.service(8)
             man:org.freedesktop.resolve1(5)
             https://www.freedesktop.org/wiki/Software/systemd/writing-network-configuration-managers
             https://www.freedesktop.org/wiki/Software/systemd/writing-resolver-clients
    Process: 51 ExecStart=/usr/lib/systemd/systemd-resolved (code=exited, status=226/NAMESPACE)
   Main PID: 51 (code=exited, status=226/NAMESPACE)

Apr 27 22:19:47 arch-test systemd[1]: systemd-resolved.service: Scheduled restart job, restart counter is at 5.
Apr 27 22:19:47 arch-test systemd[1]: Stopped Network Name Resolution.
Apr 27 22:19:47 arch-test systemd[1]: systemd-resolved.service: Start request repeated too quickly.
Apr 27 22:19:47 arch-test systemd[1]: systemd-resolved.service: Failed with result 'exit-code'.
Apr 27 22:19:47 arch-test systemd[1]: Failed to start Network Name Resolution.

Some more broken ones

[root@arch-test ~]# systemctl --failed
  UNIT                          LOAD   ACTIVE SUB    DESCRIPTION            
● systemd-hostnamed.service     loaded failed failed Hostname Service
● systemd-resolved.service      loaded failed failed Network Name Resolution
● systemd-journald-audit.socket loaded failed failed Journal Audit Socket

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
3 loaded units listed.

Adding ProtectKernelTunables=no seems to help, pushing that to our override.
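
For reference, a minimal sketch of what such an override could look like as a global drop-in. The default directory matches the Drop-In path in the status output above; write_override is a hypothetical helper name, and the override that actually ships in the images may set more options than the single one named here.

```shell
#!/bin/sh
# Sketch: write a global systemd drop-in that relaxes one sandboxing option.
# Only ProtectKernelTunables=no comes from this thread; treat the rest as
# illustrative scaffolding.
write_override() {
    # Target directory; defaults to the global service drop-in directory.
    dir="${1:-/etc/systemd/system/service.d}"
    mkdir -p "$dir" || return 1
    cat > "$dir/lxc.conf" <<'EOF'
[Service]
ProtectKernelTunables=no
EOF
}

# Inside the container, as root, one would then run something like:
#   write_override && systemctl daemon-reload && systemctl restart systemd-resolved
```

Because the drop-in lives in service.d rather than in a per-unit directory, it applies to every service unit, which is why it shows up under systemd-resolved as well.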

With that, the only unit that still fails to start is systemd-journald-audit.socket, and that’s expected: that one will never be allowed to start in a container. It might be worth getting upstream to add a container condition.

This patch from Ubuntu seems to take care of the journald audit socket and should be pushed to upstream systemd.

Then everything should be working just right.

From: Dimitri John Ledkov <xnox@ubuntu.com>
Date: Wed, 2 Aug 2017 00:40:28 +0100
Subject: units: set ConditionVirtualization=!private-users on journald audit
 socket

As it fails to start in an unpriviledged container.
---
 units/systemd-journald-audit.socket | 1 +
 1 file changed, 1 insertion(+)

diff --git a/units/systemd-journald-audit.socket b/units/systemd-journald-audit.socket
index f0c0aeb..07b2e49 100644
--- a/units/systemd-journald-audit.socket
+++ b/units/systemd-journald-audit.socket
@@ -14,6 +14,7 @@ DefaultDependencies=no
 Before=sockets.target
 ConditionSecurity=audit
 ConditionCapability=CAP_AUDIT_READ
+ConditionVirtualization=!private-users
 
 [Socket]
 Service=systemd-journald.service

Yeah, there’s no good reason for that one not to be upstream. Audit sockets are specifically not allowed in user namespaces, and because of how caps work in a userns, the cap condition doesn’t actually help.
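
To illustrate what the added condition keys on: private-users roughly means "running inside a user namespace", and a common heuristic for that is to check whether /proc/self/uid_map is anything other than the full identity mapping. The sketch below is a simplification under that assumption, not systemd's actual detection code, and in_private_users is a hypothetical name.

```shell
#!/bin/sh
# Heuristic sketch: in the initial user namespace, uid_map is the single
# identity mapping "0 0 4294967295". Anything else suggests a user namespace.
in_private_users() {
    # File to inspect; defaults to the calling process's own uid map.
    map="${1:-/proc/self/uid_map}"
    # awk normalises the column padding before the comparison.
    [ "$(awk '{ print $1, $2, $3 }' "$map")" != "0 0 4294967295" ]
}

# Example:
#   if in_private_users; then echo "user namespace"; else echo "host uids"; fi
```

Inside an unprivileged container, root still appears to hold CAP_AUDIT_READ within its own namespace, so a capability check passes even though the kernel refuses the audit socket. A check on the uid mapping does not have that blind spot.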

Feel free to send this to upstream systemd, I don’t see a reason why they wouldn’t easily pick it up.

PR’s out.

The removal of the networkd lxc.conf happens in distrobuilder, so that’s resolved with the new images?

The ProtectKernelTunables=no lxc.conf seems to come from LXD, so would that arrive with a new LXD release, or is that also distrobuilder?

Btw, I’m probably adding a NixOS image to distrobuilder after this, so prepare for that :wink:

Btw, for the future, if I may: I’d recommend a simple automated test across all systemd-based distros that checks the output of systemctl --failed and fails if any unit is listed.
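
A rough sketch of such a check, assuming the lxc client is available on the test host. count_failed and check_container are hypothetical helper names; --no-legend and --plain strip the header, footer and bullet markers so that each remaining line is one failed unit.

```shell
#!/bin/sh
# Count failed units from `systemctl --failed --no-legend --plain` output
# supplied on stdin. Every non-empty line describes one failed unit.
count_failed() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Succeed only if the named container reports no failed units.
check_container() {
    failed=$(lxc exec "$1" -- systemctl --failed --no-legend --plain | count_failed)
    [ "$failed" -eq 0 ]
}

# Example, using the container name from earlier in the thread:
#   check_container arch-test || echo "arch-test has failed units"
```

Keeping the counting separate from the lxc call makes the parsing easy to exercise against captured output without a running container.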

The workaround so far was security.nesting=true, but that’s just a workaround.

The fixes were in distrobuilder and in lxc-ci. They’ve all been pushed, so the next round of images should get the changes. Hopefully they don’t regress anything else.