Iptables cgroup V2 marking with systemd

Hello All,
I am trying to get iptables to mark packets from a specific systemd slice but I am not having much luck.

Here are the details:
I’m running LXD 4.6.
My container is an unprivileged Fedora 32 container, and “raw.lxc lxc.mount.auto = cgroup:rw:force” is set on it.
I can see the PID of the process in cgroup.procs under the cgroup for the service just fine.
I am using cgroup v2, and the systemd version is 245.8.

From inside the container running

iptables -t mangle -A INPUT -m cgroup --path "system.slice" -j MARK --set-mark 10

works fine for the entire system.slice.

I then tried to change the slice of a service, for example sshd, by overriding its unit file with
systemctl edit sshd and adding

[Service]
# or system-sshd.slice
Slice=sshd.slice

I then reboot the container and try

iptables -t mangle -A INPUT -m cgroup --path <no matter what path I type> -j MARK --set-mark 10

This fails, and dmesg shows an error like: xt_cgroup Invalid Path, errno=-2.

I am not sure what the correct “path” for this command would be.
What is the correct way to use cgroups to mark packets generated by or related to specific processes, so that they can be processed further in iptables/netfilter?

Thank you in advance

I edited the post to make it easier to read.


@stgraber, could you assist with this, please?

Thank you

More of a @brauner question. I wonder if that particular netfilter plugin is simply not aware of cgroup namespacing?

Thank you @stgraber for the quick response.
I have learned a lot from all your documentation, blogs and answers, whether on GH or here, and they have helped me in many situations. I wanted to thank you for all the hard work and effort that went into making LXC/LXD as stable and flexible as it is now, and for all the improvements you and the other devs have put in.

As for your response: if that’s the case, why does it seem happy with a top-level cgroup as the path, i.e. “system.slice”, but unable to identify sub-cgroups or other cgroups inside the container?
Also, I forgot to mention that this same iptables command works just fine when run on the host, without any errors about the path. For example, launch Firefox or any other application, find the application’s PID, then run
cat /proc/<PID>/cgroup
and use everything after “0::” as the path, between double quotes, for the iptables cgroup match. It works without issues, which seems to mean the plugin is aware of cgroup v2, and of cgroup v1 net_cls as well.
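
To illustrate the step above, here is a minimal sketch of pulling out the part after “0::” from the contents of a /proc/<PID>/cgroup file. The function name cgroup2_path and the sample path are made up for illustration; on a real system you would feed it /proc/<PID>/cgroup instead of the sample line.

```shell
#!/bin/sh
# Hypothetical helper: reads the contents of a /proc/<PID>/cgroup file on
# stdin and prints whatever follows "0::" (the cgroup v2 entry).
# Lines for cgroup v1 controllers (e.g. "5:net_cls:/...") are skipped.
cgroup2_path() {
    sed -n 's/^0:://p'
}

# Sample input for illustration only, not taken from a real system.
printf '0::/system.slice/sshd.service\n' | cgroup2_path
# prints: /system.slice/sshd.service

# On a live system this would be, for a given PID:
#   cgroup2_path < /proc/<PID>/cgroup
```

The printed value is what I pass to --path on the host.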

I spent 3 days searching for an answer without success, so I am hoping the more knowledgeable devs or users here can help with this.
If this is an incompatibility issue, can we find out where the culprit is and perhaps suggest a patch? After all, this is needed for per-process accounting, rate limiting and quotas, and for security in general.

Thank you again .

system.slice is a valid cgroup on the host, so if that particular module isn’t namespace aware, it would only be allowing cgroups that exist outside of the container.

Out of interest, what are you trying to achieve? Are you looking to apply specific rules to a container?

@stgraber yes, system.slice is a systemd default on both the host and the container, so:
system.slice on the host: works as path
system.slice in the container: works as path
system.slice/system-sshd.slice on the host: works as path
system.slice/system-sshd.slice in the container: does not work as path
However, system.slice inside the container is not the same as on the host. Running systemd-cgls inside the container and on the host gives different listings of the available slices: inside the container it only shows what the container’s systemd has created, and the host cgroups/slices are not visible.
@tomp What I am trying to achieve is basically per-process accounting and packet filtering. For example, I would like a service like sshd to accept connections from certain IPs and forward them, based on firewall mark, to one host, while the rest of the incoming connections go to a different host. For that I need to mark the packets in the mangle table and process them later in the INPUT and nat chains. That’s a sample use case.
In general, I would like to make use of systemd’s slices/cgroups to get some sort of per-application/per-process filtering, accounting and rate limiting, for example by allowing a certain service to go out only to certain IPs while the rest of the system can go out to any IP. Like allowing Firefox to connect to US and EU websites only, while Chromium can connect anywhere. I hope this helps explain what I am doing.
And yes, all the rules are to be applied inside the container.

Thank you .

@brauner, could you assist with this, please? Any ideas from anyone else if @brauner is busy?

Thank you

Is anyone able to help with this issue?

I think this is a kernel bug:

struct cgroup *cgroup_get_from_path(const char *path)
{
        struct kernfs_node *kn;
        struct cgroup *cgrp;

        mutex_lock(&cgroup_mutex);

        kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path);
        if (kn) {
                if (kernfs_type(kn) == KERNFS_DIR) {
                        cgrp = kn->priv;
                        cgroup_get_live(cgrp);
                } else {
                        cgrp = ERR_PTR(-ENOTDIR);
                }
                kernfs_put(kn);
        } else {
                cgrp = ERR_PTR(-ENOENT);
        }

        mutex_unlock(&cgroup_mutex);
        return cgrp;
}

It always tries to look up the path relative to the default cgroup root, so this is all complete garbage as soon as you are in a container, i.e. even the “working” tagging for system.slice is broken. I’ll put this on my TODO.
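
To make the effect concrete, here is a rough userspace sketch of the path resolution only, not kernel code. The helper resolve_from_root and the container cgroup /lxc.payload.c1 are hypothetical names picked for illustration; the real location of the container’s cgroup on the host depends on the setup.

```shell
#!/bin/sh
# Hypothetical helper: join a root cgroup directory and a path, which is
# roughly what a lookup "relative to a root" amounts to.
resolve_from_root() {
    # $1 = root the lookup starts from, $2 = path given to --path
    printf '%s/%s\n' "${1%/}" "${2#/}"
}

path="system.slice/system-sshd.slice"

# What the function quoted above effectively does: start at the host's
# default cgroup root, even when called from inside a container.
resolve_from_root "/" "$path"
# prints: /system.slice/system-sshd.slice

# What the container expects: start at the root of its own cgroup
# namespace (assumed here to be /lxc.payload.c1 on the host).
resolve_from_root "/lxc.payload.c1" "$path"
# prints: /lxc.payload.c1/system.slice/system-sshd.slice
```

With the first resolution, a slice that only exists inside the container is not found (errno=-2 is ENOENT), and a name like system.slice that exists on both sides resolves to the host’s cgroup rather than the container’s.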

@brauner thank you for taking the time to review. Should I open a GitHub issue for tracking purposes, or is there no need?

Thanks all for the help.

This is a kernel bug, and there is no GitHub issue tracker for the Linux kernel.