I don’t expect this to be fixable, other than by specifying the size of your tmpfs in /etc/fstab.
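For example, a line like this in the container’s /etc/fstab pins the tmpfs size explicitly instead of letting it default to half of the host’s physical RAM (the 512M value and the /dev/shm mountpoint are just illustrations, pick whatever fits your limit):

```
# cap this tmpfs at 512M regardless of how much RAM the host has
tmpfs  /dev/shm  tmpfs  defaults,size=512M  0  0
```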
Basically the kernel doesn’t care about the cgroup memory restrictions here. They do apply and will kick in if you go over them, but anything that’s sized based on the total amount of system memory will usually look at the actual host amount, not the cgroup limit.
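You can see the mismatch from inside a container with two reads (a sketch; the cgroup2 path below is the common layout and may differ on your system, and on cgroup1 the file is memory.limit_in_bytes instead):

```shell
# MemTotal reflects the host's physical RAM, not the container's limit:
grep MemTotal /proc/meminfo

# The cgroup limit is tracked separately (reads "max" when unlimited):
cat /sys/fs/cgroup/memory.max 2>/dev/null || echo "no cgroup2 memory.max here"
```

Anything that sizes itself from /proc/meminfo (tmpfs defaults included) will therefore use the first number, not the second.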
There’s no DoS here. You can only mount tmpfs as userns root, and the OOM killer will take you down once you go over your 2GB limit, which is perfectly fine. Any privileged process on the host can do the same thing.
If you look at dmesg you’ll see the corresponding OOM killer report.
The whole container doesn’t get killed; it’s just a normal OOM killer run restricted to the process list inside the cgroup.
The OOM killer never guaranteed it would kill the process that just went over the memory limit; instead it uses a scoring system to pick which process to kill. When a memory allocation fails because memory has run out, the process with the highest score gets killed. If that doesn’t free enough memory, the process with the next highest score gets killed, and so on.
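The score the kernel uses is visible per process through standard procfs, nothing container-specific:

```shell
# Higher oom_score = more likely to be picked by the OOM killer.
cat /proc/self/oom_score

# oom_score_adj biases the score: -1000 exempts a process entirely,
# +1000 makes it the preferred victim.
cat /proc/self/oom_score_adj
```

Daemons that want to survive an OOM event (or sacrifice themselves first) tune oom_score_adj, which is why the kill order inside a cgroup is not simply “biggest process first”.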
Interesting, that’s the default behaviour of tmpfs, so it seems the kernel is handling this directly - and by default the kernel has no knowledge of containers.
I wonder if it would be possible to intercept this behaviour, maybe like the shm limitations in Docker (see for example Docker Engine API v1.40 Reference, shmsize parameter).
That’s partially right but misses cgroups. lxc exec attaches to the container and - ignoring cgroup2 specialties, which don’t apply here - will usually move itself into init’s cgroup, i.e. it attaches to the same cgroup that init is in, since init (systemd) will not move itself into a separate cgroup in the memory hierarchy.
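You can check this from inside the container (a sketch; the output format differs between cgroup1, which prints one line per controller, and cgroup2, which prints a single `0::` line):

```shell
# The cgroup of the container's init (PID 1):
cat /proc/1/cgroup

# The cgroup of the shell spawned by `lxc exec` - on the memory hierarchy
# it will normally match init's, which is why they share one memory limit:
cat /proc/self/cgroup
```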
So the OOM killer sees one big fat cgroup and starts killing off tasks inside it. But it will usually not kill just a single task; it kills several. It starts with the fattest one, which should be the one that went over the memory limit, but then it immediately finds the next fattest process, which will usually be some systemd-<some-daemon>, and then another, almost guaranteed to hit systemd itself at some point, taking down the container.