Weird memory behavior of ElasticSearch inside LXC container

ruskofd · July 5, 2021, 1:21pm

Hello everyone,

This weekend, I played with LXC/LXD containers and Elasticsearch in my homelab for the sake of learning.

As usual, I used Fedora as the “OS” for the container, set limits for memory (6GB) and CPU (4 vCPUs), and set up a single Elasticsearch node. But when I try to start the service, it fails miserably. I looked for clues about why it failed and found that the process seems to be OOM-killed; it looks like the JVM heap allocation was completely borked. I checked the process closely and it allocates a “shit-load” of memory. I guess it tried to size the heap according to my host memory (64GB of RAM).

I didn’t understand, so I tried a virtual machine with the same hardware limits and it works perfectly fine: the JVM sized its heap correctly, so WTF…

I kept searching for an explanation with no luck until I changed one setting, the famous security.nesting, which I use because of the recent systemd changes to the sandboxing settings applied to the systemd-networkd unit (and others too, I guess). Once I set it to false and rebooted my container, the Elasticsearch process started without any problems and the heap allocation looked fine.

I checked the /proc/meminfo file to check LXCFS “virtualization” of memory values and it looks fine, the limits were correct.
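
For anyone who wants to repeat that check, here is a minimal Python sketch that compares the MemTotal value from /proc/meminfo with the cgroup memory limit. The cgroup file locations are assumptions (they depend on the cgroup version and on how the container mounts /sys/fs/cgroup), and the helper names are made up for illustration.

```python
from pathlib import Path

# Assumed locations of the cgroup memory limit; they depend on the cgroup
# version and on how the container mounts /sys/fs/cgroup.
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)


def meminfo_total_bytes(meminfo_text):
    """Return MemTotal from the contents of /proc/meminfo, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The line looks like "MemTotal:        6291456 kB".
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def parse_cgroup_limit(raw):
    """Return the cgroup limit in bytes, or None when it is unlimited ("max")."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def read_cgroup_limit():
    """Read the first cgroup limit file that exists, or return None."""
    for name in CGROUP_LIMIT_FILES:
        path = Path(name)
        if path.is_file():
            return parse_cgroup_limit(path.read_text())
    return None


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.is_file():
        total = meminfo_total_bytes(meminfo.read_text())
        print("MemTotal from /proc/meminfo:", total, "bytes")
    print("cgroup memory limit:", read_cgroup_limit())
```

If both numbers match the container limit while the JVM still sizes its heap from the host memory, the JVM is probably getting its value from somewhere else.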

Is it intended that a process can “override” its view of /proc when security.nesting is used? I found issues on GitHub about this kind of behavior with Docker when this setting is applied, but why does the application seem to behave correctly when the parameter is not set? I read that the JVM uses the meminfo file, so it is very confusing.

Any ideas?

Thanks

stgraber · July 5, 2021, 7:43pm

It’s possible that the JVM is directly looking at /sys for the memory amount instead of parsing the cgroup or meminfo amount.

When that happens you often can directly pass additional arguments to the JVM through a /etc/default/XYZ type file which then lets you restrict it to a value suitable for the container.

It’s always a bit annoying when you find software that does that because it feels like they’re jumping through a large number of hoops to end up with a value which is completely pointless as soon as any kind of restriction is applied on the process (whether it’s because it’s a container or because it’s running with a memory limit through systemd).
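
As a rough sketch of the kind of override mentioned above, the Python snippet below computes explicit heap flags from a container memory limit. `-Xms` and `-Xmx` are the standard JVM heap flags; the 50% share follows the usual Elasticsearch guidance of giving the heap no more than half of the available memory, and the function name is made up for illustration.

```python
def heap_options(limit_bytes, fraction=0.5):
    """Build -Xms/-Xmx flags for a given share of the container memory limit."""
    heap_mib = int(limit_bytes * fraction) // (1024 * 1024)
    if heap_mib < 1:
        raise ValueError("memory limit too small for a JVM heap")
    # Same value for both flags so the heap is not resized at runtime.
    return [f"-Xms{heap_mib}m", f"-Xmx{heap_mib}m"]


if __name__ == "__main__":
    # 6 GiB is the container memory limit used in this thread.
    print("\n".join(heap_options(6 * 1024**3)))
```

For Elasticsearch, the two lines would typically go into a file under its jvm.options.d directory (for example /etc/elasticsearch/jvm.options.d/heap.options on a package install, which is an assumed path here) or into the ES_JAVA_OPTS environment variable.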

ruskofd · July 5, 2021, 9:13pm

That makes sense indeed. Forcing the JVM memory is an official recommendation for running ES in containers, so that’s fine for me. But the change in behavior when nesting mode is enabled bothers me. I’m curious about how things work, which is why I’m digging into the question.

Shouldn’t it only apply a less strict AppArmor policy, so that apps like LXD or Docker can run by being allowed to do some specific mounts?

In the LXC man pages, it says this about nesting mode:

If set this to 1, causes the following changes. When generated apparmor profiles are used, they will contain the necessary changes to allow creating a nested container. In addition to the usual mount points, /dev/.lxc/proc and /dev/.lxc/sys will contain procfs and sysfs mount points without the lxcfs overlays, which, if generated apparmor profiles are being used, will not be read/writable directly.

Does this mean that once this mode is enabled, the LXCFS limit overlays in the container’s procfs and sysfs are not applied? That confuses me again, because I do see my limits when I look for them in meminfo or cpuinfo (if I correctly understand that the role of LXCFS is to override these files once limits are applied). Is there something else to know about this?

To summarize my “issue” (not a real one, pure curiosity):

security.nesting set to true: the JVM “sees”, through some Linux dark magic, all of the host memory, so the automatic heap sizing is wrong.
security.nesting set to false: the JVM correctly “sees” the container limits and its heap is sized correctly.

So if I’m not crazy, something in procfs and/or sysfs changes. I tried to look through them but haven’t found an answer for now.

Thanks

stgraber · July 5, 2021, 9:50pm

security.nesting=true will do two main things:

Mount a hidden copy of /sys and /proc under /dev/.lxc (avoids overmounting checks in the kernel)
Relax apparmor rules

The combination of those two means that it’s now possible for something to mount a new copy of /proc and /sys which do not have lxcfs applied. I don’t know if ES does it directly or if systemd is used for that or something, but it could explain the difference.
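
One way to look for such a mount is to read the mount table of the process in question. The Python sketch below lists which paths carry an lxcfs overlay and where procfs instances are mounted. It assumes lxcfs shows up with the filesystem type `fuse.lxcfs`, and the function names are made up for illustration.

```python
from pathlib import Path


def _mount_fields(mounts_text):
    """Yield (mount point, filesystem type) for each line of a mounts file."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3:
            yield fields[1], fields[2]


def lxcfs_mount_points(mounts_text):
    """Return the mount points whose filesystem type is fuse.lxcfs."""
    return [point for point, fstype in _mount_fields(mounts_text)
            if fstype == "fuse.lxcfs"]


def proc_mount_points(mounts_text):
    """Return every mount point where a procfs instance is mounted."""
    return [point for point, fstype in _mount_fields(mounts_text)
            if fstype == "proc"]


if __name__ == "__main__":
    mounts = Path("/proc/self/mounts")
    if mounts.is_file():
        text = mounts.read_text()
        print("lxcfs overlays:", lxcfs_mount_points(text))
        print("procfs mounts:", proc_mount_points(text))
```

Reading /proc/<pid>/mounts for the Elasticsearch process instead of /proc/self/mounts shows the mount table as that process sees it, which matters if the service runs in its own mount namespace.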

ruskofd · July 6, 2021, 8:55am

Thanks for your answer.

So, I will look at the JVM side of things to find information about how it relates to these pseudo-filesystems, and I will report anything useful here.

As a junior Linux system engineer in everyday life, I find this topic really interesting for learning how everything works under the hood.

ruskofd · July 13, 2021, 5:18pm

My problem is gone: since last week, the nesting option is no longer required to run systemd-networkd without problems in the official images. So I can run my ES cluster(s) in a regular container without “weird” behaviours.

However, I kept searching for information about the JVM’s relationship with proc and sys, but no luck.

stgraber · July 14, 2021, 2:28am

Ah, good to hear that all our systemd trickery is paying off