LXC Container in Proxmox using 90% of memory with all processes killed

Hello everyone,

I’m currently running Plex inside a Docker container inside an LXC container. The cluster is backed by Ceph storage. The issue I’m encountering is that sometimes the LXC container runs out of memory; this is also happening with a few other LXC containers.

The container has a memory limit of 8 GB, but after a while it gets closer and closer to this limit until it eventually runs out. When this happens I can’t even connect to the container anymore and have to force-stop it.

Today, when I wanted to investigate the issue a bit further, I noticed my Plex LXC was filling up again. So I logged in and started stopping all processes.
I first manually stopped all Docker containers, and then also stopped docker and containerd using systemctl stop.
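
For reference, the stop sequence was along these lines (a rough sketch; the exact container names depend on what’s running):

# stop all running Docker containers
docker stop $(docker ps -q)

# then stop the Docker daemon and containerd themselves
systemctl stop docker
systemctl stop containerd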

The strange thing, though, is that after that the memory usage in both Proxmox and htop still showed about 7 GB out of 8 GB used:

[screenshot: Proxmox memory usage]

(I can only add one image, so I’ll add the other one later)

What is weird though is that none of the processes in htop show any significant memory usage.

The command free -m also shows that about 7 GB is used:

root@lxc-plex:~/dockercomposers/plexplox# free -m 
               total        used        free      shared  buff/cache   available
Mem:            8192        7129         769           0         292        1062
Swap:              0           0           0

Next I ran ps aux to find out the memory usage per process:

root@lxc-plex:~/dockercomposers/plexplox# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0 168800  8320 ?        Ss   Mar13   0:26 /sbin/init
root          43  0.0  0.2  74308 20608 ?        Ss   Mar13   0:02 /lib/systemd/systemd-journald
systemd+      82  0.0  0.0  17996  3840 ?        Ss   Mar13   0:00 /lib/systemd/systemd-networkd
root         114  0.0  0.0   3600   640 ?        Ss   Mar13   0:00 /usr/sbin/cron -f
message+     115  0.0  0.0   9296  2176 ?        Ss   Mar13   0:02 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root         119  0.0  0.0  17164  2688 ?        Ss   Mar13   0:00 /lib/systemd/systemd-logind
root         124  0.0  0.0   2516   640 pts/0    Ss+  Mar13   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud - 115200,38400,9600 linux
root         125  0.0  0.0   6120  1152 pts/1    Ss   Mar13   0:00 /bin/login -p --
root         126  0.0  0.0   2516   512 pts/2    Ss+  Mar13   0:00 /sbin/agetty -o -p -- \u --noclear - linux
root         132  0.0  0.0  15412  1920 ?        Ss   Mar13   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root         287  0.0  0.0  42652   788 ?        Ss   Mar13   0:00 /usr/lib/postfix/sbin/master -w
postfix      289  0.0  0.0  43088   896 ?        S    Mar13   0:00 qmgr -l -t unix -u
root        3367  0.0  0.0   6632  3712 pts/1    S    Mar13   0:00 -bash
postfix   519708  0.0  0.0  43052  6400 ?        S    18:09   0:00 pickup -l -t unix -u -c
root      519711  0.0  0.0   8088  4096 pts/1    R+   18:25   0:00 ps aux

ps aux again suggests that there’s barely any memory usage.
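
As a sanity check, the RSS column can be summed; if ordinary process memory were responsible, this should land somewhere near the ‘used’ figure from free (a rough sketch; RSS double-counts shared pages, so it’s only an approximation):

# sum the RSS column (KiB) of all processes and print the total in MB
ps aux --no-headers | awk '{sum+=$6} END {printf "%.0f MB\n", sum/1024}'

For the output above this adds up to well under 100 MB, nowhere near the ~7 GB that free reports.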

Another interesting one is cat /proc/meminfo:

root@lxc-plex:~/dockercomposers/plexplox# cat /proc/meminfo 
MemTotal:        8388608 kB
MemFree:          788216 kB
MemAvailable:    1087576 kB
Buffers:               0 kB
Cached:           299360 kB
SwapCached:            0 kB
Active:          6515140 kB
Inactive:         444628 kB
Active(anon):    6337184 kB
Inactive(anon):   323328 kB
Active(file):     177956 kB
Inactive(file):   121300 kB
Unevictable:           0 kB
Mlocked:          221280 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:       6660408 kB
Mapped:                0 kB
Shmem:               104 kB
KReclaimable:     787484 kB
Slab:                  0 kB
SReclaimable:          0 kB
SUnreclaim:            0 kB
KernelStack:       20336 kB
PageTables:        41200 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    16314548 kB
Committed_AS:   16080536 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      253936 kB
VmallocChunk:          0 kB
Percpu:             4448 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
Unaccepted:            0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      372040 kB
DirectMap2M:    15124480 kB
DirectMap1G:    17825792 kB

This lists about 6.5 GB for Active and 6.3 GB for Active(anon).

Even with all this information I have no clue what is using the memory inside this LXC container.
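
One thing I’d still like to check (corrections welcome if this is the wrong approach): the cgroup v2 memory accounting, since memory charged to the container’s cgroup by the kernel (slab, sockets, driver allocations) doesn’t belong to any process and therefore never shows up in ps or htop. Inside the container that should look roughly like this (paths assume cgroup v2; they may differ on other setups):

# current memory charge and limit for this cgroup
cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/memory.max

# breakdown by type: anon, file, kernel_stack, slab, sock, shmem, ...
cat /sys/fs/cgroup/memory.stat

On the Proxmox host the same files should be reachable under the container’s cgroup, somewhere like /sys/fs/cgroup/lxc/201/ (the exact path depends on the Proxmox version).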

One vague idea I have is that it might have something to do with me passing the Intel N100’s iGPU (/dev/dri) through to the container so that Plex can do hardware transcoding, but then again, this might be completely unrelated.

Here’s my LXC config /etc/pve/lxc/201.conf:

root@proxmox1:/etc/pve/lxc# cat 201.conf 
arch: amd64
cores: 4
features: nesting=1
hostname: lxc-plex
memory: 8192
mp0: /mnt/lxc_shares/Plex/,mp=/mnt/Plex,shared=1
net0: name=eth0,bridge=vmbr0,gw=10.88.20.254,hwaddr=8E:48:71:B7:12:98,ip=10.88.21.201/23,type=veth
onboot: 1
ostype: debian
rootfs: ReplicatedPool_2:vm-201-disk-0,size=100G
swap: 512
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
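
For context on the last three lines: 226 is the major number for the DRM devices, with minor 0 being card0 and minor 128 being renderD128. Inside the container that can be verified with something like this (a sketch; owners/groups will vary per system):

# the major/minor numbers should match the devices.allow lines above
ls -l /dev/dri
# expected along the lines of:
# crw-rw---- 1 root video  226,   0 ... card0
# crw-rw---- 1 root render 226, 128 ... renderD128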

I know I could just restart the container and have things resolved again, but I think I have a beautiful scenario for debugging right now. Let’s hope I can get some input here to figure out what’s going wrong.

Note: I also crossposted this issue here:
https://forum.proxmox.com/threads/lxc-container-in-proxmox-using-90-of-memory-with-all-processed-killed.143415/

And a screenshot of htop:

[screenshot: htop]

Got any tmpfs or devshm mounted with stuff on it?

Forgive me for not being that well versed in the core workings of Linux / memory management, so I’m trying to answer your question based on my own interpretation. (I’m kinda learning on the job here 🙂)

Reproducing the problem

As of making this post I have accidentally restarted the LXC container, which left it nice and empty again in terms of memory usage. I left it running for about 12 hours without Plex running inside of it, and it stayed at around 100 MB of memory (due to some other containers running).

What I noticed, though, is that once I start Plex again and start watching a movie with transcoding enabled, it seems to allocate some memory but not fully release it.

For example, when I started the Plex container the whole system was using around 1 GB of memory. When I then started a movie, everything stayed at around the same amount. When I then started transcoding the movie (using the passed-through /dev/dri hardware device), the memory usage increased by around 100-200 MB. When I stopped the transcoding session, this memory kept being used.

By doing this a few times (stopping / starting transcoding for a movie) I’m now sitting at 3 GB used:

root@lxc-plex:/run# free
               total        used        free      shared  buff/cache   available
Mem:         8388608     3178884     3468652      390760     1741072     5209724
Swap:              0           0           0

I then stopped the Plex container again and ran free again:

root@lxc-plex:~/dockercomposers/plexplox# docker compose down
[+] Running 3/3
 ✔ Container plex            Removed    7.8s
 ✔ Container tautulli        Removed    2.8s
 ✔ Network plexplox_default  Removed    0.5s
root@lxc-plex:~/dockercomposers/plexplox# free   
               total        used        free      shared  buff/cache   available
Mem:         8388608     2998296     4232132         112     1158180     5390312
Swap:              0           0           0

As you can see, we seem to be running into the same problem again as before (only 3 GB used instead of 8, because the container has only been running for an hour or so).
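
To quantify this a bit better, my plan is to log the memory figures while repeating the transcode start/stop cycle, with a simple loop along these lines (a sketch; the interval is arbitrary):

# log used memory (MB) once a minute while starting/stopping transcodes
while true; do
    echo "$(date '+%H:%M:%S')  $(free -m | awk 'NR==2 {print $3}') MB used"
    sleep 60
done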

Back to the question

Anyway, with the problem reproduced I’d now like to go back to your question.

If I run df -h inside the container I see that I have 3 tmpfs filesystems mounted:

root@lxc-plex:~/dockercomposers/plexplox# df -h
Filesystem             Size  Used Avail Use% Mounted on
/dev/rbd0               99G   44G   50G  47% /
//************/PlexPlox   63T   41T   23T  65% /mnt/PlexPlox
none                   492K  4.0K  488K   1% /dev
udev                    16G     0   16G   0% /dev/dri
tmpfs                   16G     0   16G   0% /dev/shm
tmpfs                  6.3G  108K  6.3G   1% /run
tmpfs                  5.0M     0  5.0M   0% /run/lock

I don’t see significant usage here on tmpfs or shm mounts. So I don’t expect that to be the problem.

Some thoughts about the issue seeming to happen when transcoding

Currently I’m using an N100 CPU with hardware acceleration for video transcoding. However, I have enabled SR-IOV for this CPU so that I could also map virtual GPUs to VMs (I’m not using this at the moment).
As you can see in the LXC config, I map all devices in /dev/dri to the LXC container. I then mount /dev/dri/renderD128 into the Plex container.
The guide I followed for that is here:

Could it be that the SR-IOV implementation somehow messes something up and causes this problem?
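
One way to at least confirm that the patched SR-IOV i915 module is the one actually loaded is something like this (a sketch; module and package names depend on how the DKMS guide was followed), run on the Proxmox host, not inside the container:

lsmod | grep i915                               # is the i915 driver loaded at all?
modinfo i915 | grep -i -E 'filename|version'    # which i915 build is in use?
dkms status                                     # is a DKMS-built i915 variant installed?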

I would like to try turning off SR-IOV for now, but I’m not sure how to “uninstall” / “disable” it. I’ve already asked on their GitHub to see if I can get some help there:

(So if by any chance you know how to do that, that would be helpful as well so I can continue my investigation.)
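
In case it helps anyone searching for the same thing, the generic way to remove a DKMS module seems to be roughly this (a sketch; the module name and version come from dkms status, and any kernel command-line options the guide added for enabling the VFs would need to be reverted separately):

# on the Proxmox host: find the module name/version, then remove it
dkms status
dkms remove <module>/<version> --all

# rebuild the initramfs so the stock i915 module is used again, then reboot
update-initramfs -u
reboot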

I did another test. My Plex LXC (without the Plex Docker container running) was using 3.7 GB.

When I monitored the memory usage on the host, it was sitting at around 17.3 GB.

I then completely shut down the LXC container.

After that, the memory usage on the host still remained at around 17.3 GB.
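
Since no process owns this memory on the host either, my next step is to look at the host’s kernel-side counters to see whether it’s sitting in slab or other unreclaimable kernel memory (a sketch, run on the Proxmox host):

# kernel memory counters on the host
grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo

# top slab caches, printed once
slabtop -o | head -n 20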

I found an update on the SR-IOV GitHub showing that more people are running into an issue with memory not being released:

I’m going to investigate if disabling the DKMS module solves the issue for me too.

My issue was solved by uninstalling the SR-IOV DKMS driver for the N100 CPU.
(It seems to have a memory leak on kernel 6.5; see the GitHub post above.)