Memory usage grows on debian9 containers

Hi,

We are currently running some debian10 lxd hosts (lxd is installed via snapd) with no issues.
We run more than 50 debian10 containers on them without problems (except a journald tmpfs usage issue we fixed recently).
We are using kernel 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 and lxd 4.6.

We had to create a couple of debian9 containers on these hosts and ran into an issue with both of them.
Their memory usage grows over time for no apparent reason, until there is no memory left in the container.
We are using the images:debian/9/cloud image from the official lxd image server.
We didn’t tweak anything.

The memory used is not cache or buffers. We stopped all non-system processes in the containers, but memory is still being consumed by something and we have no clue what.

On both containers the usage increases very steadily, and they both exhaust their 2 GB of memory after 19 hours of uptime.
The only processes running on them are:

# ps -eaf
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 20:44 ?        00:00:06 /sbin/init
root        47     1  0 20:44 ?        00:00:00 /lib/systemd/systemd-journald
root       155     1  0 20:44 ?        00:00:00 /sbin/dhclient -4 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases eth0
message+   211     1  0 20:44 ?        00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root       218     1  0 20:44 ?        00:00:00 /lib/systemd/systemd-logind
root       228     1  0 20:44 console  00:00:00 /sbin/agetty --noclear --keep-baud console 115200,38400,9600 linux
root       237     1  0 20:44 ?        00:00:00 /usr/sbin/sshd -D
Debian-+   520     1  0 20:44 ?        00:00:00 /usr/sbin/exim4 -bd -q30m
root       528   237  0 20:44 ?        00:00:00 sshd: root@pts/0
root       530     1  0 20:44 ?        00:00:00 /lib/systemd/systemd --user
root       531   530  0 20:44 ?        00:00:00 (sd-pam)
root       540   528  0 20:44 pts/0    00:00:00 -bash
root       978   540  0 20:53 pts/0    00:00:00 ps -eaf

See our memory usage graph below:
[image: container memory usage graph]

Do you have any idea what could be going wrong in our debian9 containers?

Which process(es) are using the RAM?
That should be determinable.

That’s exactly the point… neither top nor htop shows any process using all that RAM… If I sum up the RSS of every process, I get roughly the same amount of RAM as at container startup (almost 300 MB).
I have no clue what is actually using all this RAM.
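
For reference, the RSS sum above was done with a simple one-liner along these lines (any equivalent works):

# ps -eo rss= | awk '{s+=$1} END {printf "%d MB\n", s/1024}'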

Also, I do not see any tmpfs filesystem using more than a few megabytes (see my df -h output below):

# df -h
Filesystem                                 Size  Used Avail Use% Mounted on
/dev/vg01/containers_test  9.4G  3.4G  6.0G  37% /
none                                       492K  4.0K  488K   1% /dev
udev                                        16G     0   16G   0% /dev/tty
tmpfs                                      100K     0  100K   0% /dev/lxd
tmpfs                                      100K     0  100K   0% /dev/.lxd-mounts
tmpfs                                       16G     0   16G   0% /dev/shm
tmpfs                                       16G   24M   16G   1% /run
tmpfs                                      5.0M     0  5.0M   0% /run/lock
tmpfs                                       16G     0   16G   0% /sys/fs/cgroup
tmpfs                                      191M     0  191M   0% /run/user/0
tmpfs                                      191M     0  191M   0% /run/user/1012

I rebooted the container 4 hours ago; it started with 280 MB of RAM used and is slowly increasing. It is currently using 480 MB.
See the top output below:

top - 16:09:36 up  4:13,  1 user,  load average: 0.50, 0.60, 0.63
Tasks:  21 total,   1 running,  20 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.3 us,  0.0 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1953124 total,  1267632 free,   479292 used,   206200 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  1473832 avail Mem 

PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
1 root      20   0   58724   8104   5120 S  0.0  0.4   0:07.11 systemd
47 root      20   0   62488  11812  11320 S  0.0  0.6   0:04.52 systemd-journal
155 root      20   0   20484   2868   1824 S  0.0  0.1   0:00.00 dhclient
213 message+  20   0   45132   3744   3284 S  0.0  0.2   0:04.31 dbus-daemon
216 root      20   0   30884   2364   2080 S  0.0  0.1   0:00.02 cron
226 root      20   0  180480   4740   2300 S  0.0  0.2   0:00.46 rsyslogd
236 root      20   0   46976   4924   3916 S  0.0  0.3   0:03.59 systemd-logind
246 root      20   0   69960   6092   5320 S  0.0  0.3   0:00.65 sshd
290 ntp       20   0  102112   3788   3236 S  0.0  0.2   0:01.72 ntpd
297 root      20   0   15520   1552   1416 S  0.0  0.1   0:00.00 agetty
508 Debian-+  20   0   56152   2920   2188 S  0.0  0.1   0:00.00 exim4
546 root      20   0   56412   5272   4604 S  0.0  0.3   0:00.06 systemd
547 root      20   0   84708   1780     20 S  0.0  0.1   0:00.00 (sd-pam)

What are the lines on your graph (there is no key)?

Yes, you’re right, I cropped it a bit too much…
See the complete graph since the last reboot below:

That’s very interesting :thinking:.
I guess @tomp will be more helpful than me, but…

Are there “suspicious” activities on the host?

So you didn’t install or modify anything at all?

I assume you have limited their memory via LXD?

Have you tried running a new debian container (by which I mean a fresh build of a debian9 container)?
And have you tried the non-cloud variant?
(Both for comparison, obviously.)
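
Something along these lines should give two comparable test instances (the instance names are just examples):

# lxc launch images:debian/9 deb9-default
# lxc launch images:debian/9/cloud deb9-cloud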

I’ve been diving a bit into the topic of memory:
Someone recommended using smem to report accurate memory usage.

Quote:

smem can report proportional set size (PSS), which is a more meaningful representation of the amount of memory used by libraries and applications
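
On Debian it is a small package; something like this is enough for a quick look (-t adds a totals line, -k prints human-readable sizes, -w gives a whole-system view per area):

# apt install smem
# smem -tk
# smem -w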

I didn’t see anything suspicious in the containers.
I created 2 containers the same way on 2 different hosts (a debian10 host + a debian9 container each time) and was able to reproduce the issue in all of them.

Yes, I created the containers using lxd and have only set CPU and RAM limits via lxd profiles.
I also set up the network and some DHCP configuration using user.network-config and user.user-data.
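
For reference, the limits in those profiles are just the standard LXD keys, roughly like this (profile name and values are only illustrative):

# lxc profile set my-profile limits.cpu 2
# lxc profile set my-profile limits.memory 2GB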

I didn’t try the non-cloud variants yet.
I will spend some time on this over the next few days and will try the non-cloud variants :wink:
Also note that the 2 containers I created were monitored by shinken over SSH, so SSH connections were regularly opened in the containers.
I also tested the same configuration on virtual machines created with lxd, still on the same hosts and with the debian/9/cloud image, and the issue does not happen there at all.
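
In case someone wants to reproduce that comparison: a VM from the same image can be launched with the --vm flag (the instance name below is just an example):

# lxc launch images:debian/9/cloud deb9-vm --vm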

I just quickly tested smem on the only remaining debian9 container I have, which is also a Jenkins slave server, so it has a Java process running on it (plus some Go binaries like consul).

I will add smem to the tests I will do tomorrow and over the next few days :wink:

free is currently reporting 917 MB of RAM used out of 1.9 GB.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           1907         917         543          35         445         989
Swap:             0           0           0

smem, like all the other tools, reports far less memory used than the kernel does:

# smem
  PID User     Command                         Swap      USS      PSS      RSS 
  312 root     /sbin/agetty --noclear --ke        0      196      277     1664 
  230 root     /usr/sbin/cron -f                  0      328      448     2196 
 1095 jenkins  (sd-pam)                           0      616     1184     3108 
  575 root     (sd-pam)                           0      784     1264     3100 
 1101 jenkins  sshd: jenkins@notty                0      320     1357     4720 
  290 ntp      /usr/sbin/ntpd -p /var/run/        0     1296     1451     3044 
  232 messagebus /usr/bin/dbus-daemon --syst        0     1156     1465     3704 
  574 root     /lib/systemd/systemd --user        0      844     1739     5948 
  525 Debian-exim /usr/sbin/exim4 -bd -q30m          0     1684     1766     3100 
 1092 root     sshd: jenkins [priv]               0      212     1801     6596 
 1094 jenkins  /lib/systemd/systemd --user        0      908     1808     6080 
  255 root     /usr/sbin/sshd -D                  0      788     1830     5928 
  155 root     /sbin/dhclient -4 -v -pf /r        0     1832     1887     2940 
  249 root     /lib/systemd/systemd-logind        0     1512     2011     5336 
19531 root     -bash                              0     1904     2180     3752 
19525 root     sshd: root@pts/0                   0     1040     2390     6976 
  212 root     /usr/sbin/rsyslogd -n              0     3368     3475     5104 
    1 root     /sbin/init                         0     3524     4765     9616 
   45 root     /lib/systemd/systemd-journa        0     7148     7569    10708 
20204 root     /usr/bin/python /usr/bin/sm        0     8460     8757    10352 
  219 consul   /opt/bin/consul-template -c        0    19900    19900    19904 
  215 telegraf /usr/bin/telegraf -config /        0    57504    57559    58620 
  253 consul   /opt/bin/consul agent -conf        0    70384    70384    70388 
 1117 jenkins  java -Dfile.encoding=UTF-8         0   185356   185549   187240 

top and htop report the same values as smem.

As you can see, if we add up all the RSS values, we get 447 MB, not 917 MB.

Ok, that’s valuable information.
It suggests that it has something to do with the specific (container) image or LXC :thinking:.
But as always: correlation is not necessarily causation.
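
One thing that might be worth checking is whether the missing memory is kernel-side rather than in processes (the field names below are standard /proc/meminfo ones; inside a container that file may be virtualised by lxcfs, so checking on the host as well is probably worthwhile):

# grep -E 'MemTotal|Slab|SReclaimable|SUnreclaim|Shmem' /proc/meminfo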

It would also be interesting to see whether a different kernel version works.
Or at least a more recent 4.19 release; the current one is 4.19.152.
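
A quick way to see which kernel is actually running versus which kernel packages are installed (purely illustrative):

# uname -r
# dpkg -l 'linux-image*' | grep ^ii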

I guess the team will come back to you.

Still, thank you for your time :wink:

I did some tests today. I created 2 new containers (1 cloud + 1 default image) with minimal changes inside, to check whether I can reproduce the issue on them.

I found something interesting: on my lxd hosts (OVH servers with 32 GB of RAM) I can see a similar memory leak.
Example: I have one host with 32 GB of RAM hosting 10 containers with a 1 GB RAM limit each (none of them using their full quota) and 2 qemu virtual machines (one with 2 GB of RAM and one with 3 GB). The host reports 27 GB of RAM used, which is way more than expected, while summing the RSS of all running processes gives only 6 GB. So no debian9 is involved here at all, as every container is debian10, like the host.
I then investigated a bit and found that the kernel was not up to date because of the OVH installer. I updated it (from 4.19.118-2+deb10u1 to 4.19.152-1) and will keep this topic updated with my findings :smile:

OK, so after updating the kernel on my lxd hosts and monitoring memory usage, I can confirm that it fixed my issue. I no longer see the memory leak on my containers or my hosts.

The issue was that OVH does not install the linux-image-amd64 metapackage on their Linux installs, only a specific kernel version.
So I had to install the metapackage, and now the kernel will be kept up to date on all my hosts by apt.
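
For anyone landing here with the same kind of OVH install, the fix boils down to something like this, followed by a reboot into the new kernel:

# apt update
# apt install linux-image-amd64
# reboot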

Thank you again for the time you spent listening to me and for the useful hints that helped fix our issue :wink:
