* Distribution: Ubuntu
* Distribution version: jammy-22.04
* The output of
  * `lxc --version`: 5.1
  * `lxc-checkconfig`: not included
  * `uname -a`: Linux ip-10-11-21-54 5.15.0-1005-aws #7-Ubuntu SMP Wed Apr 20 03:44:13 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  * `cat /proc/self/cgroup`:
13:devices:/user.slice
12:hugetlb:/
11:misc:/
10:cpuset:/
9:freezer:/
8:memory:/user.slice/user-1000.slice/session-23.scope
7:perf_event:/
6:cpu,cpuacct:/user.slice
5:rdma:/
4:net_cls,net_prio:/
3:pids:/user.slice/user-1000.slice/session-23.scope
2:blkio:/user.slice
1:name=systemd:/user.slice/user-1000.slice/session-23.scope
0::/user.slice/user-1000.slice/session-23.scope
  * `cat /proc/1/mounts`:
/dev/root / ext4 rw,relatime,discard,errors=remount-ro 0 0
devtmpfs /dev devtmpfs rw,relatime,size=1894916k,nr_inodes=473729,mode=755,inode64 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,inode64 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,size=760556k,nr_inodes=819200,mode=755,inode64 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,inode64 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,inode64 0 0
cgroup2 /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset,clone_children 0 0
cgroup /sys/fs/cgroup/misc cgroup rw,nosuid,nodev,noexec,relatime,misc 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=14362 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
tracefs /sys/kernel/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0
configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0
none /run/credentials/systemd-sysusers.service ramfs ro,nosuid,nodev,noexec,relatime,mode=700 0 0
/dev/loop1 /snap/core20/1434 squashfs ro,nodev,relatime,errors=continue 0 0
/dev/loop0 /snap/amazon-ssm-agent/5163 squashfs ro,nodev,relatime,errors=continue 0 0
/dev/loop2 /snap/core18/2344 squashfs ro,nodev,relatime,errors=continue 0 0
/dev/loop4 /snap/snapd/15534 squashfs ro,nodev,relatime,errors=continue 0 0
/dev/nvme0n1p15 /boot/efi vfat rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 0
tmpfs /run/snapd/ns tmpfs rw,nosuid,nodev,size=760556k,nr_inodes=819200,mode=755,inode64 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0
/dev/loop5 /snap/snapd/15904 squashfs ro,nodev,relatime,errors=continue 0 0
/dev/loop6 /snap/core18/2409 squashfs ro,nodev,relatime,errors=continue 0 0
/dev/loop7 /snap/amazon-ssm-agent/5656 squashfs ro,nodev,relatime,errors=continue 0 0
tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=380276k,nr_inodes=95069,mode=700,uid=1000,gid=1000,inode64 0 0
/dev/loop3 /snap/lxd/23037 squashfs ro,nodev,relatime,errors=continue 0 0
nsfs /run/snapd/ns/lxd.mnt nsfs rw 0 0
tmpfs /var/snap/lxd/common/ns tmpfs rw,relatime,size=1024k,mode=700,inode64 0 0
nsfs /var/snap/lxd/common/ns/shmounts nsfs rw 0 0
nsfs /var/snap/lxd/common/ns/mntns nsfs rw 0 0
# Issue description
In production we regularly see containers become unavailable when their memory limit is reached.
Most of the time this is preceded by an OOM kill, after which most, if not all, processes in the container start reading from disk at full capacity. As far as we can tell, the container's disk hits its maximum throughput and stays there until the container is stopped (we once left it in this state for hours).
I can imagine a read peak and higher read throughput when there is almost no memory left for the disk cache, but the behavior we see does not look expected or desirable.
Luckily we use dedicated disks for each container; otherwise it would probably take out the whole instance and all the containers on it.
An important side note: we also see this happen on containers that are neither under load nor publicly reachable, so there is nothing that would explain a constant need to read at maximum capacity. We also see processes reading at maximum capacity from which you would not expect many reads at all (init, nscd, crond, etc.).
In production we run LXD/LXC on Amazon Linux, so I decided to reproduce the issue on a dedicated, default Ubuntu instance on AWS to isolate the problem and rule out as many of our more exotic choices (OS/kernel/settings) as possible.
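For completeness, a minimal way to confirm both symptoms on the host once a container is in this state; the PID is just an example (here, the clamd PID from the iotop output further down):

```
# the OOM kill shows up in the host kernel log
dmesg -T | grep -i 'killed process'

# sample the read throughput of one process over 5 seconds via /proc/<pid>/io
pid=6649   # example PID, taken from the iotop output below
a=$(awk '/^read_bytes/ {print $2}' /proc/$pid/io)
sleep 5
b=$(awk '/^read_bytes/ {print $2}' /proc/$pid/io)
echo "$(( (b - a) / 5 / 1024 / 1024 )) MiB/s read by PID $pid"
```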
# Steps to reproduce
1. Start an Ubuntu instance based on Ubuntu jammy-22.04-amd64 (ami-07bd2fc45c8a8dd48 in eu-west-1)
As I could not start an Amazon Linux container otherwise
(Error: The image used by this instance requires a CGroupV1 host system)
I had to switch the host back to cgroup v1 and reboot:
In /etc/default/grub set:
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=false"
then run:
update-grub2
reboot
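A quick sanity check that the host really came back up on the hybrid (cgroup v1) hierarchy:

```
grep -o 'systemd.unified_cgroup_hierarchy=false' /proc/cmdline
stat -fc %T /sys/fs/cgroup   # prints "tmpfs" on a hybrid/v1 host, "cgroup2fs" on a pure v2 host
```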
2. Run `lxd init`:
Would you like to use LXD clustering? (yes/no) [default=no]:
Do you want to configure a new storage pool? (yes/no) [default=yes]: no
Would you like to connect to a MAAS server? (yes/no) [default=no]:
Would you like to create a new local network bridge? (yes/no) [default=yes]:
What should the new bridge be called? [default=lxdbr0]:
What IPv4 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]: 10.11.21.121/24
Would you like LXD to NAT IPv4 traffic on your bridge? [default=yes]:
What IPv6 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]: none
Would you like the LXD server to be available over the network? (yes/no) [default=no]: no
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]: no
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]: yes
config:
  images.auto_update_interval: "0"
networks:
- config:
    ipv4.address: 10.11.21.121/24
    ipv4.nat: "true"
    ipv6.address: none
  description: ""
  name: lxdbr0
  type: ""
  project: default
storage_pools: []
profiles:
- config: {}
  description: ""
  devices:
    eth0:
      name: eth0
      network: lxdbr0
      type: nic
  name: default
projects: []
cluster: null
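(For anyone reproducing this: the preseed printed above can be replayed on a fresh host to skip the interactive questions, e.g. after saving it as lxd-preseed.yaml.)

```
lxd init --preseed < lxd-preseed.yaml
```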
3. Attach a dedicated swap disk and a dedicated container disk to the instance
4. Enable swap and create the storage pool:
mkswap /dev/nvme1n1 -L swap
echo "LABEL=swap none swap defaults,nofail 0 0" >> /etc/fstab
swapon -a
lxc storage create burst1 lvm source=/dev/nvme2n1
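Sanity checks for this step (the device names are specific to this instance):

```
swapon --show            # the dedicated swap volume should be listed
lxc storage show burst1  # LVM pool backed by /dev/nvme2n1
```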
5. Create an Amazon Linux container:
lxc launch -s burst1 images:amazonlinux burst1
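To confirm the container actually landed on the dedicated pool:

```
lxc list burst1
lxc storage volume list burst1   # the container volume should show up on the LVM pool
```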
6. Install clamd in the container:
lxc exec -t burst1 -- sh -c "yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm"
lxc exec -t burst1 -- sh -c "yum install -y iputils procps-ng clamd clamav-server clamav-data clamav-update clamav-filesystem clamav clamav-scanner-systemd clamav-devel clamav-lib clamav-server-systemd"
lxc exec -t burst1 -- sh -c "sed -i s/^#LocalSocket/LocalSocket/g /etc/clamd.d/scan.conf"
lxc exec -t burst1 -- sh -c "systemctl enable clamd@scan"
lxc exec -t burst1 -- sh -c "systemctl start clamd@scan"
7. Limit the memory to 1GB and restart the container:
lxc stop burst1
lxc config set burst1 limits.memory 1GB
lxc start burst1
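Assuming the v1 memory controller is mounted inside the container (it is with the grub change from step 1), the applied limit can be verified with:

```
lxc exec burst1 -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # roughly 10^9 bytes (1GB)
lxc exec burst1 -- cat /sys/fs/cgroup/memory/memory.failcnt          # starts climbing once the limit is hit
```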
8. Use top, iotop and the AWS volume metrics to observe the high CPU and read I/O usage:
iotop:
Total DISK READ: 128.17 M/s | Total DISK WRITE: 0.00 B/s
Current DISK READ: 128.17 M/s | Current DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
6244 be/4 1000000 42.76 M/s 0.00 B/s ?unavailable? init
6648 be/4 1000000 42.65 M/s 0.00 B/s ?unavailable? systemd-hostnamed
6649 be/4 1000998 42.76 M/s 0.00 B/s ?unavailable? clamd -c /etc/clamd.d/scan.conf
Total DISK READ: 128.25 M/s | Total DISK WRITE: 0.00 B/s
Current DISK READ: 128.25 M/s | Current DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
6244 be/4 1000000 32.37 M/s 0.00 B/s ?unavailable? init
6639 be/4 1000081 32.33 M/s 0.00 B/s ?unavailable? dbus-daemon --system --add~idfile --systemd-activation
6642 be/4 1000000 32.61 M/s 0.00 B/s ?unavailable? crond -n
6649 be/4 1000998 30.94 M/s 0.00 B/s ?unavailable? clamd -c /etc/clamd.d/scan.conf
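The iotop samples above can be captured non-interactively with something like:

```
sudo iotop -o -d 5                    # interactive, only show processes doing I/O
sudo iotop -obtq -d 5 >> iotop.log    # batch mode with timestamps, handy for longer captures
```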
# Information to attach
Find dmesg, lxc.log, lxc.conf and the AWS volume read metric (read-io.png) attached:

[lxc.conf.txt](https://github.com/lxc/lxc/files/8769741/lxc.conf.txt)
[lxc.log](https://github.com/lxc/lxc/files/8769742/lxc.log)
[dmesg.txt](https://github.com/lxc/lxc/files/8769743/dmesg.txt)