Strange continuous read I/O bursts when a container is "low" on memory

Hi,

I have seen continuous read I/O bursts on containers, which then become unavailable.
The only way to fix it is to restart the container.

I have been able to reproduce it on an AWS Ubuntu Jammy instance, with an Ubuntu Jammy container running clamd and 1GB of container memory.

I have seen it with all of:
lxd 5.0.0 / kernel 5.15.0-1011-aws
lxd 5.1 / kernel 5.15.0-1005-aws
lxd 5.2 / kernel 5.15.0-1011-aws
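
Roughly, the reproduction setup is just this (the container name is only what I used for testing):

    lxc launch ubuntu:jammy clamav-test
    lxc config set clamav-test limits.memory 1GB
    lxc exec clamav-test -- sh -c "apt update && apt install --yes clamav-daemon"

With that in place, the continuous read I/O shows up after the container has been running for a while.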

Can anyone confirm this is a bug?

Regards,

Justin

It’s not a bug, it’s normal Linux behavior.

When you run out of memory, even in a container, the kernel runs out of VFS cache space.
So instead of being able to hold your open files' content in cache memory, it has to re-fetch the data over and over again.

Technically, Linux is doing what you asked it to do: it is not exceeding the memory limit and is trying not to trigger the OOM killer, but this comes at the cost of having no cache space for data, so it constantly re-reads it from disk.
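
You can usually see this from both sides: under that kind of pressure, buff/cache inside the container hovers near zero and the cgroup reclaim counters keep climbing. Something like this (the cgroup path assumes a cgroup2 host and a container named c1, adjust for your setup):

    # inside the container: buff/cache stays close to zero
    lxc exec c1 -- free -m

    # on the host: page cache size and reclaim activity for that container's cgroup
    grep -E '^(file|pgscan|pgsteal)' /sys/fs/cgroup/lxc.payload.c1/memory.stat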

@stgraber thank you very much for your reply.

As I mentioned in the GitHub issue, running out of VFS cache would certainly explain part of what I am seeing.

But some things make me wonder:

  • The container is doing nothing. It only has the initial system processes and clamd running. No virus scanning is taking place. So what are these processes reading at maximum throughput (128MB/s) without ever stopping? (init/@dbus-daemon/clamd/systemd-hostnamed)

  • If memory is so critically low that the container becomes unavailable, why is the OOM killer not intervening?

  • I don’t see swap being used.

  • Previously, when all my containers still shared a single disk, the whole server would become unavailable whenever this issue occurred.

  • After 10 years I migrated away from OpenVZ to LXD, and I have never seen containers in OpenVZ be this self-destructive (no pun intended).

If it is, as you say, by design, what can I do to either make the OOM killer more aggressive or to improve the behavior/impact in general?
Sure, I could remove all memory limits on my containers, but I think I would then just be covering up the issue.
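
For the record, what I mean is something like this (clamav-test is my test container; the limits.memory.enforce=soft option is just an idea from the docs that I haven't verified helps here):

    # drop the limit entirely, which I'd rather not do:
    lxc config unset clamav-test limits.memory

    # or keep the limit but allow exceeding it when the host has spare memory:
    lxc config set clamav-test limits.memory.enforce soft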

Regards, Justin

The OOM killer will only kick in if flushing all caches/buffers doesn’t yield enough memory to serve the allocation.
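
You can check whether the limit was ever actually hit by looking at the memory events for the container's cgroup on the host, for example (cgroup2 path, with c1 as the container name):

    cat /sys/fs/cgroup/lxc.payload.c1/memory.events
    # "max" counts how often the limit was hit; "oom" and "oom_kill" stay at 0
    # as long as reclaim always managed to free enough memory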

You may want to strace the different processes to see if something odd is going on.
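
For example, something along these lines from inside the instance, with clamd being the obvious first candidate:

    lxc exec c1 -- bash
    apt update && apt install --yes strace
    strace -f -tt -e trace=openat,read,pread64 -p "$(pidof clamd)"

That should show which files are being read in a loop.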

Another interesting user of memory which isn’t super visible is tmpfs, so you may want to check that you don’t have a near-full tmpfs in that instance too.
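
A quick way to check, assuming the same c1 container:

    lxc exec c1 -- df -h -t tmpfs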