High swap usage without obvious reason

Hello,

I’m seeing high swap usage, and it is not clear why this is happening. :thinking:

The host has 192 GB of RAM.
We currently run 18 containers on it.
Almost all of the containers are configured with limits.memory and limits.memory.swap.
We also run 4 containers without any memory or swap limits.
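For reference, a typical per-container configuration looks roughly like this (the container name and values below are placeholders, not our real ones):

# incus config set mycontainer limits.memory 16GiB
# incus config set mycontainer limits.memory.swap true
# incus config show mycontainer | grep limits.memory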

On host:

# free -g
               total        used        free      shared  buff/cache   available
Mem:             187          71          45           4          76         116
Swap:            103          57          46

For reference, here is the same output after dropping all caches:

# sync; echo 3 > /proc/sys/vm/drop_caches; free -g
               total        used        free      shared  buff/cache   available
Mem:             187          71         114           5           8         116
Swap:            103          57          46

The host is booted with the following related sysctl settings:

vm.swappiness=1
vm.vfs_cache_pressure=60
vm.max_map_count=262144
vm.overcommit_memory = 1
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15

Please note that swappiness is set to 1 on purpose, to avoid swapping as much as possible.
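The equivalent sysctl.d drop-in would look roughly like this (the file name is just illustrative), reloaded with sysctl --system:

# cat /etc/sysctl.d/99-vm-tuning.conf
vm.swappiness = 1
vm.vfs_cache_pressure = 60
vm.max_map_count = 262144
vm.overcommit_memory = 1
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15
# sysctl --system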

Here are all the effective sysctl parameters in the “vm.*” namespace:

# sysctl -a|grep 'vm.'
vm.admin_reserve_kbytes = 8192
vm.compact_unevictable_allowed = 1
vm.compaction_proactiveness = 20
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 5
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 15
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
vm.extfrag_threshold = 500
vm.hugetlb_optimize_vmemmap = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256   256     32      0       0
vm.max_map_count = 262144
vm.memfd_noexec = 0
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 67584
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 65536
vm.mmap_rnd_bits = 28
vm.mmap_rnd_compat_bits = 8
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.numa_stat = 1
vm.numa_zonelist_order = Node
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 1
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.page_lock_unfairness = 5
vm.panic_on_oom = 0
vm.percpu_pagelist_high_fraction = 0
vm.stat_interval = 1
vm.swappiness = 1
vm.unprivileged_userfaultfd = 0
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 60
vm.watermark_boost_factor = 15000
vm.watermark_scale_factor = 10
vm.zone_reclaim_mode = 0

And here is the free -g output from inside each running container:

# incus list -c ns --format csv | grep RUNNING | cut -d',' -f1| xargs -I {} sh -c "echo -> {}:; incus exec {} -- free -g"
              total        used        free      shared  buff/cache   available
Mem:            187           0         187           0           0         187
Swap:           103           0         103
              total        used        free      shared  buff/cache   available
Mem:            187           8         179           0           0         179
Swap:           103          12          91
               total        used        free      shared  buff/cache   available
Mem:             187           2         185           0           0         185
Swap:            103           1         102
               total        used        free      shared  buff/cache   available
Mem:             187          10         177           0           0         177
Swap:            103           9          94
              total        used        free      shared  buff/cache   available
Mem:             16           5           7           1           2           8
Swap:            16           2          13
              total        used        free      shared  buff/cache   available
Mem:             12           3           8           0           0           8
Swap:             8           1           6
              total        used        free      shared  buff/cache   available
Mem:             32           4          26           0           0          26
Swap:             8           6           1
              total        used        free      shared  buff/cache   available
Mem:              4           0           3           0           0           3
Swap:             4           0           3
              total        used        free      shared  buff/cache   available
Mem:             10           2           5           0           1           6
Swap:             8           2           5
              total        used        free      shared  buff/cache   available
Mem:             16           4          10           0           0          10
Swap:             8           2           5
               total        used        free      shared  buff/cache   available
Mem:               3           0           2           0           0           2
Swap:              2           0           2
               total        used        free      shared  buff/cache   available
Mem:               4           2           1           0           0           1
Swap:              8           2           5
               total        used        free      shared  buff/cache   available
Mem:               8           1           6           0           0           6
Swap:              8           1           6
              total        used        free      shared  buff/cache   available
Mem:              8           0           7           0           0           7
Swap:             4           0           3
              total        used        free      shared  buff/cache   available
Mem:              8           1           4           0           1           5
Swap:             8           0           7
              total        used        free      shared  buff/cache   available
Mem:             16           4          10           0           1          11
Swap:             8           2           5
              total        used        free      shared  buff/cache   available
Mem:             16           0          14           0           0          15
Swap:             8           1           6
              total        used        free      shared  buff/cache   available
Mem:             16          10           4           0           0           5
Swap:            16           6           9

This is Incus 6.1 on Fedora 40 (kernel 6.8.9-300.fc40.x86_64).

Any ideas?

Thank you.

To clarify:
It seems that vm.swappiness=1, which is already set on the host, is not honored inside the containers.

What I want to achieve with vm.swappiness=1 is for swap to be used only when RAM is not available.
But the containers seem to swap much more than that.
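As a quick sanity check (the container name below is just an example), reading the sysctl from inside a container should simply return the host value, since it is global and not namespaced:

# cat /proc/sys/vm/swappiness
# incus exec mycontainer -- cat /proc/sys/vm/swappiness

Both should print 1 here; the question is why that value doesn’t seem to govern how aggressively the containers’ pages get swapped.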

Thank you.

Back under cgroup1 we could set swappiness directly on a per-cgroup basis, which made this nice and easy, but cgroup2 has lost that ability, so we have far less control than we once did.

/proc/sys/vm/swappiness is a global sysctl so we can’t tweak it on a per-container basis.
That said, it’s certainly odd that the host value doesn’t have the desired impact as the memory allocator should definitely be respecting it.

With cgroup2 your main option is to play with limits.memory.swap, which you could set to false to effectively disable swap allocations for containers. That won’t technically prevent the kernel from swapping in an emergency situation, but the allocator will no longer see a big chunk of swap available on top of the container’s main memory.
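For example, something along these lines per container (the name is a placeholder):

# incus config set mycontainer limits.memory.swap false
# incus config get mycontainer limits.memory.swap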

Thank you so much for your prompt reply once again @stgraber.

In your experience, should I first give vm.swappiness=0 a try on the host (currently 1), before trying to turn off swap inside the containers?

I don’t want containers to starve, so I am wondering which of the two seems “safest” for the case where a container has exhausted its limits.memory and needs some more.

You can certainly try with swappiness=0 but given the available RAM you have, swappiness=1 should have done the trick…

After turning swap off and back on (to empty it), I tried both swappiness=0 on the host and setting limits.memory.swap to “false” on a demanding container.
In both scenarios, swap usage on the host remained 0.
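Concretely, the sequence was roughly this (the container name is a placeholder): empty swap first, then try each option in turn.

# swapoff -a && swapon -a
# sysctl -w vm.swappiness=0
# incus config set mycontainer limits.memory.swap false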

But the demanding container froze badly after a memory-demanding task running inside it reached the configured memory limit, and the host load average went very high when that happened.
It seems that swap was not kicking in to provide some extra memory.

I was expecting swap to kick in and save the day, but this was not the case.
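In case it is useful, this is a rough sketch of how I could double-check whether the container is thrashing against its limit rather than being OOM-killed (assuming cgroup2 with the memory controller and PSI enabled; the container name is a placeholder):

# incus exec mycontainer -- cat /sys/fs/cgroup/memory.events
# incus exec mycontainer -- cat /sys/fs/cgroup/memory.pressure

A growing “max” counter without “oom_kill” events, together with high memory pressure, would match the freeze I am seeing.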

@stgraber what is the reasoning behind such a regression from cgroup v1 to v2? Is this planned to be fixed in a future version, or is it permanently broken by design?