Strange size and behavior with tmpfs

I see strange behavior with tmpfs size.

ram="2GB"
lxc config set ${ct} limits.memory ${ram}
lxc exec ${ct} bash

# free
              total        used        free      shared  buff/cache   available
Mem:        1953124      549988     1352468         128       50668     1403136
Swap:        723776           0      723776

cat << EOF >> /etc/fstab
tmpfs /tmp tmpfs rw,relatime 0 0
tmpfs /var/tmp tmpfs rw,relatime 0 0
EOF
mount -a

df -h
Filesystem                Size  Used Avail Use% Mounted on
...
tmpfs                      32G  4.0K   32G   1% /tmp
tmpfs                      32G     0   32G   0% /var/tmp

Note that 32G is half of the server’s memory, not the container’s memory!

cd /tmp && head -c 10G </dev/urandom >myfile
Killed

And the whole container is killed after that, so it’s possible to make a DoS attack on containers that have /tmp on tmpfs.

I have the latest LXD 3.18 from the snap.

Is this a bug? Should I create an issue on GitHub?

Is it possible to fix this somehow?

I don’t expect this to be fixable, other than by specifying the size of your tmpfs in /etc/fstab.

Basically, the kernel doesn’t take the cgroup memory restrictions into account here. They do apply and will kick in if you go over them, but anything that sizes itself based on the total amount of system memory will usually look at the actual amount, not the cgroup limit.
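
For example (just a sketch, the size=256M cap is an arbitrary value for illustration, pick whatever fits your workload), adding a size= option to the container’s /etc/fstab avoids the half-of-host-RAM default:

cat << EOF >> /etc/fstab
tmpfs /tmp tmpfs rw,relatime,size=256M 0 0
tmpfs /var/tmp tmpfs rw,relatime,size=256M 0 0
EOF
mount -o remount,size=256M /tmp
mount -o remount,size=256M /var/tmp

df -h /tmp should then report the 256M cap instead of 32G, and a runaway writer hits ENOSPC from the filesystem instead of dragging the whole cgroup into an OOM kill.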

@brauner

There’s no DoS. You can only mount tmpfs as userns root, and the OOM killer takes you down once you’ve gone over your 2GB limit, which is perfectly fine. Any privileged process on the host can do the same thing.
If you look at dmesg you’ll see:

[54192.485604] head invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[54192.485607] CPU: 3 PID: 14963 Comm: head Tainted: P     U     O      5.3.0-22-lowlatency #24-Ubuntu
[54192.485607] Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET65W (1.40 ) 07/02/2019
[54192.485608] Call Trace:
[54192.485614]  dump_stack+0x63/0x8a
[54192.485616]  dump_header+0x4f/0x200
[54192.485617]  oom_kill_process.cold+0xb/0x10
[54192.485618]  out_of_memory.part.0+0x1df/0x3e0
[54192.485619]  out_of_memory+0x6d/0xd0
[54192.485621]  mem_cgroup_out_of_memory+0xbd/0xe0
[54192.485623]  try_charge+0x794/0x810
[54192.485624]  mem_cgroup_try_charge+0x71/0x1e0
[54192.485625]  mem_cgroup_try_charge_delay+0x22/0x50
[54192.485627]  shmem_getpage_gfp+0x1d7/0x940
[54192.485628]  ? __switch_to_asm+0x34/0x70
[54192.485629]  ? __switch_to_asm+0x40/0x70
[54192.485631]  shmem_write_begin+0x39/0x60
[54192.485633]  generic_perform_write+0xba/0x1c0
[54192.485634]  ? file_update_time+0x62/0x140
[54192.485635]  __generic_file_write_iter+0x107/0x1d0
[54192.485636]  generic_file_write_iter+0xb8/0x150
[54192.485638]  new_sync_write+0x125/0x1c0
[54192.485639]  __vfs_write+0x29/0x40
[54192.485640]  vfs_write+0xb9/0x1a0
[54192.485642]  ksys_write+0x67/0xe0
[54192.485643]  __x64_sys_write+0x1a/0x20
[54192.485645]  do_syscall_64+0x5a/0x130
[54192.485646]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54192.485647] RIP: 0033:0x7fa752803154
[54192.485649] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 b1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[54192.485650] RSP: 002b:00007ffc7fb56918 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[54192.485651] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fa752803154
[54192.485651] RDX: 0000000000001000 RSI: 00007ffc7fb57a00 RDI: 0000000000000001
[54192.485652] RBP: 00007ffc7fb57a00 R08: 0000000000001000 R09: 0000000000000000
[54192.485652] R10: 000055c17b420010 R11: 0000000000000246 R12: 00007fa752adf760
[54192.485652] R13: 0000000000001000 R14: 00007fa752ada760 R15: 0000000000001000
[54192.485654] memory: usage 1953124kB, limit 1953124kB, failcnt 109730
[54192.485655] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[54192.485655] kmem: usage 17948kB, limit 9007199254740988kB, failcnt 0
[54192.485655] Memory cgroup stats for /lxc.payload/b1:
[54192.485663] anon 17252352
               file 1964216320
               kernel_stack 368640
               slab 14635008
               sock 0
               shmem 1964556288
               file_mapped 0
               file_dirty 0
               file_writeback 478089216
               anon_thp 0
               inactive_anon 1282924544
               active_anon 698867712
               inactive_file 0
               active_file 0
               unevictable 0
               slab_reclaimable 9498624
               slab_unreclaimable 5136384
               pgfault 27390
               pgmajfault 0
               workingset_refault 924
               workingset_activate 0
               workingset_nodereclaim 0
               pgrefill 6023
               pgscan 8368927
               pgsteal 326568
               pgactivate 306075
               pgdeactivate 5283
               pglazyfree 0
               pglazyfreed 0
               thp_fault_alloc 0
               thp_collapse_alloc 0
[54192.485664] Tasks state (memory values in pages):
[54192.485664] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[54192.485666] [  14709] 100000 14709    56155     1721   200704       28             0 systemd
[54192.485667] [  14778] 100000 14778    19611     2051   180224        0             0 systemd-journal
[54192.485668] [  14782] 100000 14782    10525      781   110592        0             0 systemd-udevd
[54192.485669] [  14797] 100100 14797    20010     1068   180224        0             0 systemd-network
[54192.485670] [  14817] 100101 14817    17656     1055   172032        0             0 systemd-resolve
[54192.485671] [  14820] 100000 14820    17618     1198   172032        0             0 systemd-logind
[54192.485672] [  14821] 100103 14821    12481      819   147456        0             0 dbus-daemon
[54192.485673] [  14822] 100102 14822    48349      835   147456        0             0 rsyslogd
[54192.485674] [  14823] 100000 14823     7822      704   106496        0             0 cron
[54192.485675] [  14824] 100000 14824    42586     2767   229376        0             0 networkd-dispat
[54192.485676] [  14860] 100000 14860    16191      837   172032        0             0 su
[54192.485677] [  14866] 100000 14866     5367      670    94208        0             0 bash
[54192.485678] [  14963] 100000 14963     1870      388    57344        0             0 head
[54192.485679] [  14861] 100000 14861    19129     1498   192512        0             0 systemd
[54192.485680] [  14862] 100000 14862    27209      671   241664       22             0 (sd-pam)
[54192.485681] [  14827] 100000 14827     3988      526    81920        0             0 agetty
[54192.485682] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=b1,mems_allowed=0,oom_memcg=/lxc.payload/b1,task_memcg=/lxc.payload/b1,task=networkd-dispat,pid=14824,uid=100000
[54192.485693] Memory cgroup out of memory: Killed process 14824 (networkd-dispat) total-vm:170344kB, anon-rss:7704kB, file-rss:3364kB, shmem-rss:0kB
[54192.486149] oom_reaper: reaped process 14824 (networkd-dispat), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
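
For reference, the “memory: usage 1953124kB, limit 1953124kB” line above is the cgroup limit, i.e. the 2GB set via limits.memory. Assuming the legacy (v1) memory controller, as the cgroup paths later in this thread suggest, it can be read from inside the container with:

cat /sys/fs/cgroup/memory/memory.limit_in_bytes

which should print roughly 2 * 10^9 bytes (1953124kB), matching the dmesg line.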

Is this behavior (killing the container via OOM) really the right way?

I would expect the offending process to be killed, not the whole container.

As far as I remember, that’s how it works on plain Linux and in OpenVZ.

The whole container doesn’t get killed; it’s just a normal OOM killer run restricted to the process list inside the cgroup.

The OOM killer has never guaranteed that it will kill the process which just went over the memory limit; instead it uses a scoring system to pick which process to kill. When a memory allocation fails because memory has run out, the process with the highest score gets killed. If that’s not enough, the process with the next highest score gets killed, and so on.
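
For what it’s worth, that score is exposed per process through procfs and can be biased, so something like the sketch below (PID 1234 and the -900 value are made up for illustration) lets you steer the killer away from a process you care about:

# badness score the OOM killer would currently assign to PID 1234
cat /proc/1234/oom_score

# bias the score: -1000 exempts the process entirely, values up to +1000
# make it a more attractive victim
echo -900 > /proc/1234/oom_score_adj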

OK then, is there a way to auto-restart a container in case it was killed by the OOM killer?

Interesting, that’s the default behaviour of tmpfs, so it seems the kernel is handling this directly - and by default the kernel has no knowledge of containers.

I wonder if it would be possible to intercept this behaviour, maybe like the shm limitation in Docker (see for example the Docker Engine API v1.40 Reference, shmsize parameter).
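
For comparison, that Docker knob also surfaces on the CLI as --shm-size; the command below is just an illustration (image and the 64m value picked arbitrarily) of how it caps only the container’s /dev/shm tmpfs, which is the same idea as the size= mount option discussed above:

docker run --rm --shm-size=64m ubuntu:18.04 df -h /dev/shm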

That’s partially right, but it misses cgroups.
lxc exec attaches to the container and - ignoring cgroup2 specialties, which do not apply here - will usually move itself into init’s cgroup, i.e. it attaches to the same cgroup that init is in, since init (systemd) does not move itself into a separate cgroup in the memory hierarchy:

root@b1:~# grep memory /proc/self/cgroup
9:memory:/
root@b1:~# grep memory /proc/1/cgroup
9:memory:/

So the OOM killer sees one big fat cgroup and starts killing off tasks in that cgroup. But it will usually not kill just a single task; it will kill multiple. It starts with the fattest one, which should be the one that went over the memory limit, but it then immediately finds the next fattest process, which will usually be some systemd-<daemon>, and then another one, so it is almost guaranteed to hit systemd itself at some point, taking down the container.