LXD - Cannot scale over 597 containers - seccomp errors

Hey,

I am evaluating LXD for a mass-container solution and just built my first node with the upstream snap version of LXD on a bare-metal Ubuntu 18.04 LTS machine. Everything works fine, but as soon as I scale past exactly 597 containers, the next containers fail to start.

I researched different solutions for this error and already applied the optimizations from https://lxd.readthedocs.io/en/latest/production-setup/. I also raised the values far beyond what is probably sensible, just to be sure.

For the test I spawned 1000 containers from the Ubuntu 18.04 image, without a network interface, on local ZFS storage. The machine is big enough: an AMD EPYC with 64 threads and 256 GB RAM.
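For reference, the reproduction boils down to a loop like the following (a minimal sketch; the "test" name prefix is my own, and it assumes a default profile without a NIC and the ZFS pool as default storage):

# Spawn 1000 containers from the Ubuntu 18.04 image; each lands on the local ZFS pool
for i in $(seq 1 1000); do
    lxc launch ubuntu:18.04 "test-${i}"
done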

The interesting part is that upstream Proxmox VE 6, with its LXC 3.21 implementation, has the same problem; we got the same error there. It runs the same 5.3 kernel that I am using on Ubuntu 18.04 LTS, just on Debian Buster.

I can also provide SSH access if somebody needs a deeper look into the system.

I wonder which limit I am hitting here. All the relevant information is below:

sysctl.conf:
vm.max_map_count = 262144
fs.inotify.max_queued_events = 167772160
fs.inotify.max_user_instances = 167772160
fs.inotify.max_user_watches = 167772160
kernel.keys.maxkeys = 80000
kernel.dmesg_restrict = 1
kernel.pid_max = 4194304

/etc/security/limits.conf:
* soft nofile 167772160
* hard nofile 167772160
root soft nofile 167772160
root hard nofile 167772160
* soft memlock unlimited
* hard memlock unlimited

root@lxd-test ~ # lxc start wondrous-spider
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart wondrous-spider /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/wondrous-spider/lxc.conf:
Try lxc info --show-log wondrous-spider for more info

root@lxd-test ~ # lxc info --show-log wondrous-spider
Name: wondrous-spider
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/03/08 15:48 UTC
Status: Stopped
Type: container
Profiles: default

Log:
lxc wondrous-spider 20200308173416.263 WARN cgfsng - cgroups/cgfsng.c:chowmod:1525 - No such file or directory - Failed to chown(/sys/fs/cgroup/unified//lxc.payload/wondrous-spider/memory.oom.group, 1000000000, 0)
lxc wondrous-spider 20200308173416.284 ERROR utils - utils.c:lxc_setup_keyring:1856 - Disk quota exceeded - Failed to create kernel keyring
lxc wondrous-spider 20200308173416.353 ERROR seccomp - seccomp.c:lxc_seccomp_load:1252 - Unknown error 524 - Error loading the seccomp policy
lxc wondrous-spider 20200308173416.353 ERROR sync - sync.c:__sync_wait:62 - An error occurred in another process (expected sequence number 5)
lxc wondrous-spider 20200308173416.353 ERROR start - start.c:lxc_abort:1122 - Function not implemented - Failed to send SIGKILL to 438070
lxc wondrous-spider 20200308173416.353 ERROR lxccontainer - lxccontainer.c:wait_on_daemonized_start:873 - Received container state "ABORTING" instead of "RUNNING"
lxc wondrous-spider 20200308173416.356 ERROR start - start.c:__lxc_start:2039 - Failed to spawn container "wondrous-spider"

@brauner any idea?

It’s also odd that we’re seeing a keyring error despite the raised sysctl for it. Can you confirm that kernel.keys.maxkeys is properly set?

Hey,

Yes, it’s set correctly, and I also rebooted to be safe. I did try switching to the standard 4.15 kernel, which was a huge mistake: performance is a mess and the load is way too high with that many containers. I think several other things go wrong on that kernel, but let’s not go further into that, because 5.3 is the target here. The load with 5.3 was just perfect, almost always under 10, even when all containers were starting at boot.

root@lxd-test / # sysctl -a | grep maxkeys
kernel.keys.maxkeys = 80000
kernel.keys.root_maxkeys = 1000000

I think the issue isn’t the maximum number of keys but the byte size of the keyrings. Try raising /proc/sys/kernel/keys/maxbytes significantly and report back, please.
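For example (the exact values are illustrative; pick something proportional to your container count):

# Raise the per-user and root keyring byte quotas at runtime
sysctl -w kernel.keys.maxbytes=2000000000
sysctl -w kernel.keys.root_maxbytes=2000000000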

@brauner This may have been the solution. I am now at 650 working containers; I never reached that before. I will deploy more and report back.

@brauner It is not the solution after all; the 650 could just have been an aberration. I have set the values this high now, and on this run 600 was the maximum again:

root@lxd-test ~ # sysctl -a | grep keys
kernel.keys.gc_delay = 300
kernel.keys.maxbytes = 2000000000
kernel.keys.maxkeys = 1000000000
kernel.keys.persistent_keyring_expiry = 259200
kernel.keys.root_maxbytes = 2000000000
kernel.keys.root_maxkeys = 1000000000

There is a difference in the log file though: the "Disk quota exceeded" message is gone. The rest seems the same.

lxc modest-cricket 20200308232720.636 WARN cgfsng - cgroups/cgfsng.c:chowmod:1525 - No such file or directory - Failed to chown(/sys/fs/cgroup/unified//lxc.payload/modest-cricket/memory.oom.group, 1000000000, 0)
lxc modest-cricket 20200308232720.140 ERROR seccomp - seccomp.c:lxc_seccomp_load:1252 - Unknown error 524 - Error loading the seccomp policy
lxc modest-cricket 20200308232720.140 ERROR sync - sync.c:__sync_wait:62 - An error occurred in another process (expected sequence number 5)
lxc modest-cricket 20200308232720.140 ERROR start - start.c:lxc_abort:1122 - Function not implemented - Failed to send SIGKILL to 699743
lxc modest-cricket 20200308232720.140 ERROR lxccontainer - lxccontainer.c:wait_on_daemonized_start:873 - Received container state "ABORTING" instead of "RUNNING"
lxc modest-cricket 20200308232720.142 ERROR start - start.c:__lxc_start:2039 - Failed to spawn container "modest-cricket"

OK, if you’d give me access to the machine, that would help. You can send the credentials to my email; my SSH key is on my public GitHub.

@brauner Sure. It was a long night; let me first get into the office, then you’ll get all the data in about an hour. Thanks for the assistance!

@brauner Just sent you an email. Check your inbox.

Will do!

In short, this is caused by loading the seccomp policy on a kernel that has CONFIG_BPF_JIT_ALWAYS_ON=y set. Once a certain number of containers is reached, the eBPF JIT fails. I’ll try to debug this a little now.
If you want to proceed with your testing, you can set:

lxc profile set default raw.lxc 'lxc.seccomp.profile ='
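Note that this disables seccomp confinement for the containers entirely, so treat it as a temporary workaround for testing only; you can revert it later with lxc profile unset default raw.lxc. To check whether your kernel forces the JIT on, something like this should work (the path assumes a stock Ubuntu kernel config):

grep CONFIG_BPF_JIT_ALWAYS_ON /boot/config-$(uname -r)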

OK, a full fix is to bump the JIT limit significantly:

echo "high number you feel comfortable with" > /proc/sys/net/core/bpf_jit_limit

@brauner Thank you very much for your help; that seems to be the solution. You had configured net.core.bpf_jit_limit to 400000000. I created more containers until I had 900 running, then the same error appeared. I then raised the value to 800000000, and now I can keep adding running containers.

I will report back on how many containers I can reach. Keep up the good work, and thank you very much!

Maybe it would be a good idea to add this information to the production setup documentation I linked above; other people may hit this limit too.

Another edit:
@brauner I was able to scale past 1000 containers, but then I ran into RAM problems because I had not tuned the ZFS ARC limits etc. It looks really, really good now :slight_smile:

Cool! I added more info to our production-setup guide; @stgraber should merge it soon.