Lxcfs hitting high CPU load on reload + restart of apache2 in container

We are running a high-concurrency webserver setup using apache with mpm_event in lxd containers. After upgrading to lxd 4.0.1 we have started to experience issues when reloading and restarting apache: lxcfs CPU usage goes to ~200% for between 2 and 30 seconds, and in rare cases up to 120 seconds. During this time several services, including apache, are unavailable, and /proc-related commands such as top, ps and uptime stall until lxcfs returns to normal.
We straced lxcfs while reloading apache: http://sprunge.us/ZxV8up
The trace contains a lot of <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) lines.
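For anyone wanting to reproduce the trace, something along these lines should do (a sketch; our exact invocation may have differed slightly):

# attach to the running lxcfs, follow all of its threads, record with timestamps
strace -f -tt -p "$(pidof lxcfs)" -o /tmp/lxcfs.strace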

Sometimes, but not consistently, we also see the following in the host's syslog:

cgroup: fork rejected by pids controller in /lxc.payload.phct-030/system.slice/apache2.service

It seems we can mostly replicate this with a rather aggressive mpm_event.conf:

StartServers                8
MinSpareThreads           100
MaxSpareThreads           300
ServerLimit              2000
ThreadLimit               256
ThreadsPerChild           100
MaxRequestWorkers        2000
MaxConnectionsPerChild   9999
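For context, a rough back-of-the-envelope for this config, assuming each apache worker thread counts as a task for the pids controller and that old and new children overlap briefly during a graceful reload:

# children at full load: MaxRequestWorkers / ThreadsPerChild
echo $(( 2000 / 100 ))           # 20 children
# approximate tasks per generation (workers plus listener/main thread per child)
echo $(( 20 * (100 + 2) ))       # ~2040 tasks
# worst case while old and new generations overlap during a graceful reload
echo $(( 2 * 20 * (100 + 2) ))   # ~4080 tasks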

System info:

root@HOSTNAME:~# lxd --version
4.0.1
root@HOSTNAME:~# /snap/lxd/current/bin/lxcfs --version
4.0.3
root@HOSTNAME:~# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.4 LTS
Release:	18.04
Codename:	bionic
root@HOSTNAME:~# uname -a
Linux HOSTNAME 5.3.0-51-generic #44~18.04.2-Ubuntu SMP Thu Apr 23 14:27:18 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

@brauner

I can’t reproduce based on your instructions. Can you give a more detailed reproducer, please?

The “cgroup: fork rejected by pids controller” message is the kernel telling you that you’ve exceeded the number of processes your cgroup is allowed to run. What limit have you set for the container? It could simply be that your container isn’t allowed to spawn any more processes.
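To rule that out you can compare what LXD has configured with what the pids controller actually enforces; a minimal check, assuming a cgroup v1 layout and the payload path from the syslog message above:

# on the host: limit configured through LXD (empty output means no limit set)
lxc config get phct-030 limits.processes
# on the host: limit and current task count as the pids controller sees them
cat /sys/fs/cgroup/pids/lxc.payload.phct-030/pids.max
cat /sys/fs/cgroup/pids/lxc.payload.phct-030/pids.current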

Well, for me it’s pretty easy to reproduce. I just created a new container with the following config:

root@phhw-200106:~# lxc config show --expanded apache-issue
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Debian buster amd64 (20200515_05:24)
  image.os: Debian
  image.release: buster
  image.serial: "20200515_05:24"
  image.type: squashfs
  limits.cpu: "12"
  limits.memory: 65536MB
  raw.lxc: lxc.apparmor.profile = unconfined
  security.syscalls.intercept.mknod: "true"
  volatile.base_image: 9f4a68d4cc4dec23aaa9ca2d48558f73873b407f00f5ff8f9572b352ec7902c5
  volatile.eth0.host_name: veth47d04e10
  volatile.eth0.hwaddr: 00:16:3e:4e:d7:c0
  volatile.eth1.host_name: veth5c5951fa
  volatile.eth1.hwaddr: 00:16:3e:c9:fe:1f
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  eth1:
    name: eth1
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: lxd
    type: disk
ephemeral: false
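Something along these lines should give an equivalent container (a sketch; eth0 comes from the default profile here, and bridge and pool names will differ per setup):

lxc launch images:debian/buster apache-issue
lxc config set apache-issue limits.cpu 12
lxc config set apache-issue limits.memory 65536MB
lxc config set apache-issue raw.lxc "lxc.apparmor.profile = unconfined"
lxc config set apache-issue security.syscalls.intercept.mknod true
lxc config device add apache-issue eth1 nic nictype=bridged parent=br0 name=eth1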

Then see what happens in the container:

The worst I can get here is like 2s:

root@f3:~# cat /etc/apache2/mods-enabled/mpm_event.conf
# event MPM
# StartServers: initial number of server processes to start
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestWorkers: maximum number of worker threads
# MaxConnectionsPerChild: maximum number of requests a server process serves
<IfModule mpm_event_module>
        StartServers                8
        MinSpareThreads           100
        MaxSpareThreads           300
        ServerLimit              2000
        ThreadLimit               256
        ThreadsPerChild           100
        MaxRequestWorkers        2000
        MaxConnectionsPerChild   9999
</IfModule>

# vim: syntax=apache ts=4 sw=4 sts=4 sr noet
root@f3:~# apache2ctl graceful && time uptime
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
 19:25:41 up 1 min,  0 users,  load average: 1.71, 1.68, 1.50

real    0m0.260s
user    0m0.002s
sys     0m0.000s
root@f3:~# apache2ctl graceful && time uptime
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
 19:25:42 up 1 min,  0 users,  load average: 1.71, 1.68, 1.50

real    0m0.840s
user    0m0.002s
sys     0m0.000s
root@f3:~# apache2ctl graceful && time uptime
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
 19:25:44 up 1 min,  0 users,  load average: 2.13, 1.77, 1.53

real    0m0.921s
user    0m0.000s
sys     0m0.002s
root@f3:~# apache2ctl graceful && time uptime
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
 19:25:46 up 1 min,  0 users,  load average: 2.13, 1.77, 1.53

real    0m2.653s
user    0m0.002s
sys     0m0.000s
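It may be easier to catch if you hammer the reload in a loop while watching lxcfs on the host (for example with top -p $(pidof lxcfs)); a quick sketch:

# inside the container: repeated graceful reloads, timing a /proc-backed command
for i in $(seq 1 20); do
    apache2ctl graceful
    time uptime
    sleep 2
done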

There was no kernel upgrade or anything like that?

It seems the more load the host has, the worse it gets.
We use the HWE kernel.

Lately we’ve had containers with service disruptions at midnight for 2-10 minutes, probably because all containers on the same host run logrotate (and thereby reload apache) at once, putting serious load on lxcfs.
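One way to check whether that theory holds is to look at when logrotate is actually scheduled in each container (a sketch, assuming the systemd logrotate.timer; older images may still run it from /etc/cron.daily):

for c in $(lxc list -c n --format csv); do
    echo "== $c"
    lxc exec "$c" -- systemctl list-timers logrotate.timer --no-pager
done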

Hm, and you didn’t have those issues with prior versions?

No - I did not have issues with this prior to the update. I can’t say whether this can be replicated in 3.x, but we did not experience it before upgrading.