We are running a high-concurrency, optimized webserver setup using Apache with mpm_event in LXD containers. After upgrading to LXD 4.0.1 we have started to experience issues when reloading and restarting Apache: lxcfs CPU usage goes to ~200% for between 2 and 30 seconds, in rare cases up to 120 seconds. During this time several services, including Apache, are unavailable, and running /proc-related commands such as top, ps or uptime stalls until lxcfs has returned to normal.
We straced lxcfs while reloading Apache (http://sprunge.us/ZxV8up) and found a lot of lines like:
<... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
Sometimes, but not consistently, we also see the following in the host's syslog:
cgroup: fork rejected by pids controller in /lxc.payload.phct-030/system.slice/apache2.service
It seems we can mostly replicate this with a rather aggressive mpm_event.conf:
I can’t reproduce based on your instructions. Can you give a more detailed reproducer, please?
The “cgroup: fork rejected by pids controller” message is the kernel telling you that you’ve exceeded the number of processes your cgroup is allowed to run. What limit have you set for the container? It could simply be that your container isn’t allowed to spawn any more processes.
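As a rough sketch of how to check and, if needed, raise that limit (using the container name phct-030 from the syslog line above; the cgroup path assumes cgroup v1 on the host, adjust if your host uses the unified hierarchy):

```shell
# On the host: show the effective process limit for the container
# (unset means LXD imposes no limit of its own).
lxc config show phct-030 --expanded | grep limits.processes

# On the host: inspect the pids controller limit along the path
# from the kernel message (cgroup v1 layout assumed).
cat /sys/fs/cgroup/pids/lxc.payload.phct-030/pids.max

# Inside the container: systemd's own per-service task limit can
# also reject forks, independently of the container limit.
systemctl show apache2 -p TasksMax

# On the host: raise the container-wide process limit if it is the cause.
lxc config set phct-030 limits.processes 10000
```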
root@f3:~# cat /etc/apache2/mods-enabled/mpm_event.conf
# event MPM
# StartServers: initial number of server processes to start
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestWorkers: maximum number of worker threads
# MaxConnectionsPerChild: maximum number of requests a server process serves
<IfModule mpm_event_module>
StartServers 8
MinSpareThreads 100
MaxSpareThreads 300
ServerLimit 2000
ThreadLimit 256
ThreadsPerChild 100
MaxRequestWorkers 2000
MaxConnectionsPerChild 9999
</IfModule>
# vim: syntax=apache ts=4 sw=4 sts=4 sr noet
root@f3:~# apache2ctl graceful && time uptime
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
19:25:41 up 1 min, 0 users, load average: 1.71, 1.68, 1.50
real 0m0.260s
user 0m0.002s
sys 0m0.000s
root@f3:~# apache2ctl graceful && time uptime
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
19:25:42 up 1 min, 0 users, load average: 1.71, 1.68, 1.50
real 0m0.840s
user 0m0.002s
sys 0m0.000s
root@f3:~# apache2ctl graceful && time uptime
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
19:25:44 up 1 min, 0 users, load average: 2.13, 1.77, 1.53
real 0m0.921s
user 0m0.000s
sys 0m0.002s
root@f3:~# apache2ctl graceful && time uptime
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
19:25:46 up 1 min, 0 users, load average: 2.13, 1.77, 1.53
real 0m2.653s
user 0m0.002s
sys 0m0.000s
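The manual reload-and-time sequence above can be put in a small loop to make the growing stall times easier to capture (a sketch; assumes apache2ctl is on PATH and GNU time is installed as /usr/bin/time):

```shell
#!/bin/sh
# Repeatedly reload Apache gracefully and time how long a /proc-backed
# command (uptime) takes. lxcfs stalls show up as growing "real" times.
for i in $(seq 1 10); do
    apache2ctl graceful
    /usr/bin/time -f "reload $i: uptime took %e s" uptime >/dev/null
    sleep 1
done
```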
Lately we’ve had containers with service disruptions at midnight lasting 2-10 minutes, probably because all containers on the same host run logrotate (and thereby an Apache reload) at once, putting serious load on lxcfs.
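One possible mitigation sketch for the midnight pile-up is to add jitter so the containers don't all reload at the same instant (this assumes logrotate is driven by systemd's logrotate.timer inside each container, as on Ubuntu):

```shell
# Inside each container: add a randomized delay to logrotate.timer so
# containers sharing a host spread their Apache reloads over 30 minutes.
mkdir -p /etc/systemd/system/logrotate.timer.d
cat > /etc/systemd/system/logrotate.timer.d/jitter.conf <<'EOF'
[Timer]
RandomizedDelaySec=30min
EOF
systemctl daemon-reload
systemctl restart logrotate.timer
```

This only spreads the load; it doesn't address the underlying lxcfs contention itself.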
No - I did not have issues with this prior to the update. I can’t say whether this can be replicated on 3.x, but we did not experience it before upgrading.