Number of CPUs reported by /proc/stat fluctuates, causing issues

LXD 5.8 running on Ubuntu Jammy causes the number of CPUs reported by /proc/stat inside a container to fluctuate, which breaks applications that expect the count to stay constant. The issue occurs regardless of whether limits.cpu is set.

This appeared for us after a server reboot that followed the automatic upgrade to 5.8. The previous reboot was on 5.6, so we assume the regression is in the LXCFS bundled with either 5.7 or 5.8. Sadly, I was unable to downgrade because the DB schema had been upgraded in 5.8 (is there a downgrade path?).

The issue can be observed by running the following in the container (adjust 29 to the container's CPU count plus one, to account for the aggregate "cpu" line):

cat /proc/stat > temp.txt

while [ "$(grep -c 'cpu' temp.txt)" -eq 29 ]
do
    echo "29"
    cat /proc/stat > temp.txt
done
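A slightly more robust variant of the reproducer (a sketch in POSIX sh; the sample count of 100 is arbitrary) records every distinct number of "cpu" lines observed instead of breaking on the first change:

```shell
# Sample /proc/stat repeatedly and collect the distinct counts of
# "cpu" lines seen; a healthy system prints exactly one value.
seen=""
i=0
while [ "$i" -lt 100 ]; do
    n=$(grep -c 'cpu' /proc/stat)
    case " $seen " in
        *" $n "*) ;;           # count already recorded
        *) seen="$seen $n" ;;  # record a newly observed count
    esac
    i=$((i + 1))
done
echo "distinct cpu-line counts seen:$seen"
```

On an affected machine this prints more than one count; on a healthy one it prints a single number.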

Because of this, we are experiencing frequent failures with node.js/libuv, which (for whatever reason) reads the CPU count multiple times:

[INFO] ng build --base-href ./: ../deps/uv/src/unix/linux-core.c:615: read_times: Assertion `num == numcpus' failed.

One for @amikhalitsyn

Hi @zrav!

Thanks for the detailed report and the version information.
I can't see any changes in lxc or lxcfs between LXD 5.6 and LXD 5.8 that could cause this problem. It may be a kernel problem too. Could you check which kernel version you had before?

Link to pretty similar issue from GitHub:

How many containers do you have (on the physical node)?

We went from kernel 5.15.0-50 to 5.15.0-53.
There are currently 60 containers on this physical machine, an EPYC with 128 threads, and I'd call the load medium.


I can’t see any suspicious commits between the -50 and -53 Ubuntu 22.04 (Jammy) kernels, except one:
# proc: Fix a dentry lock race between release_task and lookup

I suggest you try rebooting into the older kernel and checking whether it helps. It would be good to confirm that this is not kernel-related.

From the lxc/lxcfs side there were no suspicious changes at all between LXD 5.6 and 5.8.

I rebooted with the -50 kernel; however, the issue reappeared within seconds, both in our build process and with the reproducer script. Interestingly, an older machine with the same setup is not affected.
Does the data in the container /proc/stat file come straight from the host kernel or does lxcfs massage it in any way?


I rebooted with the -50 kernel; however, the issue reappeared within seconds, both in our build process and with the reproducer script.

Okay, so that’s not related to the recent kernel changes. Good news for us.

Interestingly, an older machine with the same setup is not affected.

Do you have the same processor (128 threads) on it, or with fewer threads?

Does the data in the container /proc/stat file come straight from the host kernel or does lxcfs massage it in any way?

No, it comes from the lxcfs FUSE filesystem, because we hook the CPU count and related values.
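For anyone wanting to verify this inside a container, one quick check (a sketch; paths and expected output are assumptions based on how LXD mounts lxcfs) is to look at the filesystem type backing /proc/stat. Under LXD with lxcfs it should report fuse.lxcfs, while plain procfs reports proc:

```shell
# Show any lxcfs mounts over /proc files (prints nothing outside an
# LXD container, hence the "|| true" so the script keeps going):
grep lxcfs /proc/self/mountinfo || true
# Filesystem type backing /proc/stat: "fuse.lxcfs" inside an LXD
# container, "proc" on the bare host.
fstype=$(stat -f -c %T /proc/stat)
echo "/proc/stat is served by: $fstype"
```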

Thanks a lot for testing with the older kernel, it’s really helpful. I’ll try to work out what’s happening here. On my 6-core / 12-thread machine it’s not reproducible )-:

Can you confirm that the issue appeared after a software upgrade on your host? So, hardware parts, the number of containers on the node, and other things were not changed?

The other machines I tested have 64 and 16 threads. While testing these I oversubscribed the CPUs and generated load with the “stress” tool to see whether it is load-related.
We did add Mellanox NICs to the machines and installed their DKMS driver. I can’t rule that out as related, but only the 128-thread machine is affected. The number of containers and the types of load did not change significantly, if at all.
If there are any checks you’d like run on the machine, let me know. The help is appreciated!


You can try putting some threads of your 128-thread EPYC into offline mode using the CPU hotplug feature, like this: echo 0 > /sys/devices/system/cpu/cpu65/online (then turn it back on after the experiment by writing 1 to the same sysfs file). You can try disabling all threads from 65 to 128 and checking whether the issue is still reproducible (or even disable all threads except 32). That may give us a hint.
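The suggested experiment could be scripted along these lines (a sketch: it only prints the commands instead of executing them, so nothing changes by accident; to actually offline the CPUs, run the printed commands as root on the host). Note that CPU numbering starts at 0, so threads "65 to 128" correspond to cpu64 through cpu127:

```shell
# Dry run: print the hotplug commands for offlining the upper half
# of a 128-thread machine. cpu0 typically cannot be offlined.
for n in $(seq 64 127); do
    f="/sys/devices/system/cpu/cpu$n/online"
    if [ -e "$f" ]; then
        echo "echo 0 > $f"   # print the command; do not write to sysfs
    fi
done
```

Writing 1 back to the same files after the experiment brings the threads online again.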

So I was able to reproduce the issue on the 64-thread server too; it just took longer.
Looking at the temp.txt generated by the reproducer when the loop breaks, the number of CPUs reported by /proc/stat was either 4 or the host’s total CPU count. During the looping I also get an occasional “cat: /proc/stat: Invalid argument”, which seems very wrong.


I also went ahead with the test disabling cpus. The issue still occurs when offlining all but 32.


Huge thanks for experimenting with that. I will take a careful look at the code tomorrow.

I’ve found something and posted a pull request.

It may be related to your problem, but I’m not sure. Let’s wait for other developers’ opinions.

@zrav LXD 5.9 was released yesterday; you can try updating your snap, as it contains this fix for LXCFS. Hope it helps in your case. If not, we’ll continue the investigation.

I updated to 5.9 from the candidate channel and rebooted; however, the issue is still reproducible with a similar frequency.


@zrav okay, I have an idea how we will catch this: a special build of lxcfs with ASAN and TSAN :)
I’ll reach out to you.

Libfuse3 direct io by mihalicyn · Pull Request #571 · lxc/lxcfs · GitHub should help

@zrav this change was picked up in the latest build. Please try snap refresh lxd and check which revision you get (it should be higher than 24164). And yes, you’ll need to reboot.