LXD 5.8 running on Ubuntu Jammy causes the number of CPUs reported by /proc/stat inside a container to fluctuate, which breaks applications that expect it to stay constant. The issue occurs regardless of whether limits.cpu is set.
This appeared for us after a server reboot, 5.8 having been auto-installed in the meantime. The previous reboot was on 5.6, so we assume this broke in the LXCFS bundled with either 5.7 or 5.8. Sadly, I was unable to downgrade because the DB schema had already been upgraded in 5.8 (is there a downgrade path?).
The issue can be observed by running the following in the container (adjust the expected count to the number of container CPUs + 1, since /proc/stat contains one aggregate "cpu" line plus one "cpuN" line per CPU):
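The original script was not included in this excerpt; the following is a minimal sketch reconstructed from the description (it writes each snapshot to temp.txt and breaks out of the loop when the CPU line count changes, matching the behaviour discussed later in this thread):

```sh
#!/bin/sh
# Expected number of lines starting with "cpu" in /proc/stat:
# one aggregate "cpu" line plus one "cpuN" line per container CPU.
EXPECTED=5   # adjust: container CPUs + 1

while true; do
    cat /proc/stat > temp.txt
    COUNT=$(grep -c '^cpu' temp.txt)
    if [ "$COUNT" -ne "$EXPECTED" ]; then
        echo "CPU line count changed: expected $EXPECTED, got $COUNT"
        break
    fi
done
```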
Thanks for the detailed report and the version information.
I can’t see any changes in lxc or lxcfs between LXD 5.6 and LXD 5.8 that could cause this problem. It may be a kernel problem too. Could you check which kernel version you were running before?
I rebooted with the -50 kernel; however, the issue reappeared within seconds, both in our build process and with the reproducer script. Interestingly, an older machine with the same setup is not affected.
Does the data in the container’s /proc/stat file come straight from the host kernel, or does lxcfs massage it in any way?
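As a quick check (a sketch, assuming the standard LXD setup where lxcfs files are bind-mounted over paths in /proc), one can look from inside the container for a fuse.lxcfs mount entry covering /proc/stat:

```sh
# Inside the container: if lxcfs is serving /proc/stat, a fuse.lxcfs
# mount entry should cover that path.
grep 'proc/stat' /proc/self/mountinfo
```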
The other machines I tested had 64 and 16 threads. While testing these I oversubscribed the CPUs and generated load with the “stress” tool to see whether the problem is load-related.
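For reference, an oversubscription run with stress might look like the following (a sketch; the exact invocation and thread counts we used aren’t recorded here):

```sh
# Spin up more CPU hogs than the machine has threads,
# e.g. 128 workers on the 64-thread box, for 10 minutes.
stress --cpu 128 --timeout 600
```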
We did add Mellanox NICs to the machines and installed the vendor’s DKMS driver. I can’t exclude that being related, but only the 128-thread machine is affected. The number of containers and the types of load did not change significantly, if at all.
If there are any checks you’d like run on the machine, let me know. The help is appreciated!
You can try putting some threads of your 128-thread EPYC into offline mode using the CPU hotplug feature, like this: echo 0 > /sys/devices/system/cpu/cpu65/online (then turn it back on after the experiment by writing 1 to the same sysfs file). You could disable all threads from cpu65 through cpu127 and check whether the issue is still reproducible (or even disable all threads except 32). That may give us a hint.
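A loop version of that suggestion might look like this (a sketch, assuming the usual cpu0–cpu127 sysfs naming on a 128-thread machine; run as root on the host):

```sh
# Take threads 65-127 offline; write 1 instead of 0 to bring them back.
for n in $(seq 65 127); do
    echo 0 > /sys/devices/system/cpu/cpu$n/online
done
```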
So I was able to reproduce the issue on the 64-thread server too; it just took longer.
Looking at the temp.txt generated by the reproducer when the loop breaks, the pattern is that the number of CPUs reported by /proc/stat was either 4 or the total number of host CPUs. During the looping I also get an occasional “cat: /proc/stat: Invalid argument”, which seems very wrong.
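For completeness, a quick way to tally the per-CPU lines in the captured snapshot (counting only the cpuN lines, not the aggregate “cpu” line):

```sh
# Number of per-CPU lines in the last snapshot taken by the reproducer.
grep -c '^cpu[0-9]' temp.txt
```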