I rebooted with the -50 kernel, but the issue reappeared within seconds, both in our build process and with the reproducer script.
Okay, so, that’s not related to recent kernel changes. Good news for us.
Interestingly, an older machine with the same setup is not affected.
Does it have the same processor (128 threads), or one with fewer threads?
Does the data in the container /proc/stat file come straight from the host kernel or does lxcfs massage it in any way?
No, it comes through the lxcfs FUSE layer, because we hook the CPU count and related values.
Thanks a lot for your test with the older kernel, it’s really helpful. I’ll try to work out what’s happening here. On my 6-core / 12-thread machine it’s not reproducible )-:
Can you confirm that the issue appeared after a software upgrade on your host? That is, the hardware, the number of containers on the node, and other things were not changed?
The other machines I tested had 64 and 16 threads. While testing them I also oversubscribed the CPUs and generated load with the “stress” tool to see if the issue is load-related.
We did add Mellanox NICs to the machines and installed their DKMS driver. I can’t rule out that being related, but only the 128-thread machine is affected. The number of containers and the types of load did not change significantly, if at all.
If there are any checks you’d like me to run on the machine, let me know. The help is appreciated!
You can try putting some threads on your 128-thread EPYC into offline mode using the CPU hotplug feature, like this: echo 0 > /sys/devices/system/cpu/cpu65/online (then turn it back on after the experiment by writing 1 to the same sysfs file). You can try disabling all threads from 65 to 127 and check whether the issue is still reproducible (or even disable all threads except 32). That may give us a hint.
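A small loop makes the experiment less tedious. This is only a sketch that assumes the threads are numbered cpu0..cpu127; it prints what it would do, with the actual sysfs write left commented out (run it as root and uncomment the write on the test machine):

```shell
#!/bin/sh
# Offline threads 65..127 via CPU hotplug (assumes cpu0..cpu127 exist).
# Requires root for the actual write; re-enable afterwards by writing 1
# back to each file.
for cpu in $(seq 65 127); do
    f="/sys/devices/system/cpu/cpu${cpu}/online"
    echo "would offline cpu${cpu} via ${f}"
    # echo 0 > "$f"
done
```

Re-enabling is the same loop with `echo 1 > "$f"`.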
So I was able to reproduce the issue on the 64-thread server too; it just took longer.
Looking at the temp.txt generated by the reproducer when the loop breaks, the pattern is that the number of CPUs reported by /proc/stat was either 4 or the total number of host CPUs. During the looping I also got occasional “cat: /proc/stat: Invalid argument” errors, which seems very wrong.
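For context, a minimal loop in this spirit (a hypothetical sketch, not the actual reproducer script) snapshots /proc/stat, counts the per-CPU lines, and stops on a mismatch or a read error:

```shell
#!/bin/sh
# Snapshot /proc/stat repeatedly and compare the number of cpuN lines
# against the first reading; a failing cat is where the
# "Invalid argument" error would surface.
expected=$(grep -c '^cpu[0-9]' /proc/stat)
for i in $(seq 1 1000); do
    cat /proc/stat > temp.txt || { echo "read failed on iteration $i"; break; }
    n=$(grep -c '^cpu[0-9]' temp.txt)
    if [ "$n" -ne "$expected" ]; then
        echo "CPU count changed: got $n, expected $expected"
        break
    fi
done
```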
@zrav LXD 5.9 was released yesterday; you can try updating your snap, as it contains this fix for LXCFS. Hope it helps in your case. If not, we’ll continue the investigation.
@zrav this change was picked up in the last build. Please try snap refresh lxd and check which revision you get (it should be higher than 24164). And yes, you’ll need to reboot.
Yes, that seems to have done the trick; the issue can’t be reproduced anymore.
Thank you for getting in a fix so quickly. Once again I’m very impressed by the LXD team!
After running for a few days, /proc/cpuinfo and the other lxcfs mounts became unreadable:
df -h
df: /proc/cpuinfo: Transport endpoint is not connected
df: /proc/diskstats: Transport endpoint is not connected
df: /proc/loadavg: Transport endpoint is not connected
df: /proc/meminfo: Transport endpoint is not connected
df: /proc/slabinfo: Transport endpoint is not connected
df: /proc/stat: Transport endpoint is not connected
df: /proc/swaps: Transport endpoint is not connected
df: /proc/uptime: Transport endpoint is not connected
df: /sys/devices/system/cpu/online: Transport endpoint is not connected
df: /var/snap/lxd/common/var/lib/lxcfs: Transport endpoint is not connected
LXCFS crashed:
show_signal_msg: 14 callbacks suppressed
lxcfs[3219179]: segfault at 0 ip 00007f8084afdf81 sp 00007f8084a2e780 error 6 in libc-2.31.so[7f8084a94000+178000]
Code: 00 00 4c 89 ef 4c 89 4c 24 08 e8 3a 68 00 00 48 89 e9 4c 89 e2 48 89 ee 48 8d 05 2a d2 15 00 4c 89 ef 48 89 84 24 e8 00 00 00 <c6> 45 00 00 e8 06 7e 00 00 89 d9 4c 89 fa 4c 89 f6 4c 89 ef e8 c6