Number of CPUs reported by /proc/stat fluctuates, causing issues

I rebooted with the -50 kernel, but the issue reappeared within seconds, both in our build process and with the reproducer script.

Okay, so that’s not related to recent kernel changes. Good news for us.

Interestingly, an older machine with the same setup is not affected.

Do you have the same processor (128 threads) on it, or one with fewer threads?

Does the data in the container /proc/stat file come straight from the host kernel or does lxcfs massage it in any way?

No, it comes from the lxcfs FUSE daemon, because we hook the CPU count and so on.

Thanks a lot for your test with the older kernel, it’s really helpful. I’ll try to guess what’s happening here. On my 6-core / 12-thread machine, it’s not reproducible )-:

Can you confirm that the issue appeared after a software upgrade on your host? That is, the hardware, the number of containers on the node, and other things were not changed?

The other machines I tested were 64 and 16 threads. While testing these I did oversubscribe the CPUs and generated load with the “stress” tool to see if it is load-related (see the example below).
We did add Mellanox NICs to the machines and installed their DKMS driver. I can’t exclude that being related, but only the 128-thread machine is affected. The number of containers and the types of loads did not change significantly, if at all.
If you have any checks you’d like run on the machine, let me know. The help is appreciated!
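For reference, the load generation looked something like this (the worker count and duration here are illustrative, not the exact values used):

# Spawn 128 CPU-bound workers for 60 seconds to oversubscribe the host
stress --cpu 128 --timeout 60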

You can try putting some threads of your 128-thread EPYC into offline mode using the CPU hotplug feature, like this: echo 0 > /sys/devices/system/cpu/cpu65/online (then turn it back on after the experiment by writing 1 to the same sysfs file). You can try disabling all threads from 65 to 127 and checking whether the issue is still reproducible (or even disable all threads except 32). There may be a hint for us.
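For example, as a loop over the upper threads (assuming CPUs are numbered 0-127 on the 128-thread machine; this needs root):

# Take cpu65..cpu127 offline via CPU hotplug
for i in $(seq 65 127); do
    echo 0 > /sys/devices/system/cpu/cpu$i/online
done
# ... run the reproducer, then bring them back online:
for i in $(seq 65 127); do
    echo 1 > /sys/devices/system/cpu/cpu$i/online
done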

So I was able to reproduce the issue on the 64-thread server too, it just took longer.
Looking at the temp.txt generated by the reproducer when the loop breaks, the pattern is that the number of CPUs reported by /proc/stat was either 4 or the total number of host CPUs. During the looping I also get an occasional “cat: /proc/stat: Invalid argument”, which seems very wrong.
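For readers without the original script, a minimal sketch of a reproducer along these lines (temp.txt and the break conditions mirror the description above; run inside a container whose /proc/stat is served by lxcfs):

# Loop until the CPU count reported by /proc/stat changes or the read fails
expected=$(grep -c '^cpu[0-9]' /proc/stat)
while true; do
    cat /proc/stat > temp.txt || break            # catches the “Invalid argument” reads
    count=$(grep -c '^cpu[0-9]' temp.txt)
    [ "$count" -eq "$expected" ] || break         # CPU count fluctuated; inspect temp.txt
done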

I also went ahead with the test disabling CPUs. The issue still occurs when offlining all but 32.

Huge thanks for playing with that. I’ll take a careful look at the code tomorrow.

I’ve found something and posted a pull request.

It may be related to your problem, but I’m not sure. Let’s wait for other developers’ opinions.

@zrav LXD 5.9 was released yesterday; you can try updating your snap, as it contains this fix for LXCFS. Hope it helps in your case. If not, we’ll continue the investigation.

I updated to 5.9 from the candidate channel and rebooted, but the issue is still reproducible with a similar frequency.

@zrav okay, I have an idea of how we can catch this: a special build of lxcfs with ASAN and TSAN :)
I’ll reach out to you.
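For context, ASAN and TSAN can’t be combined in one binary, so this usually means two separate builds; with lxcfs’s meson build system that could look something like this (directory names are illustrative):

# AddressSanitizer build
meson setup build-asan -Db_sanitize=address
ninja -C build-asan
# ThreadSanitizer build
meson setup build-tsan -Db_sanitize=thread
ninja -C build-tsan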

Libfuse3 direct io by mihalicyn · Pull Request #571 · lxc/lxcfs · GitHub should help.

@zrav this change was picked up in the latest build. Please try snap refresh lxd and check which revision you get (it should be higher than 24164). And yes, you’ll need to reboot.
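For reference, the refresh and the revision check (snap list shows the installed revision in its Rev column):

# Update the snap, then verify the installed revision
snap refresh lxd
snap list lxd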

I’m getting exactly revision 24164, but shouldn’t I be getting a higher one?

24164 is fine

Stéphane

Yes, that seems to have done the trick, the issue can’t be reproduced anymore :)
Thank you for getting in a fix so quickly. Once again I’m very impressed by the LXD team!

After running for a few days, /proc/cpuinfo and the other lxcfs mounts became unreadable:

df -h
df: /proc/cpuinfo: Transport endpoint is not connected
df: /proc/diskstats: Transport endpoint is not connected
df: /proc/loadavg: Transport endpoint is not connected
df: /proc/meminfo: Transport endpoint is not connected
df: /proc/slabinfo: Transport endpoint is not connected
df: /proc/stat: Transport endpoint is not connected
df: /proc/swaps: Transport endpoint is not connected
df: /proc/uptime: Transport endpoint is not connected
df: /sys/devices/system/cpu/online: Transport endpoint is not connected
df: /var/snap/lxd/common/var/lib/lxcfs: Transport endpoint is not connected

LXCFS did crash:

show_signal_msg: 14 callbacks suppressed
lxcfs[3219179]: segfault at 0 ip 00007f8084afdf81 sp 00007f8084a2e780 error 6 in libc-2.31.so[7f8084a94000+178000]
Code: 00 00 4c 89 ef 4c 89 4c 24 08 e8 3a 68 00 00 48 89 e9 4c 89 e2 48 89 ee 48 8d 05 2a d2 15 00 4c 89 ef 48 89 84 24 e8 00 00 00 <c6> 45 00 00 e8 06 7e 00 00 89 d9 4c 89 fa 4c 89 f6 4c 89 ef e8 c6

Hi @zrav,

this means that the lxcfs FUSE daemon crashed for some reason.
I think it’s better to file an issue on GitHub: Issues · lxc/lxcfs · GitHub

I’ll take a look and try to figure out a reason.

You’ll need to restart all containers to make lxcfs work again.
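If it helps, one way to restart them all with the LXD client (the column/format flags print just the container names):

# Restart every container so their lxcfs mounts are re-established
lxc list -c n --format csv | xargs -r -n1 lxc restart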

@zrav please provide the following information:

  1. service apport status
  2. cat /proc/sys/kernel/core_pattern
  3. ls -la /var/crash
  4. ls -la /var/lib/apport/coredump/
  5. cat /var/log/apport.log
  6. journalctl -u snap.lxd.daemon -n 200