NMI Watchdog: Bug: soft lockup - CPU stuck

Hello

This is an issue that’s been causing me big problems for the past 6 months.

Every few days, some of my nodes get stuck and die (only a reboot brings them back), logging multiple errors like these:
NMI Watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksoftirqd/4:36]
INFO: rcu_sched self-detected stall on cpu
o4-…: (2 GPs behind) idle=eb0/140000000001/0 softirq=9966753/9966753 fqs=7165
NMI Watchdog: BUG: soft lockup - CPU#1 stuck for 24s! [ksoftirqd/0:7]

Full trace: http://prntscr.com/fxt4rf

This is on Debian kernel 4.10.15-1 (Proxmox). With LXC 2.0.6 the trace showed which process caused the issue (nginx or similar), and if I managed to kill that process before the server died completely, the node recovered.
However, on LXC 2.0.8 it only ever reports ksoftirqd.

In the past, one thing that helped reduce the number of such crashes was limiting the maximum number of PIDs per container. Some containers still managed to crash the node even with a 200-PID limit; I’m using 400 as the default at the moment.
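Roughly what I mean, assuming a plain LXC 2.x config with the cgroup-v1 pids controller (on Proxmox the line goes into the container’s config file; the exact key may differ on other setups):

# in the container's config, e.g. /var/lib/lxc/<name>/config
# caps the total number of PIDs the container can create
lxc.cgroup.pids.max = 400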

Any idea what other limits we could add to prevent such issues?

Any help would be highly appreciated!

Hmm, this is very much a kernel bug. I’m not sure how an older LXC would have influenced this at all.
I’d strongly recommend you file a bug against the kernel with your distribution.

I have come across the same issue on an Ubuntu 4.15.0-x LXC host.

Not sure whether it plays a role, but in my case the initial bug was with the KVM hypervisor the LXC host is running on. After that was rectified, the CPU freezes became less frequent and only affected one CPU, whereas before the fix both CPUs suffered.

It was then recommended to increase kernel.watchdog_thresh from 10 to 20 on the LXC host. Since then the CPU stalls are gone, though I’m not sure whether that is good practice or just a kludge… :slightly_frowning_face:
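For reference, the change itself is just a sysctl. The soft lockup report threshold is twice kernel.watchdog_thresh, so 20 means the warning only fires after roughly 40 s of a CPU being stuck (the file name under /etc/sysctl.d/ below is just an example):

# apply at runtime
sysctl -w kernel.watchdog_thresh=20
# persist across reboots
echo 'kernel.watchdog_thresh = 20' > /etc/sysctl.d/99-watchdog.conf
sysctl --system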

After some testing and investigation of the kernel logs I would concur. From the logs it is apparent that LXC is never cited as the culprit, nor does it appear under “Modules linked in” at the time of the CPU lockups.

A VPS is in general probably more prone to this issue than bare metal, and a low-spec VPS even more so.

https://kb.vmware.com/s/article/1009996

On a physical host, a soft lockup message generally indicates a kernel bug or hardware bug. When running in a virtual machine, this might instead indicate high levels of overcommitment (especially memory overcommitment) or other virtualization overheads.

If the LXC host is a VPS, it seems that the hypervisor’s settings can cause disk I/O issues and impact CPU performance.

In this case the hypervisor is Virtualizor, based on KVM 7.4.1 with a RAID 6 array. It turned out that the hypervisor’s disk cache settings had a detrimental effect, and after the VPS provider adjusted them, performance improved.
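I don’t know which knobs Virtualizor actually exposes, so this is only a sketch of the generic KVM/libvirt equivalent: the cache attribute on the guest’s disk driver is the setting usually involved, and cache='none' is a common choice for RAID-backed block storage (the device path here is made up):

<disk type='block' device='disk'>
  <!-- cache='none' bypasses the host page cache, avoiding the extra memory/I-O pressure that writeback caching can add -->
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/vg0/guest-disk'/>
  <target dev='vda' bus='virtio'/>
</disk>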