This issue has been causing me serious problems for the past 6 months.
Every few days, some of my nodes get stuck and die (only a reboot fixes it), logging multiple errors like these:
NMI Watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksoftirqd/4:36]
INFO: rcu_sched self-detected stall on cpu
o4-…: (2 GPs behind) idle=eb0/140000000001/0 softirq=9966753/9966753 fqs=7165
NMI Watchdog: BUG: soft lockup - CPU#1 stuck for 24s! [ksoftirqd/0:7]
Full trace: http://prntscr.com/fxt4rf
This is on Debian kernel 4.10.15-1 (Proxmox). With LXC 2.0.6, the trace showed the process that caused the issue, such as nginx, and if I managed to kill that process before the server died completely, the node recovered.
With LXC 2.0.8, however, the trace only names ksoftirqd as the stuck process.
In the past, one change that helped reduce the number of such crashes was limiting max_pids for the container. Some containers were able to crash the node even with a limit of 200 PIDs. I'm currently using 400 as the default.
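To make the PID limit easier to reason about, here is a rough Python sketch that reports how close each container is to its limit. It assumes the limit is enforced through the cgroup v1 pids controller (the pids.current and pids.max files) and that container cgroups live under /sys/fs/cgroup/pids/lxc/<container id>. That path, the 80% threshold, and the function names are illustrative assumptions, not details of the setup described here.

```python
from pathlib import Path


def parse_pids_value(text):
    """Parse the contents of pids.current or pids.max.

    The kernel writes the literal string "max" when no limit is set;
    that case is returned as None.
    """
    text = text.strip()
    return None if text == "max" else int(text)


def pids_usage(cgroup_dir):
    """Return (current, limit) for one container's pids cgroup directory."""
    base = Path(cgroup_dir)
    current = parse_pids_value((base / "pids.current").read_text())
    limit = parse_pids_value((base / "pids.max").read_text())
    return current, limit


def near_limit(current, limit, threshold=0.8):
    """True when the container uses at least `threshold` of its PID limit."""
    if limit is None:  # unlimited container, nothing to compare against
        return False
    return current >= threshold * limit


if __name__ == "__main__":
    # Assumed location of the per-container pids cgroups (cgroup v1 layout).
    root = Path("/sys/fs/cgroup/pids/lxc")
    if root.is_dir():
        for container in sorted(root.iterdir()):
            if not (container / "pids.max").is_file():
                continue
            current, limit = pids_usage(container)
            flag = "  <-- near limit" if near_limit(current, limit) else ""
            print(f"{container.name}: {current}/{limit}{flag}")
```

Running something like this periodically (from cron, for example) might show which container approaches its limit shortly before a lockup, now that the trace no longer names the offending process.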
Any idea what other limits I could add to prevent such issues?
Any help would be highly appreciated!