Weird memory page issue? (or other)

Hi… I have been experiencing the following error on one of my nodes

root@node21:~# lxc list
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fa778b79e97 m=7 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7fa778b79e97
stack: frame={sp:0x7fa743479810, fp:0x0} stack=[0x7fa742c7a288,0x7fa743479e88)
00007fa743479710: 0000000001deb960 0000ffff00001fa0
00007fa743479720: 00007fa743479c20 0000000000000000
00007fa743479730: 00007fa74347a700 00000000fffffffc
00007fa743479740: 00007fa778ceed74 901d3c6def8c9700
00007fa743479750: 0000000000000007 00007fa778b6dff8
00007fa743479760: 00007fa730001080 000000000000001b
00007fa743479770: 00007fa730001080 00007fa730000e60
00007fa743479780: 0000000001deb9e0 00007fa778b70070
00007fa743479790: 0000000000000001 00007fa743479970
00007fa7434797a0: 2525252525252525 2525252525252525
00007fa7434797b0: 00000000000000ff 0000000000000000
00007fa7434797c0: 00000000000000ff 0000000000000000
00007fa7434797d0: 000000c0000101f0 000000c00007b3c0
00007fa7434797e0: 000000c0002f18c0 000000c0002f4580
00007fa7434797f0: 000000c0002f4840 000000c0002ed3b0
00007fa743479800: 000000c0002ed430 000000c0002a5020
00007fa743479810: <0000000000000000 000000c00025f0c0
00007fa743479820: 20636f6c6c616d00 203a64656c696166
00007fa743479830: 0000000000000000 0000000000000000
00007fa743479840: 000000c0001c4840 000000c0001c4660
00007fa743479850: 000000c0000cd720 000000c0001c4420
00007fa743479860: 000000c0001c4480 000000c0001bc120
00007fa743479870: 0000000000000000 0000000000000000
00007fa743479880: 0000000000000000 0000000000000000
00007fa743479890: fffffffe7fffffff ffffffffffffffff
00007fa7434798a0: ffffffffffffffff ffffffffffffffff
00007fa7434798b0: ffffffffffffffff ffffffffffffffff
00007fa7434798c0: ffffffffffffffff ffffffffffffffff
00007fa7434798d0: ffffffffffffffff ffffffffffffffff
00007fa7434798e0: ffffffffffffffff ffffffffffffffff
00007fa7434798f0: ffffffffffffffff ffffffffffffffff
00007fa743479900: ffffffffffffffff ffffffffffffffff
runtime: unknown pc 0x7fa778b79e97
stack: frame={sp:0x7fa743479810, fp:0x0} stack=[0x7fa742c7a288,0x7fa743479e88)
00007fa743479710: 0000000001deb960 0000ffff00001fa0
00007fa743479720: 00007fa743479c20 0000000000000000
00007fa743479730: 00007fa74347a700 00000000fffffffc
00007fa743479740: 00007fa778ceed74 901d3c6def8c9700
00007fa743479750: 0000000000000007 00007fa778b6dff8
00007fa743479760: 00007fa730001080 000000000000001b
00007fa743479770: 00007fa730001080 00007fa730000e60
00007fa743479780: 0000000001deb9e0 00007fa778b70070
00007fa743479790: 0000000000000001 00007fa743479970
00007fa7434797a0: 2525252525252525 2525252525252525
00007fa7434797b0: 00000000000000ff 0000000000000000
00007fa7434797c0: 00000000000000ff 0000000000000000
00007fa7434797d0: 000000c0000101f0 000000c00007b3c0
00007fa7434797e0: 000000c0002f18c0 000000c0002f4580
00007fa7434797f0: 000000c0002f4840 000000c0002ed3b0
00007fa743479800: 000000c0002ed430 000000c0002a5020
00007fa743479810: <0000000000000000 000000c00025f0c0
00007fa743479820: 20636f6c6c616d00 203a64656c696166
00007fa743479830: 0000000000000000 0000000000000000
00007fa743479840: 000000c0001c4840 000000c0001c4660
00007fa743479850: 000000c0000cd720 000000c0001c4420
00007fa743479860: 000000c0001c4480 000000c0001bc120
00007fa743479870: 0000000000000000 0000000000000000
00007fa743479880: 0000000000000000 0000000000000000
00007fa743479890: fffffffe7fffffff ffffffffffffffff
00007fa7434798a0: ffffffffffffffff ffffffffffffffff
00007fa7434798b0: ffffffffffffffff ffffffffffffffff
00007fa7434798c0: ffffffffffffffff ffffffffffffffff
00007fa7434798d0: ffffffffffffffff ffffffffffffffff
00007fa7434798e0: ffffffffffffffff ffffffffffffffff
00007fa7434798f0: ffffffffffffffff ffffffffffffffff
00007fa743479900: ffffffffffffffff ffffffffffffffff

goroutine 1 [runnable]:
encoding/json.stateInString(0xc000699340, 0xc35e69, 0x0)
/snap/go/5830/src/encoding/json/scanner.go:328 +0x216
encoding/json.Indent(0xc0003b3690, 0xc0004a8000, 0x33b5a, 0x34000, 0xc35e5d, 0x1, 0xc35e5d, 0x1, 0x0, 0x0)
/snap/go/5830/src/encoding/json/indent.go:89 +0xd5
encoding/json.MarshalIndent(0xb03360, 0xc000114800, 0xc35e5d, 0x1, 0xc35e5d, 0x1, 0xc00025fc20, 0xc000172c60, 0x1c, 0x0, …)
/snap/go/5830/src/encoding/json/encode.go:181 +0xc8
github.com/lxc/lxd/shared/logger.Pretty(0xb03360, 0xc000114800, 0x0, 0x0)
/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/shared/logger/format.go:11 +0x5f
github.com/lxc/lxd/client.(*ProtocolLXD).queryStruct(0xc0003085b0, 0xc36187, 0x3, 0xc00011f260, 0x16, 0x0, 0x0, 0x0, 0x0, 0xb03360, …)
/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/client/lxd.go:298 +0x204
github.com/lxc/lxd/client.(*ProtocolLXD).GetInstancesFull(0xc0003085b0, 0x0, 0x0, 0x1, 0x8, 0x1, 0x0, 0x0)
/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/client/lxd_instances.go:110 +0x2c8
main.(*cmdList).Run(0xc0002a4540, 0xc0002acdc0, 0x1476910, 0x0, 0x0, 0x0, 0x0)
/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxc/list.go:394 +0x21d
github.com/spf13/cobra.(*Command).execute(0xc0002acdc0, 0x1476910, 0x0, 0x0, 0xc0002acdc0, 0x1476910)
/build/lxd/parts/lxd/go/src/github.com/spf13/cobra/command.go:842 +0x453
github.com/spf13/cobra.(*Command).ExecuteC(0xc0002642c0, 0xc00000e090, 0x1, 0x1)
/build/lxd/parts/lxd/go/src/github.com/spf13/cobra/command.go:950 +0x349
github.com/spf13/cobra.(*Command).Execute(…)
/build/lxd/parts/lxd/go/src/github.com/spf13/cobra/command.go:887
main.main()
/build/lxd/parts/lxd/go/src/github.com/lxc/lxd/lxc/main.go:238 +0x1c91

rax 0x0
rbx 0x7fa778f27840
rcx 0x7fa778b79e97
rdx 0x0
rdi 0x2
rsi 0x7fa743479810
rbp 0xda8db7
rsp 0x7fa743479810
r8 0x0
r9 0x7fa743479810
r10 0x8
r11 0x246
r12 0x1de87e0
r13 0x0
r14 0xd4c768
r15 0x0
rip 0x7fa778b79e97
rflags 0x246
cs 0x33
fs 0x0
gs 0x0

I have the following in /etc/security/limits.conf:

*       hard    nofile  1048576
root    soft    nofile  1048576
root    hard    nofile  1048576
*       soft    memlock unlimited
*       hard    memlock unlimited

and in /etc/sysctl.conf I have:

fs.inotify.max_queued_events      = 1048576
fs.inotify.max_user_instances     = 1048576
fs.inotify.max_user_watches       = 1048576
vm.max_map_count                  = 262144
kernel.dmesg_restrict             = 1
net.ipv4.neigh.default.gc_thresh3 = 8192
net.ipv6.neigh.default.gc_thresh3 = 8192
net.core.bpf_jit_limit            = 3000000000
kernel.keys.maxkeys               = 2000
kernel.keys.maxbytes              = 2000000	
net.core.netdev_max_backlog       = 182757
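
For reference, after editing those files this is roughly how they get reloaded and sanity-checked (standard commands, nothing LXD-specific):

sysctl -p                                               # reload /etc/sysctl.conf
sysctl fs.inotify.max_user_instances vm.max_map_count   # spot-check a couple of values
ulimit -Hn                                              # hard nofile limit; limits.conf only applies to new login sessions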

Any clues?

It seems to be not only lxc/lxd related. KVMs have been crashing and shutting down, and some fork-related error messages show up (I don't have anything on hand right now).

The only solution is to reboot the server.

I had no issues on this production server for a long time… but I added more containers to it and some extra KVMs… (It has 24 cores and 128GB of RAM…)

Sometimes free -h showed around 20G free and no swap being used… but then I can't even execute the free command anymore.
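
Since memory looks fine, my guess is that pthread_create is hitting a task/thread limit rather than running out of RAM. A rough way to compare the host-wide thread count against the kernel limits (generic knobs, nothing LXD-specific):

sysctl kernel.pid_max kernel.threads-max   # host-wide limits on task IDs / threads
ulimit -u                                  # per-user process limit in the current shell
ps -eLf | wc -l                            # total number of threads currently running on the host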

Thanks
Luis

Seems that we found the issue.

There was one particular container where the developer had set a cron job to run every minute, performing a query on a huge, non-indexed MongoDB.

The query was obviously not finishing before the next one was launched, slowing the system down terribly and creating an ever-growing number of active processes…

ps -eLf | wc -l was increasing considerably as time passed.
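
To narrow it down to a single container, counting tasks per container cgroup is one option; this is only a sketch, since the exact cgroup path depends on the LXD version and cgroup layout:

# per-container task counts via the pids cgroup (cgroup v1 layout assumed;
# depending on the LXD version the directory may be lxc/<name> or lxc.payload.<name>)
for c in /sys/fs/cgroup/pids/lxc/*/; do
    printf '%s: %s\n' "$(basename "$c")" "$(cat "${c}pids.current")"
done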

This created a lot of threads/processes in one particular container, and since limits are not set per container by default, it was crashing the whole host.

Commenting out the cron job fixed the issue… now the number of processes is constant and not increasing by the minute.
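
For anyone else hitting this: LXD can cap the number of processes per container so a single runaway workload can't take down the host. Something along these lines (the value is just an example):

lxc config set <container> limits.processes 5000   # cap for one container
lxc profile set default limits.processes 5000      # or set it in the default profile for all containers

With that in place the runaway cron job would have hit the container's own limit instead of exhausting the host.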
