LXD 5.9 crashes on CentOS 7

Hi,
Host server - latest CentOS 7 (7.9.2009)
LXD container - also latest CentOS 7
LXD package version on the host - 5.9-9879096, installed via snap.

After some time the container partially breaks. I can still ping any IP on the internet from it, so the network seems to be working, but e.g. the df command (and others) returns the following errors:

df: ‘/proc/cpuinfo’: Transport endpoint is not connected
df: ‘/proc/diskstats’: Transport endpoint is not connected
df: ‘/proc/loadavg’: Transport endpoint is not connected
df: ‘/proc/meminfo’: Transport endpoint is not connected
df: ‘/proc/slabinfo’: Transport endpoint is not connected
df: ‘/proc/stat’: Transport endpoint is not connected
df: ‘/proc/swaps’: Transport endpoint is not connected
df: ‘/proc/uptime’: Transport endpoint is not connected
df: ‘/sys/devices/system/cpu/online’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/blkio’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/cpu’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/cpuset’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/devices’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/freezer’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/hugetlb’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/memory’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/net_cls’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/perf_event’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/pids’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/systemd’: Transport endpoint is not connected

And on the host server there is an error in dmesg:
[309274.787698] lxcfs[29285]: segfault at 0 ip 00007f96c28d03ce sp 00007f96c15dcc38 error 4 in libc-2.31.so[7f96c2848000+178000]

Does anybody have an idea how to solve this issue?
Thank you.

Hmm, yeah, that’s an LXCFS crash.
Does that happen repeatedly?

To recover from this you’d need to:

  • systemctl reload snap.lxd.daemon
  • lxc restart --all

That will both reload LXD (restarting LXCFS in the process) and then restart all the containers on the system.
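For reference, the whole recovery could look roughly like this (just a sketch; test1 is an example container name):

sudo systemctl reload snap.lxd.daemon    # reloads LXD and restarts the bundled LXCFS
lxc restart --all                        # containers re-attach to the fresh LXCFS instance
lxc exec test1 -- cat /proc/uptime       # sanity check: should print uptime again, not a transport error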

Looks related: lxcfs crash on lxd 5.9 rev 24164 · Issue #573 · lxc/lxcfs · GitHub

@wrkilu you need to set up your core_pattern the same way as I’ve described here:

That’s needed to catch the core dump. BTW, you can check your current /proc/sys/kernel/core_pattern; if we’re lucky, you may already have a core dump collected in /var/crash/...
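For example, to check the current setting and whether anything has already been collected there:

cat /proc/sys/kernel/core_pattern
ls -l /var/crash/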

@wrkilu could you also check your kernel logs for a line with:

kernel: Code 

It should follow the line with the “segfault” info. Please post it too.
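For example, something like this should show the segfault line together with whatever follows it (the exact output depends on the kernel version):

dmesg -T | grep -A1 segfault
journalctl -k | grep -B1 'Code:'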

@stgraber
I ran “lxc stop test1 -f”, then “systemctl reload snap.lxd.daemon”.
The container started again and worked for about 3 hours, and then the problem came back…

@amikhalitsyn
There are no other important lines in dmesg around this segfault:
[424483.934921] lxdbr0: port 1(veth4aec45e0) entered forwarding state
[431674.370712] lxcfs[1237]: segfault at 0 ip 00007f89b23d83ce sp 00007f89b09a1c38 error 4 in libc-2.31.so[7f89b2350000+178000]
[436810.186476] logflags DROP IN=enp4s0 OUT=enp4s0 MAC=54:04:a6:f1:77:83:30:b6:4f:d8:00:d2:08:00

I should also mention that this container (I don’t have others yet) has a 1 GB RAM limit, and after “systemctl reload snap.lxd.daemon” it started up reporting the full host RAM size (16 GB). Then I rebooted it from inside and it came back up with 1 GB. And then, as I wrote, after 3 hours it got these lxcfs errors.

On the host:
cat /proc/sys/kernel/core_pattern
core

/var/crash is empty

The memory reporting behavior you described is expected for the crash you’re experiencing. You need to reload LXD so that LXCFS is restored; at that point, restarting a container will have it use the new LXCFS instance and so report memory consumption correctly. Restarting the container before restarting LXCFS will leave it seeing the memory information of the host system.
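A quick way to tell which state a container is in (a sketch, assuming a container named test1 with a 1 GB limit): with LXCFS attached correctly, the total memory reported inside should be roughly the limit, not the host’s 16 GB.

lxc exec test1 -- free -m                  # the Mem: total should be ~1024, not ~16000
lxc exec test1 -- head -n1 /proc/meminfo   # same check, straight from the LXCFS-backed file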

Now it’d be nice if we could indeed grab a core out of this thing since you seem to have it in a state that’s mostly reproducible…

Any idea what may be happening inside of your container at the time of the LXCFS crash?
If we can figure that out, then we could probably grab both strace and gdb output of the running lxcfs just as it crashes, which should give us what we need to sort this out.
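If it comes to that, attaching to the running process could look something like this (a sketch, assuming pidof finds the snap’s lxcfs process):

sudo strace -f -o /tmp/lxcfs.strace -p "$(pidof lxcfs)"   # collect a syscall trace until the crash

sudo gdb -p "$(pidof lxcfs)"   # attach gdb, then:
(gdb) continue                 # let it run until the segfault
(gdb) bt                       # print the backtrace once it stops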

please change it to:

echo '|/bin/sh -c $@ -- eval exec cat > /var/crash/core-%e.%p' > /proc/sys/kernel/core_pattern

and try to repeat the actions that led to the crash.
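One small caveat from my side (not something from the thread): the target directory has to exist for the cores to land there, and the core_pattern value is reset on reboot, so it would need to be set again after a restart.

mkdir -p /var/crash    # only needed if the directory is missing; yours already exists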

@stgraber
Nothing in particular is happening in it yet - it’s a clean OS.

@amikhalitsyn
OK, I’ve run that command on the host.
Let’s wait for the next crash, and hopefully we’ll have a crash dump.

Still no crash so far. I’ll write here when it occurs.

OK, I have a crash dump:
https://sdata.net.pl/files/core-lxcfs.5271.tar.gz
Please check it…

I should also add that the kernel on the host server is 3.10. Isn’t that too old? Maybe that’s the reason?

No, no. Userspace should work without crashes on any supported kernel. 3.10 (RHEL 7) is not ideal, but it’s okay.

This is the same issue as reported yesterday: Handle NULL in releasedir by deleriux · Pull Request #575 · lxc/lxcfs · GitHub

(gdb) bt
#0  __strcmp_sse2_unaligned ()
    at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:31
#1  0x00005577569c4508 in lxcfs_releasedir (path=0x0, fi=0x7f9242d6ac80)
    at ../src/src/lxcfs.c:774
#2  0x00007f92441122b7 in ?? ()
#3  0x0000000000000007 in ?? ()
#4  0x0000000000000000 in ?? ()
(gdb) p *(struct fuse_file_info*)0x7f9242d6ac80
$1 = {flags = 0, writepage = 0, direct_io = 0, keep_cache = 0, flush = 0, 
  nonseekable = 0, flock_release = 0, cache_readdir = 0, padding = 0, padding2 = 0, 
  fh = 140266048610592, lock_owner = 0, poll_events = 0}

(gdb) p/x ((struct fuse_file_info*)0x7f9242d6ac80)->fh
$4 = 0x7f923c006120

(gdb) x/8xg 0x7f923c006120
0x7f923c006120:	0x00007f923c005610	0x00007f923c004460
0x7f923c006130:	0x0000000000000000	0x6770757800000000
0x7f923c006140:	0x0000000000000000	0x00007f9200000000
0x7f923c006150:	0x0000000000000040	0x00000000000000a5

(gdb) x/s 0x00007f923c005610
0x7f923c005610:	"systemd"

(gdb) x/s 0x00007f923c004460
0x7f923c004460:	"lxc.payload.complexupgrade/system.slice/systemd-sysusers.service"

Thanks, @wrkilu, for providing the core dump! I think it makes sense to keep catching core dumps for the LXCFS process. I suspect we have two different bugs, because in lxcfs crash on lxd 5.9 rev 24164 · Issue #573 · lxc/lxcfs · GitHub we crashed on a write (!), but in your case we crashed on a read.

No problem! It’s me who should be thanking you for LXD, not the other way around.
I still think LXD is awesome; many thanks to all of you maintainers!

Should I attach the second crash dump when it occurs?

Should I attach the second crash dump when it occurs?

Yep, every piece of information may be valuable for debugging. I think we’ll release a new hotfix version of LXCFS soon, just to address this particular crash you’ve already caught. I’ll notify you.

There still hasn’t been another crash on my server.

Another question: when will you release the hotfix? :wink: Or… is there a way to downgrade LXD in snap to an older, good version?

I think the fix will be released this week. I can say that it makes no sense to downgrade, because this is not a regression. It’s an interesting question why you’ve started facing this issue (it may be related to our recent fix that turned on direct I/O mode for lxcfs, but that is in fact the right behavior).

This is my first container on a dedicated server with CentOS 7. I need virtualization, so I installed LXD, and then these problems occurred. I haven’t even installed any service on it - it only has an SSH server, nothing else - and it crashes randomly. So I’m simply waiting for a fixed version before I start installing services in it…

@wrkilu you can try sudo snap refresh lxd --channel=latest/candidate
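After refreshing, an optional sanity check is to confirm which revision ended up installed and what each channel currently carries:

snap list lxd
snap info lxd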