From what I understand, this is an lxcfs bug, but I’m running the snap installation on Ubuntu.
This has happened twice now on the 3.22 stable channel. Why does this keep happening, and why is it not caught on the non-stable channels?
sudo ls /proc
ls: cannot access ‘/proc/stat’: Transport endpoint is not connected
ls: cannot access ‘/proc/swaps’: Transport endpoint is not connected
ls: cannot access ‘/proc/uptime’: Transport endpoint is not connected
ls: cannot access ‘/proc/cpuinfo’: Transport endpoint is not connected
ls: cannot access ‘/proc/loadavg’: Transport endpoint is not connected
ls: cannot access ‘/proc/meminfo’: Transport endpoint is not connected
ls: cannot access ‘/proc/diskstats’: Transport endpoint is not connected
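For anyone hitting the same errors, a quick way to check which of the lxcfs-backed /proc entries are in this state is to try reading each one. A minimal sketch (the file list is simply the set from the ls errors above; the function name is made up for illustration):

```shell
# Probe the /proc files lxcfs overmounts; a dead FUSE mount shows up as a
# read error ("Transport endpoint is not connected") instead of file content.
probe_lxcfs_procs() {
    broken=0
    for f in stat swaps uptime cpuinfo loadavg meminfo diskstats; do
        if cat "/proc/$f" >/dev/null 2>&1; then
            echo "OK      /proc/$f"
        else
            echo "BROKEN  /proc/$f"
            broken=$((broken + 1))
        fi
    done
    echo "$broken broken"
}
probe_lxcfs_procs
```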
DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"
snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: 3.22/stable
refresh-date: yesterday at 21:25 CDT
snap changes
ID Status Spawn Ready Summary
52 Done yesterday at 12:20 CDT yesterday at 12:20 CDT Auto-refresh snap "lxd"
53 Done yesterday at 21:25 CDT yesterday at 21:25 CDT Auto-refresh snap "lxd"
As for your original question: lxcfs development versions have been in the edge channel for years, but very few people run those on production systems, which is why the recently fixed issues went unnoticed.
We do have all the usual testing on LXCFS, both at the time changes get merged and several times a day on all the distros we support. Those runs certainly do find bugs that then get fixed; the ones that don't get caught are the much weirder ones that automated testing and static analysis just can't find.
I did get a crash overnight on one of my own production systems. I'd still like the output of grep lxcfs /var/log/syslog to see whether it's the same thing, and whether we're lucky enough to get slightly more detail from your case.
@brauner tells me that based on patterns we saw in my crash, he’s got a reproducer and is working on it.
The current bug likely only affects those running on very old kernels (pre-cgroup-namespace, i.e. pre-cgns) or those running containers with nesting enabled (or at least with a /var/lib/lxcfs path inside them).
As far as we can tell, it's not triggered by time but by something actually accessing that path. In my case the trigger looks to have been something like updatedb or a backup script.
We’re working on a fix now and will push it as soon as available to prevent any further breakage.
There is no global way to turn it off, and I wouldn't recommend doing so anyway, as it generally leads to odd crashes.
Without lxcfs, the memory and CPU reported inside your containers will not follow the limits you apply, so software in a container will think it can use far more memory or CPU than it is actually allowed, and will crash when the limit is hit.
lxcfs also handles things like the uptime of the container and the load average.
Turning it off would effectively cause all containers to show the raw system-wide values for all resources rather than the values specific to the container you’re running.
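To make that concrete, these are some of the files lxcfs virtualizes; read without lxcfs in the path (for example on the host itself) they show raw, system-wide values rather than per-container ones:

```shell
# Without lxcfs, a container reading these files sees the host's raw,
# system-wide values instead of its own limits and counters.
head -n1 /proc/meminfo       # MemTotal: the whole machine's RAM, not the container limit
cat /proc/loadavg            # system-wide load average
cut -d' ' -f1 /proc/uptime   # system uptime, not time since the container started
```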
The crash above matches the one we've now isolated and fixed in lxcfs; the fix will be rolled out within the next 2 hours. Do note that installing the fix will not repair already-broken containers: those need to either have lxcfs unmounted (at which point they'll see the host values) or be fully restarted (at which point they'll see their own values again).
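For the unmount option, a rough sketch of what that could look like. The container name c1 is hypothetical, the file list assumes the standard lxcfs overmounts, and by default the function only prints the commands it would run (dry run):

```shell
# Detach the dead lxcfs FUSE mounts inside a broken container so reads
# fall back to the host values. "c1" is a hypothetical container name.
unmount_lxcfs() {
    run="${1-echo}"   # default: dry run that prints commands; pass "" to execute
    for f in stat swaps uptime cpuinfo loadavg meminfo diskstats; do
        $run lxc exec c1 -- umount "/proc/$f"
    done
}
unmount_lxcfs   # dry run: prints the commands without running them
```

A full restart (lxc restart c1) reaches the other end state described above, with the container seeing its own values again.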