I noticed today that running an exec command like:
sudo lxc exec CONTAINER -- ps aux
gives me the error:
Error: /proc must be mounted
To mount /proc at boot you need an /etc/fstab line like:
proc /proc proc defaults
In the meantime, run "mount proc /proc -t proc"
After trying to restart the container with sudo lxc restart CONTAINER, the container will not start again:
Failed to run: zfs mount lxd_storage_pool_01/containers/CONTAINER: cannot mount
'lxd_storage_pool_01/containers/CONTAINER': filesystem already mounted
Try lxc info --show-log CONTAINER for more info
I am running LXD 3.22 on Ubuntu 18.04.
Any ideas how I could proceed? For now I don’t want to restart anything, to avoid risking a complete shutdown of my services…
As soon as we have a new build ready, Jenkins will auto-validate it; if that’s green, we’ll immediately release to stable. As we’re still in the initial 24h rollout window and only a limited number of users will ever hit this (those with systems that have been up for a while), this should avoid the worst of it.
Unfortunately those affected won’t find themselves fixed when the fix hits.
In all cases, when hitting this issue, first check whether an lxcfs process is running on your system. If there isn’t one, start by running systemctl reload snap.lxd.daemon to get one running again. This is required for newly started containers to behave and for restarted containers to go back to normal.
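The check above can be sketched as a small script. This is only an illustration of the steps described, not an official tool; the check_lxcfs name is made up here, and on an affected host you would run the printed systemctl command as root:

```shell
# Sketch: check for a running lxcfs process on the host and, if it is gone,
# point at the reload command from the post. pgrep -x matches the exact
# process name "lxcfs".
check_lxcfs() {
    if pgrep -x lxcfs >/dev/null 2>&1; then
        echo "lxcfs is running"
    else
        echo "lxcfs is not running; run: systemctl reload snap.lxd.daemon"
    fi
}

check_lxcfs
```

The systemctl call is left as output rather than executed, so the sketch is safe to run anywhere.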
After that, you effectively have two ways out of this:
Run grep lxcfs /proc/mounts in every container and unmount (umount) all the matching paths. The container will then behave again, but none of the cpu/memory/load-average/uptime resources will be properly reported; instead you’ll see the host values until the container is restarted.
Restart all the containers
As a side note, we don’t actually expect anyone who’s been keeping up with critical kernel updates to be able to hit this issue (as they wouldn’t ever reach the uptime needed), so if you’re hitting this, we’d suggest also applying any pending kernel updates to your system and rebooting it while you’re at it. There are important security fixes in those kernels that you definitely want to benefit from.
For additional context, here is why such issues happen and why they are so hard to recover from.
LXCFS is a FUSE filesystem that’s exposed in all containers. This comes with some issues: we can’t detach/re-attach to a FUSE filesystem, so killing/restarting the lxcfs binary breaks the FUSE mount and anything using it. Additionally, there are no good mechanisms to push/pull mounts from containers, so LXCFS dying doesn’t remove any of those mounts, and restarting it can’t inject them back either.
Because we need to be able to apply bug fixes, and more critically security fixes, to LXCFS on a running system without breaking all the containers, we have a clever design in place where 99% of what LXCFS does is stuffed into an internal library (liblxcfs.so). The lxcfs binary itself is just a loader for that library plus a signal handler: on receiving SIGUSR1, it unloads the current copy of the library, loads the new copy, and keeps doing its job.
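The reload-on-signal pattern can be illustrated with a toy shell sketch. This is only an analogy for the design described above, not the real lxcfs code (which dlclose()s and dlopen()s liblxcfs.so in C): the long-lived process swaps in new behaviour when signaled, instead of restarting and breaking its FUSE mount.

```shell
# Toy illustration of the reload-on-SIGUSR1 pattern (illustrative only):
# the handler stands in for unloading the old library copy and loading
# the new one, while the process itself keeps running.
reload_library() {
    RELOADED=1
    echo "reloading library copy"
}

# Install the handler; the process now reacts to SIGUSR1 without exiting.
trap reload_library USR1
```

Sending SIGUSR1 to the process (kill -USR1 PID) triggers the handler while the process keeps serving.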
This means we need to be extremely careful about never breaking backward compatibility in the library as we may be running from a binary that’s several years older than the library is (think upgrading from LXCFS 2.0 to 4.0).
In this particular case, the issue was that the filesystem on which lxcfs was started had long since been unmounted; when the new library was loaded, it would attempt to chdir to that path, fail, and crash. This is why only a limited number of users will ever hit this case. The fix we’re rolling out effectively ignores that particular error, as it shouldn’t be fatal.
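The failure mode can be reproduced in miniature. This is a hedged demonstration of the general problem (a chdir back to a path that no longer exists), not the actual lxcfs code path; the directory name is made up:

```shell
# Simulate "the directory we started in has since disappeared":
workdir="/tmp/lxcfs-chdir-demo-$$"
mkdir -p "$workdir"
cd "$workdir"
rmdir "$workdir"          # on Linux, removing the current directory is allowed

# Trying to chdir back now fails; before the fix this error was fatal.
if ! cd "$workdir" 2>/dev/null; then
    CHDIR_FAILED=1
    echo "chdir failed; the fix treats this as non-fatal"
fi
cd /                      # recover to a valid directory
```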
Additionally a follow-up set of fixes will be pushed to further harden the library reloading code so that even if everything goes terribly wrong, lxcfs will be left running, albeit in a mode where it only shows the host values (similar to having it unmounted from all containers).
snap refresh lxd will get the fix once available. We’re waiting on arm64 to finish building and Jenkins validation at this point. Hopefully both will be complete within 20min or so.
So you have two containers where restarting isn’t an option? lxc exec NAME bash followed by that grep command should still work for them. Once you get the list of mounts, unmount them with umount -l and things should get mostly back to normal for those containers. Then restart them when convenient for a full fix.
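The per-container cleanup can be sketched as follows. The helper name lxcfs_mountpoints and the CONTAINER placeholder are illustrative; the only assumption is that lxcfs-backed entries in /proc/mounts have "lxcfs" as their source (first field):

```shell
# Print the mount points of lxcfs-backed entries from /proc/mounts-formatted
# input (first field is the mount source, second is the mount point).
lxcfs_mountpoints() {
    awk '$1 == "lxcfs" {print $2}'
}

# On a real host, for a running container, you would combine this with
# lxc exec along the lines of:
#   lxc exec CONTAINER -- cat /proc/mounts | lxcfs_mountpoints | \
#     while read -r m; do lxc exec CONTAINER -- umount -l "$m"; done

# Demo on sample /proc/mounts content:
printf 'lxcfs /proc/cpuinfo fuse.lxcfs rw 0 0\nproc /proc proc rw 0 0\n' \
    | lxcfs_mountpoints
```

umount -l (lazy unmount) detaches the mount immediately and cleans up once it is no longer busy, which is why it works even while processes still hold the broken FUSE files open.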
Perfect, then I will wait for the update. lxc exec NAME bash followed by the grep does not work for them; it fails with: Error: Instance is not running
I will see what I can find out otherwise…
As I said on IRC, please don’t forget hosting providers that have live-patched kernels (and AMD hardware) and thus servers with very long uptimes.
This is very problematic for us, and while I appreciate that this is now fixed, rebooting tons of containers is not cool, nor are the hundreds of monitoring alerts the team gets when /proc/stat is no longer available.
I seemingly just got the update via auto-refresh, and apparently the issue was not fixed, as LXD containers on 3 other hosts have now failed in the same fashion.
Not all containers, however; I’m not sure what distinguishes those that didn’t fail from those that did.
I worked around it in the same way again, by unmounting everything under /proc managed by lxcfs.
for m in $(grep lxcfs /proc/mounts | grep proc | awk '{print $2}'); do umount "$m"; done
If you’ve hit this issue within the past 24h or so, it’s most likely something else.
In such cases, please provide the output of grep lxcfs /var/log/syslog and snap changes.
Mar 18 01:11:02 hqserv lxd.daemon[7784]: Closed liblxcfs.so
Mar 18 01:11:02 hqserv lxd.daemon[7784]: Running destructor lxcfs_exit
Mar 18 01:11:02 hqserv lxd.daemon[7784]: Running constructor lxcfs_init to reload liblxcfs
Mar 18 07:25:10 hqserv lxd.daemon[7784]: *** Error in `lxcfs': double free or corruption (fasttop): 0x00007f328c009050 ***
Mar 18 07:25:10 hqserv lxd.daemon[7784]: /snap/lxd/current/lib/liblxcfs.so(+0xda0b)[0x7f333c9b4a0b]
Mar 18 07:25:10 hqserv lxd.daemon[7784]: /snap/lxd/current/lib/liblxcfs.so(+0x9fe6)[0x7f333c9b0fe6]
Mar 18 07:25:10 hqserv lxd.daemon[7784]: /snap/lxd/current/lib/liblxcfs.so(+0xa1f2)[0x7f333c9b11f2]
Mar 18 07:25:10 hqserv lxd.daemon[7784]: /snap/lxd/current/lib/liblxcfs.so(cg_readdir+0x1ff)[0x7f333c9b14d0]
Mar 18 07:25:10 hqserv lxd.daemon[7784]: lxcfs[0x401ba0]
Mar 18 07:25:10 hqserv lxd.daemon[7784]: lxcfs[0x4026cf]
Mar 18 07:25:10 hqserv lxd.daemon[7784]: 00400000-00406000 r-xp 00000000 07:00 39 /snap/lxd/13814/bin/lxcfs
Mar 18 07:25:10 hqserv lxd.daemon[7784]: 00605000-00606000 r--p 00005000 07:00 39 /snap/lxd/13814/bin/lxcfs
Mar 18 07:25:10 hqserv lxd.daemon[7784]: 00606000-00607000 rw-p 00006000 07:00 39 /snap/lxd/13814/bin/lxcfs
Mar 18 07:25:10 hqserv lxd.daemon[7784]: 7f333c9a7000-7f333c9cb000 r-xp 00000000 07:04 177 /snap/lxd/13840/lib/liblxcfs.so
Mar 18 07:25:10 hqserv lxd.daemon[7784]: 7f333c9cb000-7f333cbca000 ---p 00024000 07:04 177 /snap/lxd/13840/lib/liblxcfs.so
Mar 18 07:25:10 hqserv lxd.daemon[7784]: 7f333cbca000-7f333cbcb000 r--p 00023000 07:04 177 /snap/lxd/13840/lib/liblxcfs.so
Mar 18 07:25:10 hqserv lxd.daemon[7784]: 7f333cbcb000-7f333cbcc000 rw-p 00024000 07:04 177 /snap/lxd/13840/lib/liblxcfs.so
and
ID Status Spawn Ready Summary
38 Done yesterday at 17:29 CET yesterday at 17:29 CET Auto-refresh snap "lxd"
39 Done today at 01:10 CET today at 01:11 CET Auto-refresh snap "lxd"