Systemd-tmpfiles-clean.service - Failed to start Cleanup of Temporary Directories

Hi,

I'm having an issue with some of my containers. Over time, systemd-tmpfiles-clean.service ends up failing.

systemd-tmpfiles-clean.service: main process exited, code=exited, status=219/CGROUP
systemd[1]: Failed to start Cleanup of Temporary Directories.
Failed to kill control group: Transport endpoint is not connected
Failed to create cgroup /lxc/container_name/system.slice/systemd-initctl.service: Transport endpoint is not connected

df -h also gives:

df: ‘/proc/cpuinfo’: Transport endpoint is not connected
df: ‘/proc/diskstats’: Transport endpoint is not connected
df: ‘/proc/meminfo’: Transport endpoint is not connected
df: ‘/proc/stat’: Transport endpoint is not connected
df: ‘/proc/swaps’: Transport endpoint is not connected
df: ‘/proc/uptime’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/blkio’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/cpuacct,cpu’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/cpuset’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/devices’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/freezer’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/hugetlb’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/memory’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/systemd’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/net_prio,net_cls’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/perf_event’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/pids’: Transport endpoint is not connected

Host:
3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
LXC container:
3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

The only thing that seems to work at the moment is running:

systemctl --force reboot

Otherwise the container won't reboot, and halt -p doesn't work either.

Is there any fix for this, or has anyone else encountered the same issue?

The output above suggests that you're using lxcfs and that lxcfs on the host has crashed or been manually restarted, which would cause this kind of behavior in any running container.

Note that lxcfs updates should never actually restart it; instead, SIGUSR1 should be sent to it so that it can safely re-exec itself, keeping all containers happy.
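If you ever do need to refresh lxcfs by hand, that re-exec is the safe path rather than a service restart. A rough sketch, assuming a single lxcfs process on the host:

pidof lxcfs                   # confirm lxcfs is running and note its PID
kill -USR1 "$(pidof lxcfs)"   # request an in-place re-exec; the fuse mounts stay connected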

Well, I'm sure lxcfs is not being manually restarted. After all, in my case lxcfs cannot be restarted individually; it only restarts together with snap.lxd.daemon.service, if at all.

systemctl status snap.lxd.daemon.service -l

● snap.lxd.daemon.service - Service for snap application lxd.daemon
Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: disabled)
Active: active (running) since Thu 2019-01-10 23:01:27 EET; 5 days ago
Main PID: 26374 (daemon.start)
CGroup: /system.slice/snap.lxd.daemon.service
‣ 26374 /bin/sh /snap/lxd/9886/commands/daemon.start

Jan 13 01:01:01 lxd.daemon[26374]: bindings.c: 823: cgfs_iterate_cgroup: Failed closedir for lxc.payload/container1/user.slice/user-0.slice/session-c53.scope: Bad file descriptor
Jan 13 01:01:01 lxd.daemon[26374]: bindings.c: 823: cgfs_iterate_cgroup: Failed closedir for lxc.payload/container2/user.slice/user-0.slice/session-c54.scope: Bad file descriptor
Jan 13 01:01:01 lxd.daemon[26374]: bindings.c: 626: recursive_rmdir: Failed to close directory lxc.payload/container3/user.slice/user-0.slice/session-c53.scope: Bad file descriptor
Jan 13 02:01:01 lxd.daemon[26374]: bindings.c: 823: cgfs_iterate_cgroup: Failed closedir for lxc.payload/container4/user.slice/user-0.slice/session-c54.scope: Bad file descriptor
Jan 13 02:01:01 lxd.daemon[26374]: bindings.c: 823: cgfs_iterate_cgroup: Failed closedir for lxc.payload/container5/user.slice/user-0.slice/session-c54.scope: Bad file descriptor
Jan 15 13:51:17 lxd.daemon[26374]: t=2019-01-15T13:51:17+0200 lvl=eror msg="Failed to remove disk device path" err="remove /var/snap/lxd/common/lxd/devices/loan-application-backend-lv-dev/disk.loan-application-backend-lv-dev.var-lib-pgsql-10: no such file or directory" path=/var/snap/lxd/common/lxd/devices/loan-application-backend-lv-dev/disk.loan-application-backend-lv-dev.var-lib-pgsql-10
Jan 15 14:28:38 lxd.daemon[26374]: t=2019-01-15T14:28:38+0200 lvl=warn msg="Detected poll(POLLNVAL) event."
Jan 15 17:46:40 lxd.daemon[26374]: t=2019-01-15T17:46:40+0200 lvl=warn msg="Detected poll(POLLNVAL) event: exiting."
Jan 15 17:46:40 lxd.daemon[26374]: t=2019-01-15T17:46:40+0200 lvl=warn msg="Detected poll(POLLNVAL) event."
Jan 16 09:26:13 lxd.daemon[26374]: t=2019-01-16T09:26:13+0200 lvl=warn msg="Detected poll(POLLNVAL) event."

df: ‘/proc/cpuinfo’: Transport endpoint is not connected
df: ‘/proc/diskstats’: Transport endpoint is not connected
df: ‘/proc/meminfo’: Transport endpoint is not connected
df: ‘/proc/stat’: Transport endpoint is not connected
df: ‘/proc/swaps’: Transport endpoint is not connected
df: ‘/proc/uptime’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/blkio’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/cpuacct,cpu’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/cpuset’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/devices’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/freezer’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/hugetlb’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/memory’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/systemd’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/net_prio,net_cls’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/perf_event’: Transport endpoint is not connected
df: ‘/sys/fs/cgroup/pids’: Transport endpoint is not connected

Update -

For some time the containers behave as expected after I force-restart them, but after a while the same "Transport endpoint is not connected" failures appear again.
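For reference, the force restart I mean is just something like the following, with <container> standing in for each container's name:

lxc restart --force <container>   # hard-stop the container and start it again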

Could this be some kernel limit issue, or is it normal behavior for snap.lxd? I mean, the container itself is working and so are my services, but I guess this isn't the correct way to run them. @stgraber

When that happens, look at the process startup time for lxcfs. If it's some time after your container was started, then the problem is that lxcfs was restarted or crashed, causing the fuse disconnect.
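One way to do that comparison from the host (a sketch; <container> and <init-pid> are placeholders for your container's name and the PID reported by lxc info):

ps -o pid,lstart,cmd -C lxcfs      # start time of the current lxcfs process
lxc info <container> | grep Pid    # host PID of the container's init process
ps -o lstart= -p <init-pid>        # start time of the container's init

If lxcfs started after the container's init did, lxcfs was restarted or crashed while the container was running, which is what disconnects the fuse mounts inside it.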

This happened right after the physical blade was restarted.

I doubt it should behave like this:

Feb 04 15:38:51 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: => Starting LXD
Feb 04 15:38:51 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: t=2019-02-04T15:38:51+0200 lvl=warn msg="AppArmor support has been disabled because of lack of kernel support"
Feb 04 15:38:51 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: t=2019-02-04T15:38:51+0200 lvl=warn msg="Unable to update backup.yaml at this time" name=bank-amladb-uat rootfs=/var/snap/lxd/common/lxd/containers/bank-amladb-uat/r
Feb 04 15:38:58 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 823: cgfs_iterate_cgroup: Failed closedir for lxc.payload/bank-master-datadb-test/system.slice/zabbix-agent.service: Bad file descriptor
Feb 04 15:38:59 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 2611: do_write_pids: Error writing pid to child: Bad file descriptor.
Feb 04 15:39:00 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 823: cgfs_iterate_cgroup: Failed closedir for lxc.payload/hny18-bank-crmdb/system.slice/systemd-remount-fs.service: Bad file descriptor
Feb 04 15:39:01 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 2421: pid_from_ns: Timeout reading from parent.
Feb 04 15:39:01 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 823: cgfs_iterate_cgroup: Failed closedir for lxc.payload/hny18-bank-master-datadb/system.slice/systemd-random-seed.service: Bad file descriptor
Feb 04 15:39:02 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 823: cgfs_iterate_cgroup: Failed closedir for lxc.payload/hny18-fbodb/system.slice/systemd-journald.service: Bad file descriptor
Feb 04 15:39:02 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 2611: do_write_pids: Error writing pid to child: Bad file descriptor.
Feb 04 15:39:02 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 823: cgfs_iterate_cgroup: Failed closedir for lxc.payload/hny18-bigstage/system.slice/systemd-random-seed.service: Bad file descriptor
Feb 04 15:39:04 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 2421: pid_from_ns: Timeout reading from parent.
Feb 04 15:39:06 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 823: cgfs_iterate_cgroup: Failed closedir for lxc.payload/loan-application-backend-lv-dev/user.slice/user-0.slice/session-c1.scope: Bad file descriptor
Feb 04 15:39:06 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 2064: send_creds: Error getting reply from server over socketpair.
Feb 04 15:39:07 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 2611: do_write_pids: Error writing pid to child: Bad file descriptor.
Feb 04 15:39:07 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 2611: do_write_pids: Error writing pid to child: Bad file descriptor.
Feb 04 15:39:08 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 2134: recv_creds: Timed out waiting for scm_cred: No such file or directory
Feb 04 15:39:08 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: => LXD is ready
Feb 04 15:39:09 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 2421: pid_from_ns: Timeout reading from parent.
Feb 04 15:39:09 dca-fx6-3-db.srv.big.local lxd.daemon[27613]: bindings.c: 2421: pid_from_ns: Timeout reading from parent.