Can't stop container: stays in running state although IP address disappears


(Idef1x) #1

Not sure if it’s because I freshly installed Ubuntu 18.04 and imported the containers (lxd import -f ), but I have run into problems stopping a container several times now. The only clue I can find is that /var/log/lxd/lxd.log shows:

container=leecher lvl=eror msg="Error calling 'lxd forkgetnet'" output="error: open mntns: No such file or directory\nerrno: 2\nFailed setns to container network namespace: Invalid argument\n" pid=2715 t=2018-05-03T13:15:42+0200

and the lxc.log of the container has entries like:
lxc 20180503111756.925 ERROR lxc_attach - attach.c:lxc_attach:1185 - No such file or directory - Failed to attach to mnt namespace of 26640
lxc 20180503111757.929 ERROR lxc_attach - attach.c:lxc_attach:1185 - No such file or directory - Failed to attach to mnt namespace of 26665
lxc 20180503111759.240 ERROR lxc_attach - attach.c:lxc_attach:1185 - No such file or directory - Failed to attach to mnt namespace of 26693
lxc 20180503111759.141 ERROR lxc_attach - attach.c:lxc_attach:1185 - No such file or directory - Failed to attach to mnt namespace of 26714
lxc 20180503111844.851 ERROR lxc_attach - attach.c:lxc_attach:1185 - No such file or directory - Failed to attach to mnt namespace of 29108

The last bits of the console.log show:
[ OK ] Stopped target Local File Systems.
Unmounting /proc/meminfo…
Unmounting /proc/uptime…
Unmounting /proc/swaps…
Unmounting /dev/fuse…
Unmounting /sys/devices/virtual/net…
Unmounting /dev/lxd…
Unmounting /proc/stat…
Unmounting /dev/.lxd-mounts…
Unmounting /dev/ptmx…
Unmounting /proc/sysrq-trigger…
[ OK ] Stopped target Local File Systems (Pre).
[ OK ] Stopped Remount Root and Kernel File Systems.
[ OK ] Reached target Shutdown.
Sending SIGTERM to remaining processes…
Sending SIGKILL to remaining processes…
Sending SIGKILL to PID 482 (transmission-da).
Sending SIGKILL to PID 494 (python).
Sending SIGKILL to PID 552 (python).
Halting system.

lxc list leecher shows it’s still running (after more than an hour now):
sjoerd@asterix:/var/log/lxd/leecher$ lxc list leecher
+---------+---------+------+------+------------+-----------+
|  NAME   |  STATE  | IPV4 | IPV6 |    TYPE    | SNAPSHOTS |
+---------+---------+------+------+------------+-----------+
| leecher | RUNNING |      |      | PERSISTENT | 0         |
+---------+---------+------+------+------------+-----------+

It looks a bit like lxc stop $CONTAINER freezes, but I can’t attach to the console of this container anymore either; I just get an EOF back. So something is hanging and I can’t figure out what is waiting for what. Any ideas, or shall I just kill -9 the PIDs still shown as active in the lxc.log?

edit: just killing the process that I thought was keeping the container from shutting down (listed under the container’s tree, found with ps fauxww) doesn’t seem to kill it either…
NB: the process is in defunct (zombie) state
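
For what it’s worth, a quick host-side sketch to see what’s left of the container’s tree and whether anything is stuck in a state that kill -9 can’t touch (the process names in the test are from this thread; the commands only assume standard procps):

```shell
# Find the LXC monitor process and the container's subtree on the host.
# ([l]xc keeps grep from matching itself; no match is fine on other hosts.)
ps fauxww | grep -B1 -A5 '[l]xc monitor' || true

# Flag zombies (Z) and uninterruptible sleeps (D): kill -9 has no
# effect on either state, so these need the kernel to release them.
ps -eo pid,ppid,stat,comm | awk 'NR==1 || $3 ~ /^[DZ]/'
```

A defunct process only disappears once its parent reaps it, which is why killing the child directly does nothing.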


(Stéphane Graber) #2

Does lxc stop leecher --force stop it?

lxc stop only sends a signal to the container’s init system that it’s time to shut down.
If the init system doesn’t then perform a full shutdown, the container will be left running.
This is the equivalent of sending an ACPI shutdown notification; it may or may not lead to a complete shutdown.
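
As a shell sketch of that order of operations (wrapped in a function so nothing runs on paste; `leecher` is the container from this thread and the 30-second grace period is an arbitrary choice):

```shell
# Ask the container's init system to shut down cleanly; if it hasn't
# stopped within the grace period, fall back to a forced stop.
stop_or_kill() {
    name="$1"
    lxc stop "$name" --timeout 30 || lxc stop "$name" --force
}

# Usage: stop_or_kill leecher
```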


(Idef1x) #3

No, --force didn’t help either. Finally, after a few hours, I rebooted the host, which took a while since it was still waiting for the container to shut down. Luckily it went through, and after the reboot the container was back up and running, and this time I could stop it normally. Luckily I don’t have too many containers running, but rebooting a whole host to get 1 container back in line is a bit over the top :wink:
Anyway, I had first upgraded my Ubuntu 16.04 server (LXD 2.20) to 18.04 and thus LXD 3.0, but later decided to do a fresh install of 18.04 since the system didn’t seem stable after the upgrade. Same issue with containers not wanting to stop, so the fresh install apparently didn’t matter.


(Stéphane Graber) #4

@idef1x It could have been a kernel issue then. If this happens again, look for processes stuck in I/O wait (D state in ps) or for scary errors in dmesg.
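
A sketch of that check for next time (the wchan column width and the dmesg flags assume a reasonably recent procps and util-linux):

```shell
# Processes in uninterruptible sleep (D state), plus the kernel
# function they are blocked in (wchan); NFS waits typically show there.
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

# Recent kernel errors and warnings often explain the hang.
dmesg --level=err,warn | tail -n 20 || true
```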


(Idef1x) #5

OK, I will…thanks for the feedback and support.


(Idef1x) #6

Hi @stgraber, got it again. Indeed there is a systemd-resolved process in D state that systemd-shutdown is waiting on to finish. I guess it has to do with an NFS mount inside the container. After killing the monitor process (its parent), I can start/stop the container normally again; however, the original systemd-resolved process stays in D state. Again probably due to the original NFS mount in the container, because I saw (and still see) another process in D state with an IPv6 address in its name: [fd00:192:168:10]. Not a complete address, but I assume it’s the address of my NFS server.
Anyway, it looks like I have to reboot my host again to resolve it, but since I am away from home at the moment, I’ll wait until I am back, in case the host won’t boot at all because of this. (I am afraid I’ll have to pull the power plug.)

Oh, and I see “scary” messages in the syslog: nfs: server asterix.lan not responding, timed out
So it must be the mount. Not sure why it couldn’t unmount, but looking at the message I get the feeling the container brings networking down before unmounting the network shares.
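
If it really is that ordering (network torn down before the NFS unmount), one thing that may help inside the container is marking the mount as a network filesystem and making it interruptible. A sketch; the export path and mount point are made up, only the server name asterix.lan is from this thread:

```shell
# _netdev tells systemd this is a network filesystem, so it gets
# unmounted before networking goes down; soft,timeo make the client
# give up on a dead server instead of hanging in D state forever.
echo 'asterix.lan:/export/data  /mnt/data  nfs  _netdev,soft,timeo=100  0  0'
# (append the line above to the container's /etc/fstab)
```

Note that soft mounts trade hangs for possible I/O errors when the server is flaky, so it’s a trade-off rather than a free fix.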

edit: not sure if this should be seen in the logs when shutting down the container:
May 2 18:09:01 asterix kernel: [115104.814517] audit: type=1400 audit(1556813341.266:575): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxd-home_leecher_</var/snap/lxd/common/lxd>" name="/bin/" pid=25431 comm="(ionclean)" flags="ro, remount, bind"

It’s a privileged container with the following config set:
lxc config set leecher raw.apparmor "mount fstype=nfs,"
Should be enough, right?
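
For what it’s worth, the denial above looks unrelated to NFS: comm="(ionclean)" doing a ro,bind remount of /bin/ looks like systemd’s sandboxing for the phpsessionclean service being blocked by the container’s profile. For the NFS rule itself, "mount fstype=nfs," only matches fstype nfs; if the share is mounted as NFSv4 an extra nfs4 rule might be needed. A sketch, wrapped in a function so nothing runs on paste (the nfs4 rule is my assumption, not verified syntax):

```shell
# Show the current raw.apparmor, then extend it to also allow nfs4
# mounts (assumption: the share may be mounted as NFSv4).
allow_nfs_mounts() {
    name="$1"
    lxc config get "$name" raw.apparmor
    lxc config set "$name" raw.apparmor 'mount fstype=nfs,
mount fstype=nfs4,'
}

# Usage: allow_nfs_mounts leecher
```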