Can't stop debian 8, 9 and 10 containers

burbilog · March 13, 2019, 9:09am

This happens on two different hosts: Ubuntu 18.04 (physical machine) and CentOS 7 (remote KVM guest), both running lxd 3.11 from the snap. If I create a Debian container:

 lxc launch images:debian/9 deb

and later try to shut it down, nothing happens: lxc stop deb hangs. However, if I open a second shell and issue a second lxc stop deb, the first lxc stop deb returns too and the container stops. Debian 8, 9 and 10 images seem to be affected, while alpine 3.9 and centos 7 aren’t. I did not try other images.

It seems that Debian begins the shutdown procedure, removes its IP addresses, and then something breaks. This is the output of lxc list --format csv while lxc stop is hanging: the deb container had an IP address, and lxc stop removed it but did not finish the shutdown procedure:

alp,RUNNING,172.16.172.31 (eth0),PERSISTENT,
cms,RUNNING,172.16.172.218 (eth0),PERSISTENT,
dd,RUNNING,172.16.172.47 (eth0),PERSISTENT,
deb,RUNNING,PERSISTENT,
static,RUNNING,172.16.172.57 (eth0),PERSISTENT,
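
A state like the deb row above (still RUNNING, no address) can be picked out with a small filter. This is only a sketch: it assumes the columns are requested explicitly with -c ns4 (name, state, IPv4) so the CSV layout is predictable, and the function name stuck_candidates is made up for illustration. A container that has only just started and has no lease yet would match too, so treat the result as candidates.

```shell
#!/bin/sh
# Print the names of containers that report RUNNING but have no IPv4 address.
# Expects CSV rows of the form "name,state,ipv4" on stdin.
stuck_candidates() {
    awk -F, '$2 == "RUNNING" && $3 == "" { print $1 }'
}

# Intended usage (needs a working LXD client):
#   lxc list --format csv -c ns4 | stuck_candidates
```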

Any ideas how to fix it?

bruce78 · March 13, 2019, 9:30am

lxc stop --force [container-name] is the way forward here…

burbilog · March 13, 2019, 10:53am

Forcing shutdown doesn’t look like good practice… something is wrong.

bruce78 · March 13, 2019, 11:05am

Yeah, there are some debugging comments here…

burbilog · March 13, 2019, 11:28am

Well, I see no errors on the console after issuing the first lxc stop deb:

[  OK  ] Stopped target Timers.
[  OK  ] Stopped Daily apt upgrade and clean activities.
[  OK  ] Removed slice system-getty.slice.
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Stopped target Graphical Interface.
[  OK  ] Stopped target Multi-User System.
         Stopping Network Name Resolution...
         Stopping Login Service...
         Stopping D-Bus System Message Bus...
[  OK  ] Stopped target Login Prompts.
         Stopping Console Getty...
[  OK  ] Stopped Daily apt download activities.
[  OK  ] Stopped target System Time Synchronized.
[  OK  ] Stopped Daily Cleanup of Temporary Directories.
[  OK  ] Stopped Login Service.
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Stopped Network Name Resolution.
[  OK  ] Stopped Console Getty.
         Stopping Permit User Sessions...
[  OK  ] Stopped Permit User Sessions.
[  OK  ] Stopped target Remote File Systems.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target Paths.
[  OK  ] Stopped target Slices.
[  OK  ] Removed slice User and Session Slice.
[  OK  ] Stopped target Sockets.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Stopped target System Initialization.
[  OK  ] Stopped target Swap.
         Stopping Update UTMP about System Boot/Shutdown...
[  OK  ] Stopped target Encrypted Volumes.
[  OK  ] Stopped Forward Password Requests to Wall Directory Watch.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
[  OK  ] Stopped target Network.
         Stopping Network Service...
         Stopping Raise network interfaces...
[  OK  ] Stopped Update UTMP about System Boot/Shutdown.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped Network Service.

Here it hangs. If I issue lxc stop deb again in a second shell, the log continues:

[  OK  ] Stopped target Network.
         Stopping Network Service...
[  OK  ] Stopped Network Service.
[  OK  ] Reached target Shutdown.
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Halting system.

And then it stops. Also, lxc stop --force deb kills the container silently, with no delay.

BTW, all the faulty containers are bare containers straight from the image repository. I just ran lxc launch images:debian/9 deb and that’s all.
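
For anyone who needs a stopgap in scripts, a clean stop can be attempted first and the forced stop kept as a fallback. A minimal sketch, assuming lxc stop returns a non-zero status when its --timeout expires; the function name stop_container and the LXC override variable are invented here for illustration:

```shell
#!/bin/sh
# Try a clean stop first; force the container down only if that fails.
# $1 = container name, $2 = seconds to wait for the clean stop (default 60).
# LXC names the client binary and can be overridden, e.g. with a stub.
stop_container() {
    name=$1
    wait_s=${2:-60}
    if "${LXC:-lxc}" stop "$name" --timeout "$wait_s"; then
        echo "clean"
    else
        "${LXC:-lxc}" stop "$name" --force && echo "forced"
    fi
}
```

Called as stop_container deb 30, it prints which path was taken. It does not explain why the clean stop hangs; it only bounds the wait.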

simos · March 13, 2019, 12:06pm

The issue is likely related to networking, because the first set of logs is interrupted at the Network Service and the second set continues at the Network Service.

Let’s verify this hypothesis.

$ lxc launch images:debian/9 deb
Creating deb
Starting deb
$ lxc shell deb
mesg: ttyname failed: Success
root@deb:~# ifdown eth0
Killed old client process
Internet Systems Consortium DHCP Client 4.3.5
Copyright 2004-2016 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on LPF/eth0/00:16:3e:2f:45:17
Sending on   LPF/eth0/00:16:3e:2f:45:17
Sending on   Socket/fallback
DHCPRELEASE on eth0 to 10.100.100.1 port 67
root@deb:~# logout
$ lxc stop deb
$ lxc delete deb
$

So after taking eth0 down manually with ifdown, the container shuts down cleanly.

What could be the issue? What commands does this Debian container image use to shut down the networking? There is systemd in the container image, but networking is handled by ifupdown.
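
The manual check above can be repeated against other images with a small helper. This is a sketch only: the function name ifdown_then_stop and the LXC override variable are made up for illustration, and it assumes the interface is managed by ifupdown, as in this image.

```shell
#!/bin/sh
# Bring the interface down inside the container, then request a normal stop.
# $1 = container name, $2 = interface (default eth0).
# LXC names the client binary and can be overridden, e.g. with a stub.
ifdown_then_stop() {
    name=$1
    iface=${2:-eth0}
    "${LXC:-lxc}" exec "$name" -- ifdown "$iface" &&
        "${LXC:-lxc}" stop "$name"
}
```

If lxc stop returns promptly after this but hangs without the ifdown step, that supports the networking hypothesis.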

gpatel-fr · March 13, 2019, 1:04pm

When the container fails to stop, is it still possible to get a shell with lxc exec (for example, lxc exec deb -- bash)?
I had a similar problem where that was still possible, and I found the cause in the container’s syslog. It seems that the Debian 9 image doesn’t have rsyslog installed by default, but journalctl should give you the same information.

burbilog · March 22, 2019, 3:45pm

Hmm. It seems that the debian/10 container is working now: I can shut it down with a single lxc stop command. debian/9 and debian/8 still can’t be shut down using lxc stop.

craigphicks · March 22, 2019, 8:38pm

I don’t know if there is some relation to this old issue:

but the symptom is similar. There are a lot of interesting comments and workarounds in that post. At that time, the cause was apparently systemd ignoring the particular signal used by lxd to request stopping.

craigphicks · March 23, 2019, 12:07am

Comparing the log files from ‘lxc stop [cont]’ and ‘lxc exec [cont] -- poweroff’, they start and end differently but share a lot in the middle.

The former starts with

systemd[1]: Received SIGRTMIN+3.
systemd[1]: Reached target Unmount All Filesystems.
...

whereas the latter doesn’t reach “Unmount All Filesystems” until much later, after other tasks. It could be that systemd treats SIGRTMIN+3 as an emergency power loss and therefore unmounts the filesystems as soon as possible to minimize filesystem damage, even if that causes some problem in the shutdown.

They finish up differently. The “poweroff” method ending is clean:

systemd[1]: Stopped Network Service.
systemd[1]: Stopped Update UTMP about System Boot/Shutdown.
systemd[1]: Stopped Create Volatile Files and Directories.
dhclient[127]: Killed old client process
ifdown[112]: Killed old client process

resulting in a clean shutdown.

In contrast, with the “stop” method some processes restart themselves while systemd seems to be both starting and stopping units:

systemd[1]: Stopped Network Service.
systemd[1]: Stopped Update UTMP about System Boot/Shutdown.
stretch-cc systemd[1]: Stopped Create Volatile Files and Directories.
stretch-cc dhclient[180]: Killed old client process
ifdown[165]: Killed old client process

So far so good, but then it continues into a final zombie state:

dhclient[180]: Internet Systems Consortium DHCP Client 4.3.5
ifdown[165]: Internet Systems Consortium DHCP Client 4.3.5
ifdown[165]: Copyright 2004-2016 Internet Systems Consortium.
ifdown[165]: All rights reserved.
ifdown[165]: For info, please visit https://www.isc.org/software/dhcp/
dhclient[180]: Copyright 2004-2016 Internet Systems Consortium.
dhclient[180]: All rights reserved.
dhclient[180]: For info, please visit https://www.isc.org/software/dhcp/
dhclient[180]: 
dhclient[180]: Listening on LPF/eth0/00:16:3e:e9:b8:9f
ifdown[165]: Listening on LPF/eth0/00:16:3e:e9:b8:9f
ifdown[165]: Sending on   LPF/eth0/00:16:3e:e9:b8:9f
ifdown[165]: Sending on   Socket/fallback
dhclient[180]: Sending on   LPF/eth0/00:16:3e:e9:b8:9f
dhclient[180]: Sending on   Socket/fallback
dhclient[180]: DHCPRELEASE on eth0 to 10.185.64.1 port 67
ifdown[165]: DHCPRELEASE on eth0 to 10.185.64.1 port 67
systemd[1]: Reached target Final Step.
systemd[1]: Stopped Raise network interfaces.
systemd[1]: Stopped Apply Kernel Variables.
systemd[1]: systemd-networkd.service: Failed to reset devices.list: Operation not permitted
systemd[1]: systemd-networkd.service: Failed to set invocation ID on control group /system.slice/systemd-networkd.service, ignoring: Operation not permitted
systemd[1]: Starting Network Service...
systemd[1]: Stopped target Local File Systems.
systemd[1]: Stopped target Local File Systems (Pre).
systemd[1]: Stopped Create Static Device Nodes in /dev.
systemd[1]: Stopped Remount Root and Kernel File Systems.
systemd[1]: Stopped Load Kernel Modules.
systemd-networkd[192]: Enumeration completed
systemd-networkd[192]: eth0: Removing non-existent address: 10.185.64.200/24 (valid forever)
systemd[1]: Started Network Service.
systemd-networkd[192]: eth0: Removing non-existent address: fd42:f2c5:781c:6810:216:3eff:fee9:b89f/64 (valid for 59min 59s)
systemd-networkd[192]: eth0: Removing non-existent address: fe80::216:3eff:fee9:b89f/64 (valid forever)
systemd[1]: Reached target Network.
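
The telltale part of the log above is that units are started again after shutdown has begun. A saved journal can be scanned for that pattern with a filter like the following. It is a rough sketch: the function name started_after is made up, the default marker is simply the first line of the “stop” log quoted earlier, and some start-up messages during shutdown may be legitimate, so the output is a list of things to look at rather than a verdict.

```shell
#!/bin/sh
# Print journal lines that report a unit being started after a marker line.
# $1 = marker text (plain string, default "Received SIGRTMIN+3").
# Reads the journal text on stdin.
started_after() {
    marker=${1:-"Received SIGRTMIN+3"}
    awk -v marker="$marker" '
        index($0, marker) { seen = 1; next }
        seen && ($0 ~ / Starting / || $0 ~ / Started /) { print }
    '
}

# Intended usage, with a hypothetical saved log file:
#   started_after < stop-journal.txt
#   started_after "Reached target Final Step" < stop-journal.txt
```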

If we suppose this behavior is not an accident, it could be explained by systemd’s goal being only to unmount media to prevent data corruption, and NOT to shut down. That supposition seems plausible when reading between the lines of this discussion.

So in the Debian/stretch image I got from linuxcontainers, sigpwr.target is an empty stub. Linking it to ‘poweroff.target’ doesn’t have any effect. I think the current response to SIGRTMIN+3 is baked into systemd’s compiled code; maybe it can’t be overridden?

Debian 8/9 are going to be around for a while. Maybe an lxd/lxc configuration option to translate ‘lxc stop [cont]’ into a poweroff command would be practical.

Otherwise, if there were a method to change the systemd configuration that didn’t depend upon systemd not breaking that method in future releases, that would be a workaround.

Sylvain_Le_Blanc · February 26, 2020, 4:45pm

I am experiencing the same problem on a fully patched Debian 9.x.

As a workaround I run lxc exec C1 -- shutdown -h now.
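
For several containers, that workaround can be wrapped in a small helper. A sketch only: the function name stop_via_shutdown and the LXC override variable are invented for illustration, and it only requests the shutdown; it does not wait for the container to reach the stopped state.

```shell
#!/bin/sh
# Ask each container's own init to shut down instead of signalling it
# from outside. Arguments are container names.
# LXC names the client binary and can be overridden, e.g. with a stub.
stop_via_shutdown() {
    for name in "$@"; do
        if "${LXC:-lxc}" exec "$name" -- shutdown -h now; then
            echo "$name: shutdown requested"
        else
            echo "$name: shutdown request failed" >&2
        fi
    done
}
```

Usage would be stop_via_shutdown C1 C2, followed by lxc list to confirm the containers actually stopped.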