Cannot stop unprivileged container, not even 'kill -9' its systemd process on host

Hello,

I have an odd sort of problem. I created an unprivileged container (as a dedicated non-root user) on my Debian system running stretch (current stable). The container starts alright and I can attach to it, but I cannot stop it or shut it down using any of:

  • ‘halt’, ‘poweroff’, ‘reboot’ or ‘/sbin/shutdown -h’ from within the container,
  • ‘lxc-stop [--kill [--nolock]]’ on the host, as the user who “owns” the container,
  • or even ‘kill [-9]’ with the container’s systemd PID as ‘root’ on the host.

To create the container, I mostly followed the LXC page on the Debian Wiki, but I referred to another guide, as well, since I wanted to understand this SUBUID/SUBGID stuff and it was explained better there.

Here’s what I did to create the container:

1. Made sure all required packages were installed:

cgroupfs-mount
liblxc1
libpam-cgroup
libvirt0
lxc

and their dependencies:

libcgroup1
libnl-3-200
libnl-route-3-200
libxen-4.8
libxenstore3.0
libyajl2
python3-lxc

Some other relevant packages (like cgmanager) were already installed from earlier
experiments with LXC.

2. Checked system configuration:

# lxc-checkconfig 
Kernel configuration not found at /proc/config.gz; searching...
Kernel configuration found at /boot/config-4.9.0-4-amd64
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled

--- Control groups ---
Cgroup: enabled
Cgroup clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled

--- Misc ---
Veth pair device: enabled
Macvlan: enabled
Vlan: enabled
Bridges: enabled
Advanced netfilter: enabled
CONFIG_NF_NAT_IPV4: enabled
CONFIG_NF_NAT_IPV6: enabled
CONFIG_IP_NF_TARGET_MASQUERADE: enabled
CONFIG_IP6_NF_TARGET_MASQUERADE: enabled
CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled
FUSE (for use with lxcfs): enabled

--- Checkpoint/Restore ---
checkpoint restore: enabled
CONFIG_FHANDLE: enabled
CONFIG_EVENTFD: enabled
CONFIG_EPOLL: enabled
CONFIG_UNIX_DIAG: enabled
CONFIG_INET_DIAG: enabled
CONFIG_PACKET_DIAG: enabled
CONFIG_NETLINK_DIAG: enabled
File capabilities: enabled

3. Created a new user and system group, specially for LXC:

The system group is called ‘lxc’ and has GID 113.
The user is called ‘metis’, has UID 30000, is in a user group ‘metis’ (GID 30000) AND the system group ‘lxc’.

The user got the following SUBUID/SUBGID ranges assigned to them:

# grep metis /etc/sub[gu]id
/etc/subgid:metis:493216:65536
/etc/subuid:metis:493216:65536

The user’s home directory ‘/srv/lxc/metis’ exists, belongs to metis:metis, has permissions 0750 and is a btrfs subvolume (if that matters).
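(As an aside, for anyone following along: the entries in /etc/subuid and /etc/subgid are of the form user:start:count. A small standalone sketch, using a hypothetical variable and the metis entry from above, to turn that into the inclusive host-side ID range:)

```shell
# Hypothetical standalone sketch: an /etc/subuid entry is user:start:count;
# the entry below mirrors the one shown above for 'metis'.
entry='metis:493216:65536'
# Turn start:count into the inclusive host-side ID range start-end:
range=$(echo "$entry" | awk -F: '{ printf "%s: %d-%d", $1, $2, $2+$3-1 }')
echo "$range"
```

So the container's IDs occupy host IDs 493216 through 558751 and must not overlap any other user's range.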

4. Enabled user namespaces:

# echo 1 > /proc/sys/kernel/unprivileged_userns_clone 
# echo "kernel.unprivileged_userns_clone=1" > /etc/sysctl.d/80-lxc-userns.conf

5. Copied and adjusted the default configuration:

As the new, dedicated LXC user ‘metis’:

$ mkdir -p .config/lxc
$ cp /etc/lxc/default.conf .config/lxc/
$ echo "lxc.id_map = u 0 "`grep $USER /etc/subuid | cut --delimiter=":" --output-delimiter=" " --fields=2,3` >> .config/lxc/default.conf 
$ echo "lxc.id_map = g 0 "`grep $USER /etc/subgid | cut --delimiter=":" --output-delimiter=" " --fields=2,3` >> .config/lxc/default.conf
$ echo "lxc.mount.auto = proc:mixed sys:ro cgroup:mixed" >> .config/lxc/default.conf 

Result:

$ cat .config/lxc/default.conf
lxc.network.type = empty
lxc.id_map = u 0 493216 65536
lxc.id_map = g 0 493216 65536
lxc.mount.auto = proc:mixed sys:ro cgroup:mixed
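For reference, my understanding of the id_map semantics (a sketch, not an authoritative statement): each line maps a contiguous block of container IDs onto host IDs, so container UID c becomes host UID 493216 + c for c in 0..65535, and likewise for GIDs:

```shell
# Sketch of the id_map arithmetic:
# "lxc.id_map = u 0 493216 65536" maps container UID c to host UID 493216 + c
# for c in 0..65535; the "g" line does the same for GIDs.
base=493216
host_root=$((base + 0))         # container root, as seen on the host
host_uid_1000=$((base + 1000))  # an ordinary container user, on the host
echo "container uid 0 -> host uid $host_root"
echo "container uid 1000 -> host uid $host_uid_1000"
```

That matches the UID 493216 used in the setfacl step below: it is the container's root on the host side.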

6. Fixed access permissions to /srv/lxc/metis/.local/…

As ‘root’ (on the host):

# setfacl -m u:493216:x /srv/lxc/metis /srv/lxc/metis/.local /srv/lxc/metis/.local/share /srv/lxc/metis/.local/share/lxc

7. Actually created the container:

$ lxc-create --name metis --template download
Setting up the GPG keyring
Downloading the image index
[...]

Distribution: debian
Release: stretch
Architecture: amd64

Downloading the image index
Downloading the rootfs
Downloading the metadata
The image cache is now ready
Unpacking the rootfs

---
You just created a Debian container (release=stretch, arch=amd64, variant=default)
[...]
Use lxc-attach or chroot directly into the rootfs to set a root password
or create user accounts.

At this point, the container existed and could be started and used:

metis@iupiter:~$ lxc-ls
metis 

metis@iupiter:~$ lxc-info -n metis
Name:           metis
State:          STOPPED

metis@iupiter:~$ lxc-start -n metis

metis@iupiter:~$ lxc-attach -n metis

root@metis:/# ps -eF
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root         1     0  0 14076  3036   1 23:03 ?        00:00:00 /sbin/init
root        23     0  0  4953  3656   1 23:03 pts/2    00:00:00 /bin/bash
root        24    23  0  9576  3104   1 23:03 pts/2    00:00:00 ps -eF

But it was apparent that something was wrong with systemd:

root@metis:/# systemctl 
Failed to connect to bus: No such file or directory

… and there wasn’t even a way to shut the container down.
Not from within:

root@metis:/# /sbin/shutdown -h
Failed to connect to bus: No such file or directory

root@metis:/# /sbin/halt
Failed to connect to bus: No such file or directory
Failed to talk to init daemon.

root@metis:/# /sbin/poweroff
Failed to connect to bus: No such file or directory
Failed to talk to init daemon.

root@metis:/# /sbin/init 0
Couldn't find an alternative telinit implementation to spawn.

root@metis:/# kill 1

root@metis:/# kill -9 1

Not as user ‘metis’ from the host:

metis@iupiter:~$ lxc-stop -n metis
(hung forever, had to kill with ^C)

metis@iupiter:~$ lxc-stop -n metis --kill
(likewise -> ^C)

metis@iupiter:~$ lxc-stop -n metis --kill --nolock
(likewise -> ^C)

And not even by killing the container’s systemd process
as ‘root’ on the host:

root@iupiter:~# pstree -p
systemd(1)─┬─...
          ...
           ├─lxc-start(5971)───systemd(5982)
          ...
root@iupiter:~# kill 5982
root@iupiter:~# kill -9 5982
root@iupiter:~# ps 5982
  PID TTY      STAT   TIME COMMAND
 5982 ?        Ds     0:00 /sbin/init

In the end, the only way to shut the container down
was to reboot the system.

I would greatly appreciate any help anyone could give me
and will gladly provide any further info you might need.

I tried an unprivileged container as ‘root’, but got the same result.

Here’s what I did:

1. Added SUBUID/SUBGID ranges for user 'root’
I added the following line to both ‘/etc/subuid’ and ‘/etc/subgid’
root:558752:65536

(where 558752 is the first free SUBUID and SUBGID. I checked this, of course.)

2. Adjusted default container configuration
’/etc/lxc/default.conf’:

lxc.network.type = empty
lxc.id_map = u 0 558752 65536
lxc.id_map = g 0 558752 65536
lxc.mount.auto = proc:mixed sys:ro cgroup:mixed

3. Created the container

root@iupiter:~# lxc-create --name foobar --template download --logfile=/root/lxc_foobar.log --logpriority=WARN
Setting up the GPG keyring
Downloading the image index
...
Distribution: debian
Release: stretch
Architecture: amd64

Downloading the image index
Downloading the rootfs
Downloading the metadata
The image cache is now ready
Unpacking the rootfs

---
You just created a Debian container (release=stretch, arch=amd64, variant=default)
...

4. Started the container

root@iupiter:~# lxc-start -n foobar -o /root/lxc_foobar.log -l WARN

5. Inspected the container

root@iupiter:~# lxc-attach -n foobar
root@foobar:~# ps -eF
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root         1     0  0 14076  2860   1 Jan27 ?        00:00:00 /sbin/init
root        23     0  0  4944  3544   0 Jan27 ?        00:00:00 /bin/bash
root        24    23  0  9576  3244   1 Jan27 ?        00:00:00 ps -eF

I now believe that this is an unusually small number of processes to be running inside a freshly installed container. Also, I seem to recall that in a different, properly working (albeit privileged) container I had, there was no ? in the TTY column. And the D-Bus communication doesn’t work, again:

root@foobar:~# systemctl
Failed to connect to bus: No such file or directory

6. Tried to shut the container down

First from within:

root@foobar:~# shutdown -h now
Failed to connect to bus: No such file or directory
Failed to talk to init daemon.

root@foobar:~# kill -9 1
root@foobar:~# ps -eF
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root         1     0  0 14076   632   1 Jan27 ?        00:00:00 /sbin/init
...

Then from the host using LXC tools:

root@iupiter:~# lxc-stop -n foobar
^C

(killed after 30 mins.)

root@iupiter:~# lxc-stop -n foobar --kill
^C

(killed after 10 mins.)

root@iupiter:~# lxc-stop -n foobar --kill --nolock
^C

(killed after 15 mins.)

Finally from the host using ‘kill’:

root@iupiter:~# pstree -p
systemd(1)─┬─...
          ...
           ├─lxc-start(1960)───systemd(1964)
          ...

root@iupiter:~# kill 1964
root@iupiter:~# ps -F 1964
UID        PID  PPID  C    SZ   RSS PSR STIME TTY      STAT   TIME CMD
558752    1964  1960  0 14076   632   1 Jan22 ?        Ds     0:00 /sbin/init

root@iupiter:~# kill -9 1964
root@iupiter:~# ps -F 1964
UID        PID  PPID  C    SZ   RSS PSR STIME TTY      STAT   TIME CMD
558752    1964  1960  0 14076   632   1 Jan22 ?        Ds     0:00 /sbin/init

Again, nothing worked. I expect I’ll have to reboot the host system again to shut down the container and it is, of course, unusable to me in that state.

Any help or advice anyone could give would be greatly appreciated.

The process being stuck in I/O wait is a pretty bad sign…
Do you have anything in dmesg from the kernel which would explain why you seem to be in a deadlock type situation?

Hi Stéphane, thanks for your reply.

I hadn’t even looked at the state of the systemd process, but you’re right of course: systemd’s stuck in uninterruptible sleep… something’s wrong.

So I started the unpriv. container with lxc-start and the following new message was written to the kernel ring buffer:

[213535.429938] cgroup: new mount options do not match the existing superblock, will be ignored

At this point, the container was already stuck again.

metis@iupiter:~$ lxc-attach -n metis

root@metis:/# ps -F 1
UID        PID  PPID  C    SZ   RSS PSR STIME TTY      STAT   TIME CMD
root         1     0  0 14076  2932   0 Jan31 ?        Ds     0:00 /sbin/init

root@metis:/# systemctl
Failed to connect to bus: No such file or directory

And then, two minutes later, dmesg output this:

[213747.152186] INFO: task systemd:24179 blocked for more than 120 seconds.
[213747.152206]       Not tainted 4.9.0-5-amd64 #1 Debian 4.9.65-3+deb9u2
[213747.152216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[213747.152228] systemd         D    0 24179  24169 0x00000100
[213747.152236]  ffff97ec8c15a800 ffff97ecec7c9c00 ffff97ecc3f46f00 ffff97ecefc18940
[213747.152243]  ffff97ecb5392480 ffffb9bf40c03bc0 ffffffff9b002923 ffff97ecc3f46f00
[213747.152249]  00ff97ecc3f46f00 ffff97ecefc18940 ffff97ecebdec2c0 ffff97ecc3f46f00
[213747.152254] Call Trace:
[213747.152267]  [<ffffffff9b002923>] ? __schedule+0x233/0x6d0
[213747.152273]  [<ffffffff9b002df2>] ? schedule+0x32/0x80
[213747.152279]  [<ffffffff9b005999>] ? rwsem_down_write_failed+0x1f9/0x360
[213747.152287]  [<ffffffff9ac801d0>] ? kernfs_sop_show_options+0x30/0x30
[213747.152293]  [<ffffffff9ad38213>] ? call_rwsem_down_write_failed+0x13/0x20
[213747.152298]  [<ffffffff9b005039>] ? down_write+0x29/0x40
[213747.152304]  [<ffffffff9ac055ab>] ? grab_super+0x2b/0x90
[213747.152309]  [<ffffffff9ac05b43>] ? sget_userns+0x163/0x490
[213747.152314]  [<ffffffff9ac80240>] ? kernfs_sop_show_path+0x40/0x40
[213747.152319]  [<ffffffff9ac8044a>] ? kernfs_mount_ns+0x7a/0x220
[213747.152324]  [<ffffffff9ab0dba4>] ? cgroup_mount+0x334/0x810
[213747.152331]  [<ffffffff9ac06ae6>] ? mount_fs+0x36/0x150
[213747.152336]  [<ffffffff9ac23f32>] ? vfs_kern_mount+0x62/0x100
[213747.152340]  [<ffffffff9ac263ff>] ? do_mount+0x1cf/0xc80
[213747.152346]  [<ffffffff9ac271de>] ? SyS_mount+0x7e/0xd0
[213747.152351]  [<ffffffff9aa03b1c>] ? do_syscall_64+0x7c/0xf0
[213747.152357]  [<ffffffff9b0076ee>] ? entry_SYSCALL64_slow_path+0x25/0x25

This repeats every 2 minutes with exactly the same call trace.

Looks like a problem mounting cgroups. Does that make sense?
I’m a bit out of my depth here, but for what it’s worth:

  • I was able to create a privileged container on the same host and it runs normally
  • I tried to reproduce this issue on a desktop machine, also running Debian stretch, but everything worked as it should there, i.e. I could create, start and stop unprivileged containers without any problem.
  • The afflicted server has a btrfs rootfs (the desktop machine uses ext4).
  • I started working through Konstantin Ivanov’s book Containerization with LXC and from what I can see the namespaces all seem to work correctly - for root and for users with and without SUBUID/SUBGID ranges. I haven’t gotten to the cgroups yet.

On a whim, I had a look at the cgroup mounts in an unprivileged container on the other (desktop) host, where containers run without problems:

root@bar:/# mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755,uid=1738400,gid=1738400)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,clone_children)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)

… and then compared that to the list of cgroup mounts in the unprivileged container that gets stuck, which is as follows:

root@metis:/# mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755,uid=493216,gid=493216)

That’s all. So I suspect that it’s the mounting of the first cgroup that gets stuck, somehow, and blocks the container henceforth.

I don’t know in what order these mounts happen, but I believe mount shows them in that order, so it would be the mounting of /sys/fs/cgroup/systemd that’s blocking.
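(My belief about the ordering rests on mount(8) reporting entries in mount order; the same order can be read from /proc/self/mountinfo, where field 5 is the mount point. A self-contained sketch over a sample excerpt -- the two lines are illustrative, modeled on the working desktop host above, with made-up mount IDs and options:)

```shell
# Self-contained sketch: field 5 of /proc/<pid>/mountinfo is the mount point,
# and entries appear in mount order. The sample lines are illustrative,
# modeled on the working desktop host (mount IDs and options are made up).
sample='30 25 0:26 / /sys/fs/cgroup rw shared:9 - tmpfs tmpfs rw,mode=755
31 30 0:27 / /sys/fs/cgroup/systemd rw shared:10 - cgroup cgroup rw,name=systemd'
first=$(echo "$sample" | awk 'NR==1 { print $5 }')
echo "mounted first: $first"
```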

I also wanted to compare this to the privileged container on the same host, which used to work fine.
Now, though, that one is getting stuck in the same way, too. It looks like the first frozen or stuck cgroup mount blocks all subsequent ones, even across mount namespaces, if that’s even possible.

I changed the lxc.mount.auto line in the .config/lxc/default.conf file in the following ways:
1. lxc.mount.auto = proc:mixed sys:ro cgroup:rw
2. lxc.mount.auto = proc:mixed sys:ro
3. (removed it completely)

Before each change, I destroyed the unprivileged container, then made the change and finally recreated the container again, but to no avail. It always got stuck the same way.

I’ve got the exact same problem! Can’t stop containers; the only way is to ‘kill -9’ the lxc monitor. I’d like to add that my services (such as nginx) inside the LXC won’t start with the container! This definitely looks like a systemd issue!

EDIT :
found this using dmesg :

[ 491.113987] cgroup: new mount options do not match the existing superblock, will be ignored
[ 726.265713] INFO: task systemd:2932 blocked for more than 120 seconds.
[ 726.267047] Not tainted 4.9.0-6-amd64 #1 Debian 4.9.82-1+deb9u2
[ 726.268322] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.269640] systemd D 0 2932 2924 0x00000100
[ 726.271092] ffff97dad13e4800 0000000000000000 ffff97da54ae0080 ffff97dadf398940
[ 726.272427] ffff97dad5b55080 ffffa34c4391bbb0 ffffffff8580c649 ffff97da54ae0080
[ 726.273797] 0000000000000000 ffff97dadf398940 ffff97dacfdd7078 ffff97da54ae0080
[ 726.275158] Call Trace:
[ 726.276467] [] ? __schedule+0x239/0x6f0
[ 726.277819] [] ? schedule+0x32/0x80
[ 726.279076] [] ? rwsem_down_write_failed+0x1f9/0x360
[ 726.280299] [] ? kernfs_sop_show_options+0x40/0x40
[ 726.281530] [] ? call_rwsem_down_write_failed+0x13/0x20
[ 726.282784] [] ? down_write+0x29/0x40
[ 726.284101] [] ? grab_super+0x2b/0x90
[ 726.285369] [] ? sget_userns+0x165/0x490
[ 726.286644] [] ? kernfs_sop_show_path+0x50/0x50
[ 726.287924] [] ? kernfs_mount_ns+0x7a/0x220
[ 726.289163] [] ? cgroup_mount+0x334/0x820
[ 726.290435] [] ? mount_fs+0x3b/0x160
[ 726.291631] [] ? vfs_kern_mount+0x62/0x100
[ 726.292779] [] ? do_mount+0x1cf/0xc80
[ 726.293941] [] ? SyS_mount+0x7e/0xd0
[ 726.295099] [] ? do_syscall_64+0x8f/0xf0
[ 726.296222] [] ? entry_SYSCALL_64_after_swapgs+0x42/0xb0

Looks like a kernel bug in the cgroupfs mount codepath, I’d recommend filing a kernel bug with your Linux distribution as that kernel bug would certainly explain hung processes.

The thing is, lxc-stop used to work! That’s why I don’t think this is a kernel issue. I’m not 100% sure, but I think this bug appeared when I installed libpam-cgfs… Before that, in order to be able to start a container I had to ssh into the container and run a script I found on the Internet (maybe it’s yours, not sure) as root on the host. I magically discovered libpam-cgfs, which solved my lxc-start problem; the thing is that now lxc-stop hangs… :( I installed Debian because of its stability, but Debian is a pain in the A when dealing with LXC…

I tried to uninstall libpam-cgfs but it didn’t work. After hours and hours of searching, hard rebooting my server almost every hour, I found this : https://lists.linuxcontainers.org/pipermail/lxc-users/2016-December/012612.html

It says that if the umask is not permissive enough, containers won’t stop. And indeed, I recently changed the umask to 027 because I think Debian’s default of 022 is too permissive. I’ve restored it, and now it’s working again!! Tell me if I’m wrong, but I don’t think this is a kernel issue, but rather LXC not chmod-ing files properly.

Same here,

I too found Debian’s default umask too permissive and had also changed the system-wide umask from 0022 to 0027, but this was years ago. I had not found the message on the lxc-users mailing list that @ShellCode mentions above, so it didn’t occur to me that this may be an issue.

I can confirm that setting umask to 0022 for my dedicated LXC container users fixes the problem for me.

Shutting the container down was a bit slow (I’ll investigate more), but did work.

If I find the time, I might dig into the issue a bit more, to really understand what’s going on. For now, though, I consider this issue fixed (albeit by workaround).
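In case it helps anyone else judge whether their umask is the culprit: here is a quick, self-contained demonstration (not LXC-specific) of how the login umask changes the mode of freshly created directories, which is presumably what bites the cgroup/runtime directories created on container start:

```shell
# Demonstrate the effect of umask 027 vs. Debian's default 022 on new dirs:
tmp=$(mktemp -d)
( umask 027; mkdir "$tmp/d027" )   # group loses write, others lose everything
( umask 022; mkdir "$tmp/d022" )   # Debian's default
perm027=$(stat -c '%a' "$tmp/d027")
perm022=$(stat -c '%a' "$tmp/d022")
echo "umask 027 -> $perm027, umask 022 -> $perm022"
rm -rf "$tmp"
```

With 027 the “others” bits are stripped entirely (750 instead of 755), which would shut the container’s mapped IDs out of anything created that way.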

Thank you both, I also ran into this exact issue, apparently because of my restrictive umask as well.

Changing my unprivileged user’s umask to 0022 allowed the container to start up properly.

Once I got it working once, I am now able to create and start new unprivileged containers, even after changing the default umask back to 0027. I tried to put things back the way they were, but I am now unable to reproduce the original problem on this system.

Update: this depends on the unprivileged user’s umask at login time, maybe having to do with the permissions of containerized cgroups created by lxcfs / pam_cgfs.so. Will dig in further.