LXCs failing to start (creating a new LXC fails as well)

Hello, I recently started having a problem with unprivileged containers on a bare-metal Proxmox host (Debian Bullseye).

I’ve had LXCs running for more than a year without issue. Then about 10 days ago they stopped starting. Creating new privileged containers works fine. VMs run fine. I see no other system issues. The underlying filesystems are ZFS and scrub without issue. Nothing glaring in journalctl or dmesg other than messages directly related to the problem (which I’ll add below).

I opened a thread on the Proxmox forums with a lot of detail about the system config and errors, plus follow-up from support personnel, but there is no solution as yet.

I can’t really correlate the problem with anything I’ve done on the system other than some hardware changes (PCIe card swaps for testing a U.2 SSD and an old Radeon GPU, enabling a 2nd nic on the motherboard, adding an additional spinning HD, etc.)

At the time, the system was mostly up to date with pve/debian packages (within a couple of weeks), and it is fully up to date now. Updating made no difference to the issue.

Today I hit up IRC and someone there very helpfully walked me through a bunch of troubleshooting steps and asked me to follow up here.

I will try to summarize, but a lot of the walkthrough was new territory for me, so please bear with me. Mostly this is just a copy/paste; some of it is edited for brevity.

<gl0woo> error on startup >> ()lxc_spawn: 1734 Operation not permitted - Failed to clone a new set of namespaces
<gl0woo> also cannot create new unprivileged lxc >> ../src/lxc/cmd/lxc_usernsexec.c: 407: main - Operation not permitted - Failed to unshare mount and user namespace
<gl0woo> & >> ../src/lxc/cmd/lxc_usernsexec.c: 452: main - Inappropriate ioctl for device - Failed to read from pipe file descriptor 3
<gl0woo> lxc-checkconfig >> LXC version 5.0.0
<gl0woo> the issues that program finds are as follows (seems the rest is ok)
<gl0woo> Cgroup v1 systemd controller: missing
<gl0woo> Cgroup v1 freezer controller: missing
<gl0woo> Cgroup ns_cgroup: required
<amikhalitsyn> gl0woo: did you perform any update of your system recently (kernel, userspace)? Please check cat /proc/sys/kernel/unprivileged_userns_clone
<gl0woo> i believe the problems started after a reboot, having made some hardware changes, swapping pcie cards, enabling a 2nd nic on the m/b 
<gl0woo> but no system package updates
<gl0woo> i have since updated packages, debian bullseye
<gl0woo> 'uname -a' >> Linux host 5.15.74-1-pve #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) x86_64 GNU/Linux
<gl0woo> the contents of that file is '1'
<gl0woo> >> root@host:~# ls -laFtrh /proc/sys/kernel/unprivileged_userns_clone
<gl0woo> >> -rw-r--r-- 1 root root 0 Dec  1 07:02 /proc/sys/kernel/unprivileged_userns_clone
<gl0woo> there are some related messages seen with journalctl, these are the last four...
<gl0woo> Dec 01 02:42:07 host systemd[1]: pve-container@123.service: Main process exited, code=exited, status=1/FAILURE
<gl0woo> Dec 01 02:42:07 host systemd[1]: pve-container@123.service: Failed with result 'exit-code'.
<gl0woo> Dec 01 02:42:15 host pvestatd[4150]: modified cpu set for lxc/123: 0-3
<gl0woo> Dec 01 02:42:15 host pvestatd[4150]: failed to open '/sys/fs/cgroup/lxc/123/cpuset.cpus' - Permission denied
<amikhalitsyn> gl0woo: so, probably you had kernel package update long time before reboot. So, after reboot you've got into the new kernel.
<gl0woo> that's possible
<gl0woo> but of all the people running proxmox i'm seemingly the only one with the issue. i've had a thread over there open for a week.
<amikhalitsyn> please, recheck and confirm that you have "1" in /proc/sys/kernel/unprivileged_userns_clone
<amikhalitsyn> EPERM from unshare most probably comes from the check where this sysctl knob is involved
<snip>
<gl0woo> cat shows it contains '1' 
<amikhalitsyn> okay, "1" is good. Strange. Then please check dmesg | grep audit, possibly you will notice some denials
<gl0woo> [ 1652.595589] audit: type=1400 audit(1669903901.683:21): apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-123_</var/lib/lxc>" pid=32783 comm="apparmor_parser"
<gl0woo> [ 1652.745389] audit: type=1400 audit(1669903901.835:22): apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-start" name="lxc-123_</var/lib/lxc>" pid=32785 comm="apparmor_parser"
<amikhalitsyn> try dmesg | grep DENIED
<gl0woo> nothing for that
<amikhalitsyn> okay, then we can just perform tracing of your kernel to understand what happens (-:
<gl0woo> ok
<amikhalitsyn> https://gist.github.com/mihalicyn/586a8650ca4ca782cf09a23f19cb0db2
<gl0woo> https://pastebin.com/ifBGq0px
<amikhalitsyn> This trace describes successful unshare() call.
<amikhalitsyn> Have you reproduced EPERM failure during tracing?
<gl0woo> this is two different lxc's both failed >> https://pastebin.com/RYbXFG4k
<amikhalitsyn> perf probe 'unshare_userns%return $retval'
<amikhalitsyn> perf probe 'unshare_nsproxy_namespaces%return $retval'
<amikhalitsyn> perf probe 'ksys_unshare%return $retval'
<amikhalitsyn> You need to execute these 3 commands, then run gftrace (as before) and reproduce the problem
<amikhalitsyn> From your trace, I can't see that userns was allocated which is really strange.
<amikhalitsyn> you can also run unshare -Um if it fails. It should be sufficient for our needs.
<amikhalitsyn> Because unshare -Um should work.
<gl0woo> here you go >> https://pastebin.com/T9PYUBdQ
<amikhalitsyn> /* ksys_unshare__return: (__x64_sys_unshare+0x12/0x20 <- ksys_unshare) arg1=0x0 */
<amikhalitsyn> so, unshare returned 0. it's successful run
<amikhalitsyn> okay. let's try from the other side. Run gftrace and in parallel strace -e unshare,setns -f unshare -mnU true and show output of strace+gftrace
<amikhalitsyn> strace -e unshare,setns -f unshare -mnU true
<amikhalitsyn> that's one command.
<amikhalitsyn> I think you need to file a topic on our forum https://discuss.linuxcontainers.org/
<amikhalitsyn> And we'll continue investigation of your problem.
<gl0woo> here's the gftrace >> https://pastebin.com/dQ7GsWUY
<gl0woo> root@host:~# strace -e unshare,setns -f unshare -mnU true
<gl0woo> unshare(CLONE_NEWNS|CLONE_NEWUSER|CLONE_NEWNET) = -1 EPERM (Operation not permitted)
<gl0woo> unshare: unshare failed: Operation not permitted
<gl0woo> +++ exited with 1 +++
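
For anyone who wants to repeat the two basic checks from the chat in one go, here is a minimal sketch. `check_userns` is a hypothetical helper name, not something from LXC or Proxmox. As far as I know, `/proc/sys/kernel/unprivileged_userns_clone` only exists on kernels that carry the Debian-style patch, so the function reports it as absent elsewhere.

```shell
#!/bin/sh
# Sketch only: report the userns sysctl knob and whether unshare(1)
# can create a new user + mount namespace.
check_userns() {
    knob=/proc/sys/kernel/unprivileged_userns_clone
    if [ -r "$knob" ]; then
        echo "unprivileged_userns_clone=$(cat "$knob")"
    else
        echo "unprivileged_userns_clone=absent"
    fi

    # Same test amikhalitsyn suggested: unshare -Um should normally work.
    if unshare -Um true 2>/dev/null; then
        echo "unshare=ok"
    else
        echo "unshare=failed"
    fi
}

check_userns
```

On my host this matched the log above: the knob read 1 and the unshare still failed, which is why the sysctl could be ruled out.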

Here is the full output of ‘lxc-checkconfig’:

root@host:~# lxc-checkconfig
LXC version 5.0.0
Kernel configuration not found at /proc/config.gz; searching...
Kernel configuration found at /boot/config-5.15.74-1-pve
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled

--- Control groups ---
Cgroups: enabled
Cgroup namespace: enabled

Cgroup v1 mount points: 


Cgroup v2 mount points: 
/sys/fs/cgroup

Cgroup v1 systemd controller: missing
Cgroup v1 freezer controller: missing
Cgroup ns_cgroup: required
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled

--- Misc ---
Veth pair device: enabled, not loaded
Macvlan: enabled, not loaded
Vlan: enabled, not loaded
Bridges: enabled, not loaded
Advanced netfilter: enabled, not loaded
CONFIG_IP_NF_TARGET_MASQUERADE: enabled, not loaded
CONFIG_IP6_NF_TARGET_MASQUERADE: enabled, not loaded
CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled, not loaded
CONFIG_NETFILTER_XT_MATCH_COMMENT: enabled, loaded
FUSE (for use with lxcfs): enabled, not loaded

--- Checkpoint/Restore ---
checkpoint restore: enabled
CONFIG_FHANDLE: enabled
CONFIG_EVENTFD: enabled
CONFIG_EPOLL: enabled
CONFIG_UNIX_DIAG: enabled
CONFIG_INET_DIAG: enabled
CONFIG_PACKET_DIAG: enabled
CONFIG_NETLINK_DIAG: enabled
File capabilities: 

Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig

Again, there is more detail in the Proxmox thread about the system config, which package versions are running, etc. Sorry for the very long-winded first post.

Thanks!

Hello again. This is resolved. The problem was with my ZFS configuration: I had a recursive rpool snapshot on a non-root dataset mounted at ‘/’.

In a nutshell, there were two datasets mounted at root at the same time.
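
A quick way to spot this kind of double mount is to look for mountpoints that appear more than once in the mount table. This is a rough sketch, not what I actually ran at the time; `dup_mountpoints` is a made-up helper name, and it reads `/proc/self/mounts` unless another file is passed.

```shell
#!/bin/sh
# Sketch only: print every mountpoint that is listed more than once.
# Field 2 of each line in the mounts table is the mountpoint.
dup_mountpoints() {
    awk '{ print $2 }' "${1:-/proc/self/mounts}" | sort | uniq -d
}
```

Running `dup_mountpoints` with no argument checks the live system. Some duplicates can be legitimate overmounts, so treat the output as a list of things to look at; a second ZFS dataset showing up at `/` is the case that bit me. `zfs list -o name,mountpoint,canmount,mounted` then shows which datasets are involved.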

As soon as I unmounted the snapshot, changed canmount to noauto, and changed the mountpoint to the path where the remote snapshot actually lives, I tested 3 unprivileged LXCs and they all started fine.
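
Roughly, those three steps look like the sketch below. The dataset name `rpool/backup` and the path `/mnt/backup` are placeholders I made up for illustration, not my real names, and the `run` wrapper only prints the commands unless `DRY_RUN=0` is set, so nothing changes by accident.

```shell
#!/bin/sh
# Sketch only. DATASET and NEW_MOUNTPOINT are hypothetical placeholders.
DATASET="${DATASET:-rpool/backup}"
NEW_MOUNTPOINT="${NEW_MOUNTPOINT:-/mnt/backup}"

# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

fix_stray_root_mount() {
    run zfs unmount "$DATASET"                           # stop it shadowing '/'
    run zfs set canmount=noauto "$DATASET"               # no automatic mount at boot
    run zfs set mountpoint="$NEW_MOUNTPOINT" "$DATASET"  # move it off '/'
}

fix_stray_root_mount
```

Check the printed commands against your own dataset layout before running with `DRY_RUN=0`.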

I also reproduced the error by changing the mountpoint back to ‘/’ and mounting it again; the same errors appeared on LXC startup. There’s more detail on the Proxmox thread if you are interested.

Thanks to @amikhalitsyn, and I apologize for taking up their time.

No problem at all. Thanks for sharing the root cause.