Docker run fails in container with security.nesting true

Environment

  • 5.10.0-kali7-amd64
  • lxd version 4.14 (from snap)
  • cgroup V2
  • docker version 20.10.2

Per readthedocs :

In order to run Docker inside a LXD container the security.nesting property of the container should be set to true.

Docker run (in container):

$ sudo docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
b8dfde127a29: Pull complete 
Digest: sha256:5122f6204b6a3596e048758cabba3c46b1c937a46b5be6225b835d091b90e46c
Status: Downloaded newer image for hello-world:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: can't load program: operation not permitted: unknown

Container logs:

Jun  2 16:04:28 docker systemd-networkd[176]: vethd95335d: Link UP
Jun  2 16:04:28 docker networkd-dispatcher[207]: WARNING:Unknown index 3 seen, reloading interface list
Jun  2 16:04:28 docker systemd-udevd[2637]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun  2 16:04:28 docker systemd-udevd[2637]: Using default interface naming scheme 'v245'.
Jun  2 16:04:28 docker systemd-udevd[2638]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun  2 16:04:28 docker systemd-udevd[2637]: veth06e9a85: Could not generate persistent MAC: No data available
Jun  2 16:04:28 docker systemd-udevd[2638]: Using default interface naming scheme 'v245'.
Jun  2 16:04:28 docker systemd-udevd[2638]: vethd95335d: Could not generate persistent MAC: No data available
Jun  2 16:04:28 docker containerd[905]: time="2021-06-02T16:04:28.479986977Z" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/93248ab0298f22d8899d8f9d51335962b37a7baeaa39b054ef929389dbc74703 pid=2654
Jun  2 16:04:28 docker systemd[1]: Started libcontainer container 93248ab0298f22d8899d8f9d51335962b37a7baeaa39b054ef929389dbc74703.
Jun  2 16:04:28 docker systemd[1]: docker-93248ab0298f22d8899d8f9d51335962b37a7baeaa39b054ef929389dbc74703.scope: Succeeded.
Jun  2 16:04:28 docker containerd[905]: time="2021-06-02T16:04:28.548994221Z" level=info msg="shim disconnected" id=93248ab0298f22d8899d8f9d51335962b37a7baeaa39b054ef929389dbc74703
Jun  2 16:04:28 docker dockerd[1118]: time="2021-06-02T16:04:28.549253741Z" level=error msg="stream copy error: reading from a closed fifo"
Jun  2 16:04:28 docker dockerd[1118]: time="2021-06-02T16:04:28.549286841Z" level=error msg="stream copy error: reading from a closed fifo"
Jun  2 16:04:28 docker systemd-networkd[176]: vethd95335d: Link DOWN
Jun  2 16:04:28 docker systemd-networkd[176]: rtnl: received neighbor for link '4' we don't know about, ignoring.
Jun  2 16:04:28 docker systemd-networkd[176]: rtnl: received neighbor for link '4' we don't know about, ignoring.
Jun  2 16:04:28 docker dockerd[1118]: time="2021-06-02T16:04:28.722331441Z" level=error msg="93248ab0298f22d8899d8f9d51335962b37a7baeaa39b054ef929389dbc74703 cleanup: failed to delete container from containerd: no such container"
Jun  2 16:04:28 docker dockerd[1118]: time="2021-06-02T16:04:28.722363971Z" level=error msg="Handler for POST /v1.41/containers/93248ab0298f22d8899d8f9d51335962b37a7baeaa39b054ef929389dbc74703/start returned error: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: can't load program: operation not permitted: unknown"

I suspect the issue is with cgroupv2. That support in Docker is still quite new and likely hitting some kind of issue when combined with nesting.

We have daily automated tests for running Docker in LXD and those haven’t failed in a very long time, but they test current Docker git on LXD servers running with cgroupv1.

Gotcha. Well, as of Docker 20.10, they’re supposed to have support for cgroup V2.

This is definitely a docker issue? How can I verify?

The couldn't load program here suggests to me that they’re trying to load a devices cgroup policy which cannot be done from within an unprivileged container.

It’d be great if Docker could just log a message and move on in such cases as when running inside an unprivileged container, your access to devices is already restricted by the parent container. Systemd for example has that approach of logging it but then moving on.

You could try setting security.syscalls.intercept.bpf=true and security.syscalls.intercept.bpf.devices=true and then restart your container. What this will do is cause LXD to intercept all ebpf syscalls, run them through a policy engine as real root on the host and if the ebpf program is just meant to attach to a cgroupv2 devices controller, it will load it for you as real root.

This can come with some security and denial of service concerns though as you’re effectively allowing an unprivileged container to trigger logic on the host which will run as real root. That’s why none of those options are enabled by default.

@cyphar is that a runc error? If so, any issue with having it detect the ebpf EPERM combined with running inside an unpriv container and turning this into a warning rather than an error?

In order to run Docker inside a LXD container the security.nesting property of the container should be set to true .
Note that LXD containers cannot load kernel modules, so depending on your Docker configuration you may need to have the needed extra kernel modules loaded by the host.

You can do so by setting a comma separate list of kernel modules that your container needs with:
lxc config set linux.kernel_modules

@darmon77 It’s all set:

$ lxc config show docker
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
architecture: x86_64
config:
image.architecture: x86_64
image.description: Ubuntu 20.04 LTS server (20210510)
image.os: ubuntu
image.release: focal
linux.kernel_modules: arp_tables,br_netfilter,ip6table_filter,iptable_filter,overlay
security.nesting: “true”
volatile.base_image: fa972674c1b11e54f47e99fac4ecfb57e1a019faa70e654dbfe0f11712655c1a
volatile.eth0.host_name: veth04c1e441
volatile.eth0.hwaddr: 00:16:3e:c9:8e:69
volatile.idmap.base: “0”
volatile.idmap.current: ‘[{“Isuid”:true,“Isgid”:false,“Hostid”:1000000,“Nsid”:0,“Maprange”:1000000000},{“Isuid”:false,“Isgid”:true,“Hostid”:1000000,“Nsid”:0,“Maprange”:1000000000}]’
volatile.idmap.next: ‘[{“Isuid”:true,“Isgid”:false,“Hostid”:1000000,“Nsid”:0,“Maprange”:1000000000},{“Isuid”:false,“Isgid”:true,“Hostid”:1000000,“Nsid”:0,“Maprange”:1000000000}]’
volatile.last_state.idmap: ‘[{“Isuid”:true,“Isgid”:false,“Hostid”:1000000,“Nsid”:0,“Maprange”:1000000000},{“Isuid”:false,“Isgid”:true,“Hostid”:1000000,“Nsid”:0,“Maprange”:1000000000}]’
volatile.last_state.power: RUNNING
volatile.uuid: a202841b-c4d5-4f69-a181-6407b5b13efb
devices: {}
ephemeral: false
profiles:
- default
stateful: false
description: “”

@stgraber I have similar experience with privileged container. But I will try your suggestions.

@stgraber

I have applied these configs:

$ lxc config show docker                                            
WARNING: cgroup v2 is not fully supported yet, proceeding with partial confinement
architecture: x86_64
config:
  image.architecture: x86_64
  image.description: Ubuntu 20.04 LTS server (20210510)
  image.os: ubuntu
  image.release: focal
  linux.kernel_modules: arp_tables,br_netfilter,ip6table_filter,iptable_filter,overlay
  security.nesting: "true"
  security.syscalls.intercept.bpf: "true"
  security.syscalls.intercept.bpf.devices: "true"
  volatile.base_image: fa972674c1b11e54f47e99fac4ecfb57e1a019faa70e654dbfe0f11712655c1a
  volatile.eth0.host_name: veth1b2f173d
  volatile.eth0.hwaddr: 00:16:3e:c9:8e:69
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: a202841b-c4d5-4f69-a181-6407b5b13efb
devices: {}
ephemeral: false
profiles:
- default
stateful: false
description: ""

I’m still seeing the following:

error

$ sudo docker run hello-world 
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: can't load program: operation not permitted: unknown.
ERRO[0000] error waiting for container: context canceled 

logs

Jun  2 19:44:29 docker systemd-networkd[178]: vethf8c705c: Link UP
Jun  2 19:44:29 docker networkd-dispatcher[211]: WARNING:Unknown index 7 seen, reloading interface list
Jun  2 19:44:29 docker systemd-udevd[1162]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun  2 19:44:29 docker systemd-udevd[1162]: Using default interface naming scheme 'v245'.
Jun  2 19:44:29 docker systemd-udevd[1162]: veth8e97b63: Could not generate persistent MAC: No data available
Jun  2 19:44:29 docker systemd-udevd[1163]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun  2 19:44:29 docker systemd-udevd[1163]: Using default interface naming scheme 'v245'.
Jun  2 19:44:29 docker systemd-udevd[1163]: vethf8c705c: Could not generate persistent MAC: No data available
Jun  2 19:44:30 docker containerd[241]: time="2021-06-02T19:44:29.999932313Z" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/a8780af0b35f4801788bdd103496d0ad1ca8ef9edf04a347a3db018ceba6776d pid=1180
Jun  2 19:44:30 docker systemd[1]: Attaching device control BPF program to cgroup /system.slice/docker-a8780af0b35f4801788bdd103496d0ad1ca8ef9edf04a347a3db018ceba6776d.scope failed: Operation not permitted
Jun  2 19:44:30 docker systemd[1]: Started libcontainer container a8780af0b35f4801788bdd103496d0ad1ca8ef9edf04a347a3db018ceba6776d.
Jun  2 19:44:30 docker systemd[1]: Attaching device control BPF program to cgroup /system.slice/docker-a8780af0b35f4801788bdd103496d0ad1ca8ef9edf04a347a3db018ceba6776d.scope failed: Operation not permitted
Jun  2 19:44:30 docker systemd[1]: docker-a8780af0b35f4801788bdd103496d0ad1ca8ef9edf04a347a3db018ceba6776d.scope: Succeeded.
Jun  2 19:44:30 docker systemd[1]: Stopped libcontainer container a8780af0b35f4801788bdd103496d0ad1ca8ef9edf04a347a3db018ceba6776d.
Jun  2 19:44:30 docker containerd[241]: time="2021-06-02T19:44:30.085058802Z" level=info msg="shim disconnected" id=a8780af0b35f4801788bdd103496d0ad1ca8ef9edf04a347a3db018ceba6776d
Jun  2 19:44:30 docker dockerd[302]: time="2021-06-02T19:44:30.085393882Z" level=error msg="stream copy error: reading from a closed fifo"
Jun  2 19:44:30 docker dockerd[302]: time="2021-06-02T19:44:30.085443862Z" level=error msg="stream copy error: reading from a closed fifo"
Jun  2 19:44:30 docker systemd-networkd[178]: vethf8c705c: Link DOWN
Jun  2 19:44:30 docker systemd-networkd[178]: rtnl: received neighbor for link '8' we don't know about, ignoring.
Jun  2 19:44:30 docker systemd-networkd[178]: rtnl: received neighbor for link '8' we don't know about, ignoring.
Jun  2 19:44:30 docker dockerd[302]: time="2021-06-02T19:44:30.229934562Z" level=error msg="a8780af0b35f4801788bdd103496d0ad1ca8ef9edf04a347a3db018ceba6776d cleanup: failed to delete container from containerd: no such container"
Jun  2 19:44:30 docker dockerd[302]: time="2021-06-02T19:44:30.229986822Z" level=error msg="Handler for POST /v1.41/containers/a8780af0b35f4801788bdd103496d0ad1ca8ef9edf04a347a3db018ceba6776d/start returned error: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: can't load program: operation not permitted: unknown"

So this makes me think it’s something else (runc, like you mentioned?)

With security.privileged=true:

Error

$ sudo docker run hello-world 

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

docker: Error response from daemon: ttrpc: closed: unknown.
ERRO[0000] error waiting for container: context canceled 

Logs

Jun  2 20:17:36 docker systemd-networkd[168]: veth7aeb2f7: Link UP
Jun  2 20:17:36 docker networkd-dispatcher[200]: WARNING:Unknown index 3 seen, reloading interface list
Jun  2 20:17:36 docker kernel: [18010.861635] docker0: port 1(veth7aeb2f7) entered blocking state
Jun  2 20:17:36 docker kernel: [18010.861637] docker0: port 1(veth7aeb2f7) entered disabled state
Jun  2 20:17:36 docker kernel: [18010.861665] device veth7aeb2f7 entered promiscuous mode
Jun  2 20:17:36 docker containerd[224]: time="2021-06-02T20:17:36.622281101Z" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/3155e403f37806f700987fee45d63c8d1733c9d00f8718af4175515dfe5d4c53 pid=997
Jun  2 20:17:36 docker systemd[1]: Started libcontainer container 3155e403f37806f700987fee45d63c8d1733c9d00f8718af4175515dfe5d4c53.
Jun  2 20:17:36 docker systemd-networkd[168]: veth7aeb2f7: Gained carrier
Jun  2 20:17:36 docker systemd-networkd[168]: docker0: Gained carrier
Jun  2 20:17:36 docker kernel: [18011.103311] IPv6: ADDRCONF(NETDEV_CHANGE): veth7aeb2f7: link becomes ready
Jun  2 20:17:36 docker kernel: [18011.103341] docker0: port 1(veth7aeb2f7) entered blocking state
Jun  2 20:17:36 docker kernel: [18011.103378] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
Jun  2 20:17:36 docker containerd[224]: time="2021-06-02T20:17:36.820867900Z" level=error msg="loading cgroup for 1019" error="cgroups: cgroup mountpoint does not exist"
Jun  2 20:17:36 docker containerd[224]: time="2021-06-02T20:17:36.831200854Z" level=error msg="loading cgroup for 1019" error="cgroups: cgroup mountpoint does not exist"
Jun  2 20:17:36 docker systemd[1]: docker-3155e403f37806f700987fee45d63c8d1733c9d00f8718af4175515dfe5d4c53.scope: Succeeded.
Jun  2 20:17:36 docker containerd[224]: time="2021-06-02T20:17:36.833900025Z" level=info msg="shim disconnected" id=3155e403f37806f700987fee45d63c8d1733c9d00f8718af4175515dfe5d4c53
Jun  2 20:17:36 docker containerd[224]: time="2021-06-02T20:17:36.833947385Z" level=warning msg="cleaning up after shim disconnected" id=3155e403f37806f700987fee45d63c8d1733c9d00f8718af4175515dfe5d4c53 namespace=moby
Jun  2 20:17:36 docker containerd[224]: time="2021-06-02T20:17:36.833972645Z" level=info msg="cleaning up dead shim"
Jun  2 20:17:36 docker dockerd[272]: time="2021-06-02T20:17:36.834343575Z" level=error msg="failed to delete task after fail start" container=3155e403f37806f700987fee45d63c8d1733c9d00f8718af4175515dfe5d4c53 error="not found" module=libcontainerd namespace=moby
Jun  2 20:17:36 docker containerd[224]: time="2021-06-02T20:17:36.841494037Z" level=warning msg="cleanup warnings time=\"2021-06-02T20:17:36Z\" level=info msg=\"starting signal loop\" namespace=moby pid=1077\n"
Jun  2 20:17:36 docker dockerd[272]: time="2021-06-02T20:17:36.841931467Z" level=info msg="ignoring event" container=3155e403f37806f700987fee45d63c8d1733c9d00f8718af4175515dfe5d4c53 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun  2 20:17:36 docker systemd-networkd[168]: veth7aeb2f7: Lost carrier
Jun  2 20:17:36 docker networkd-dispatcher[200]: WARNING:Unknown index 3 seen, reloading interface list
Jun  2 20:17:36 docker systemd-networkd[168]: veth7aeb2f7: Link DOWN
Jun  2 20:17:36 docker kernel: [18011.270696] docker0: port 1(veth7aeb2f7) entered disabled state
Jun  2 20:17:36 docker systemd-networkd[168]: rtnl: received neighbor for link '4' we don't know about, ignoring.
Jun  2 20:17:36 docker systemd-networkd[168]: rtnl: received neighbor for link '4' we don't know about, ignoring.
Jun  2 20:17:37 docker systemd[822]: run-docker-netns-54aac26d4284.mount: Succeeded.
Jun  2 20:17:37 docker systemd[1]: run-docker-netns-54aac26d4284.mount: Succeeded.
Jun  2 20:17:37 docker dockerd[272]: time="2021-06-02T20:17:37.078877009Z" level=error msg="3155e403f37806f700987fee45d63c8d1733c9d00f8718af4175515dfe5d4c53 cleanup: failed to delete container from containerd: no such container"
Jun  2 20:17:37 docker dockerd[272]: time="2021-06-02T20:17:37.078912279Z" level=error msg="Handler for POST /v1.41/containers/3155e403f37806f700987fee45d63c8d1733c9d00f8718af4175515dfe5d4c53/start returned error: ttrpc: closed: unknown"
Jun  2 20:17:37 docker dockerd[272]: time="2021-06-02T20:17:37.079306899Z" level=warning msg="error locating sandbox id 54aac26d42844ea9429c99833d7204488dddc2b33024860cdfe9302d6d7facd5: sandbox 54aac26d42844ea9429c99833d7204488dddc2b33024860cdfe9302d6d7facd5 not found"
Jun  2 20:17:37 docker dockerd[272]: time="2021-06-02T20:17:37.079523610Z" level=error msg="3155e403f37806f700987fee45d63c8d1733c9d00f8718af4175515dfe5d4c53 cleanup: failed to delete container from containerd: no such container"
Jun  2 20:17:37 docker systemd-networkd[168]: docker0: Lost carrier
Jun  2 20:17:38 docker systemd-networkd[168]: docker0: Gained IPv6LL

Additionally, the container always gets created but won’t start.

Error

$ sudo docker start a59c6dec8af6
Error response from daemon: OCI runtime create failed: container with id exists: a59c6dec8af62ae58bd19e87abb82f9b0153544134424d19cf2b0f9e223359cd: unknown
Error: failed to start containers: a59c6dec8af6

Logs

Jun  2 20:22:22 docker systemd-networkd[168]: vethf5652cf: Link UP
Jun  2 20:22:22 docker networkd-dispatcher[200]: WARNING:Unknown index 11 seen, reloading interface list
Jun  2 20:22:22 docker kernel: [18296.816143] docker0: port 1(vethf5652cf) entered blocking state
Jun  2 20:22:22 docker kernel: [18296.816190] device vethf5652cf entered promiscuous mode
Jun  2 20:22:22 docker kernel: [18296.816262] docker0: port 1(vethf5652cf) entered forwarding state
Jun  2 20:22:22 docker containerd[224]: time="2021-06-02T20:22:22.586690461Z" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/a59c6dec8af62ae58bd19e87abb82f9b0153544134424d19cf2b0f9e223359cd pid=1548
Jun  2 20:22:22 docker containerd[224]: time="2021-06-02T20:22:22.612693720Z" level=info msg="shim disconnected" id=a59c6dec8af62ae58bd19e87abb82f9b0153544134424d19cf2b0f9e223359cd
Jun  2 20:22:22 docker dockerd[272]: time="2021-06-02T20:22:22.613105640Z" level=error msg="stream copy error: reading from a closed fifo"
Jun  2 20:22:22 docker dockerd[272]: time="2021-06-02T20:22:22.613123340Z" level=error msg="stream copy error: reading from a closed fifo"
Jun  2 20:22:22 docker systemd-networkd[168]: vethf5652cf: Link DOWN
Jun  2 20:22:22 docker systemd-networkd[168]: rtnl: received neighbor for link '12' we don't know about, ignoring.
Jun  2 20:22:22 docker systemd-networkd[168]: rtnl: received neighbor for link '12' we don't know about, ignoring.
Jun  2 20:22:22 docker dockerd[272]: time="2021-06-02T20:22:22.745645346Z" level=error msg="a59c6dec8af62ae58bd19e87abb82f9b0153544134424d19cf2b0f9e223359cd cleanup: failed to delete container from containerd: no such container"
Jun  2 20:22:22 docker dockerd[272]: time="2021-06-02T20:22:22.745675346Z" level=error msg="Handler for POST /v1.41/containers/a59c6dec8af6/start returned error: OCI runtime create failed: container with id exists: a59c6dec8af62ae58bd19e87abb82f9b0153544134424d19cf2b0f9e223359cd: unknown"

Yes, this error message comes from the cilium eBPF library – what’s happening is that bpf(BPF_LOAD_PROG) is blocked so the eBPF program loading fails. The issue is that runc has a hard requirement for devices cgroup restrictions since it’s a fairly core security feature (we do the same for cgroupv1).

On paper we could loosen the requirement under user namespaces but I would need to think about it a little bit more.

@brauner any idea why our ebpf interception didn’t help?

The interception must’ve worked since runc isn’t complaining about devices cgroup in the second case anymore after you’ve enabled bpf interception. So not sure.