Docker with GPU-support in unprivileged LXD container

klmmr · September 20, 2019, 10:40am

I am trying to use LXD with nested Docker for applications that run computations on the GPU (e.g. deep learning with tensorflow).

On the host system (Ubuntu 18.04 Server) only LXD and the GPU drivers (nvidia-driver-430) are installed. The usual Ubuntu 18.04 image is used for the LXD container with security.nesting true and the GPU was added to the container via lxc config device add docker gpu gpu, the option nvidia.runtime was not set. I installed nvidia-driver-430 in the LXD container and also the nvidia-container-toolkit as well as Docker. In the LXD container nvidia-smi and docker run hello-world work fine.

When trying to use the GPUs with a Docker container, I get the following error message:

ubuntu@docker:~$ docker run --gpus all nvidia/cuda:10.1-base nvidia-smi
Unable to find image 'nvidia/cuda:10.1-base' locally
10.1-base: Pulling from nvidia/cuda
35c102085707: Pull complete 
251f5509d51d: Pull complete 
8e829fe70a46: Pull complete 
6001e1789921: Pull complete 
9f0a21d58e5d: Pull complete 
47b91ac70c27: Pull complete 
a0529eb74f28: Pull complete 
Digest: sha256:ed16350d934ba7b4cef27cbd5e3e1cdeae9d89a0aedd3f66972bf4377f76ca09
Status: Downloaded newer image for nvidia/cuda:10.1-base
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: write error: /sys/fs/cgroup/devices/docker/8a1bfba9d762f426f8c59a7b44f969fc4ce3b49d386e5fb3305c3ae3d0cf35fb/devices.allow: operation not permitted\\\\n\\\"\"": unknown.
ERRO[0006] error waiting for container: context canceled

There seem to be some kind of permission problem writing to /sys/fs/cgroup/devices/docker/. I also tried to examine the problem using strace but couldn’t find any clue. I verified that the setup should work in general by using the option security.privileged true for the LXD container. For a privileged LXD container everything works fine and nvidia-smi is executed correctly within the Docker container showing the GPU.

However, for my use case it is not possible to use a privileged LXD container. Has anyone faced the same (or a similar) problem? Do you have an idea how to debug this problem further?

Some background information: I am using a GeForce GTX 1080 Ti (using nvidia-driver-430) on Ubuntu Server 18.04.3, with the following versions of LXD and docker.

LXD version (on host):

$ lxc version
Client version: 3.17
Server version: 3.17

Docker version (in LXD container)

$ docker version
Client: Docker Engine - Community
 Version:           19.03.2
 API version:       1.40
 Go version:        go1.12.8
 Git commit:        6a30dfc
 Built:             Thu Aug 29 05:29:11 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.8
  Git commit:       6a30dfc
  Built:            Thu Aug 29 05:27:45 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8
  GitCommit:        425e105d5a03fabd737a126ad93d62a9eeede87f
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

stgraber · September 20, 2019, 12:55pm

This feels like something that should be reported to nvidia-container upstream. I don’t think this is a case they’ve thought about and clearly not one they test, but it may be something they’d be interested in looking into an fixing.

genesis96839 · November 17, 2020, 2:30pm

Hi. Is there any update to this? I’m myself looking for a similar solution.

mayliszt · June 30, 2022, 6:56pm

Hi @klmmr, I found a solution to the problem, and @stgraber, maybe LXD can take this solution.

The solution is:

After the docker starts, map GPU buses /proc/driver/nvidia/gpus/xxxxxx from the host to the LXD container. The xxxxx pci buses should be those of the GPUs passed through from LXD to docker. The needed pci buses can be identified with nvidia-smi.