Nvidia and Docker in LXD

Hi,

I am tinkering with running Docker containers inside LXD containers (based on Stephane’s excellent video on the LXD channel: https://youtu.be/_fCSSEyiGro).
The “typical” Docker containers work perfectly fine for me, but I wanted to play with more complex ones, such as the CUDA/PyTorch containers from the Nvidia NGC registry (nvcr.io).
One issue is well known: security.privileged=true and nvidia.runtime=true still do not work well together. The workaround of installing the Nvidia drivers inside the LXD container works fine, though.

But I was wondering whether, instead of the heavy-handed security.privileged=true option, some more fine-grained security settings (syscalls, caps, …) would be enough to get those CUDA containers running, similar to Stephane’s instructions for running “non-nvidia” Docker containers inside an unprivileged LXD container with security.syscalls.intercept.mknod=true and security.syscalls.intercept.setxattr=true?
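For reference, the unprivileged setup mentioned above could be applied roughly like this. It is only a sketch: the container name docker-gpu is hypothetical, and security.nesting=true is an assumption on my side (it is normally needed for Docker inside LXD but is not mentioned above). By default the script only prints the lxc commands; set APPLY=1 to actually run them.

```shell
#!/bin/sh
# Hypothetical container name; replace with your own.
CONTAINER="${CONTAINER:-docker-gpu}"

# Print each command; execute it only when APPLY=1 is set.
run() {
    echo "$*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

configure_container() {
    # Drop privileged mode so the container is unprivileged again.
    run lxc config unset "$1" security.privileged
    # Assumption: nesting is required to run Docker inside the LXD container.
    run lxc config set "$1" security.nesting true
    # Syscall interception settings quoted in the post above.
    run lxc config set "$1" security.syscalls.intercept.mknod true
    run lxc config set "$1" security.syscalls.intercept.setxattr true
}

configure_container "$CONTAINER"
```

The container most likely has to be restarted (lxc restart) before the new security settings take effect.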

When I unset security.privileged on my LXD container (the “CUDA” Docker containers work perfectly fine in it while it is privileged), running for example:

docker run --rm --gpus all --ipc=host nvidia/cuda:11.4.1-base-ubuntu20.04 nvidia-smi

gives:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: write error: /sys/fs/cgroup/devices/docker/098ad8bf1fdcf4ab72091864933fbc8b67a8f0b30746681ba6ef4082c23245b9/devices.allow: operation not permitted: unknown.

Or is security.privileged=true the only way to go?

Thanks,

Waldek

That error should really be non-fatal in the case of nested containers. It may be worth filing an issue against nvidia-container to have them relax error handling in this particular case.

Unprivileged containers aren’t allowed to modify devices.allow/devices.deny but that doesn’t mean the device in question isn’t already allowed (as it is in this case).
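One way to check that point from inside the LXD container is to read the devices cgroup list instead of writing to it. The sketch below assumes a cgroup v1 layout with the devices controller mounted at /sys/fs/cgroup/devices, and assumes the NVIDIA character devices use major number 195 (verify with ls -l /dev/nvidia*). The helper device_allowed is made up for illustration and only looks at the major number, not at the rwm access flags.

```shell
#!/bin/sh
# Return 0 if a cgroup v1 devices.list file contains an entry covering the
# given character device major, either explicitly or through a wildcard.
# Entries look like "c 195:* rwm" or "a *:* rwm".
device_allowed() {
    list_file="$1"
    major="$2"
    grep -Eq "^(a|c) (\*|${major}):" "$list_file"
}

# Assumed path (cgroup v1); override with LIST=... if yours differs.
LIST="${LIST:-/sys/fs/cgroup/devices/devices.list}"

if [ -r "$LIST" ]; then
    # 195 is assumed to be the NVIDIA major number on this system.
    if device_allowed "$LIST" 195; then
        echo "major 195 is already allowed"
    else
        echo "major 195 is not listed"
    fi
fi
```

If the major is already listed, the failed write to devices.allow reported by nvidia-container-cli does not by itself mean the GPU device is blocked.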

Issue opened: https://github.com/NVIDIA/nvidia-docker/issues/1546