Incus, docker and NVIDIA GPU

Hello. I’m in the process of migrating from LXD to Incus. I’ve already posted this in the LXD forum, but since I’m migrating, I think I can ask here too.

This is an old question. I’ve been looking around there and in the Linux Containers forum, and the problem was still present as of last July.

The thing is, I cannot run a Docker container that uses the GPU inside an LXC container. The GPU is accessible from the container: you can run nvidia-smi and get the correct response, and if you install the application directly it works and uses the GPU. But if you try to launch its Docker container with the GPU resource, it simply won’t start. See
https://discuss.linuxcontainers.org/t/how-to-build-nvidia-docker-inside-lxd-lxc-container/17582/10
https://discuss.linuxcontainers.org/t/gpu-in-a-docker-instance/15085
https://discuss.linuxcontainers.org/t/docker-with-gpu-support-in-unprivileged-lxd-container/5783
and https://discuss.linuxcontainers.org/search?q=docker%20gpu
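
For example, something as basic as this won’t start inside the container (the CUDA image tag is just an example I use for testing; any GPU-enabled image behaves the same way):

  docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi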

I understand that it would be fixed if I could use a privileged LXC container with the NVIDIA capabilities enabled, but that does not seem to be possible.

Right now I’m using the LXD 5.15 snap, an Ubuntu 22.04 container, and the latest Docker package installed in the container.

Is there any chance of getting a Docker container to use the GPU inside an LXC container?

I do not know whether the LXD snap package adds some extra level of indirection.

I suggest trying with Incus and documenting the steps you take (I mean, post them here).
Show the steps like this person did here: GPU in a docker instance

What you are facing is not that the GPU is inaccessible from Docker inside a container, but rather that this specific NVIDIA Docker application container has certain, perhaps unrelated, requirements that are not met. Hence, that specific container does not start.

I had a similar issue when I was trying to run Telegram in a GUI container. The application wanted access to the console, otherwise it would crash.


So the weird thing here is that NVIDIA contributed both the LXC and Docker integrations.

But they made it so that the LXC integration only works with unprivileged containers whereas the Docker integration only works with privileged containers, so that’s how we end up with this weird mess.

The only workaround I’m aware of is to not use nvidia.runtime on the Incus side but instead go through the annoying process of installing all the NVIDIA packages directly in the Incus container, at which point, that container can be privileged and the Docker support should work as expected.


Thank you @simos and @stgraber. I’m waiting for the Fedora package compatible with EL9.

So, to make it work, I have to

incus config device add u31 gpu gpu id=0

And not set (unset in my case)

  nvidia.driver.capabilities: all
  nvidia.runtime: "true"
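
If I’m reading the docs right, that unsetting would be something like this (u31 is just my container’s name):

  incus config unset u31 nvidia.driver.capabilities
  incus config unset u31 nvidia.runtime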

And then install the NVIDIA driver and the NVIDIA container runtime inside the container.
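
For that last step, what I have in mind on the Ubuntu 22.04 container is roughly the following (the 535 version is just an example and must match the driver loaded on the host, since the container shares the host kernel; the container toolkit repository has to be added first, per NVIDIA’s instructions):

  # user-space driver bits matching the host driver version
  apt install nvidia-utils-535 libnvidia-compute-535
  # Docker integration for the GPU
  apt install nvidia-container-toolkit
  nvidia-ctk runtime configure --runtime=docker
  systemctl restart docker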

Is this correct?

Yep, that’s right

Using model: xtts
Traceback (most recent call last):
  File "/usr/local/bin/tts", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/TTS/bin/synthesize.py", line 441, in main
    ).to(device)
      ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
           ^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

Right, so that’s why I couldn’t get Torch to run.
Leaving the error trace here for easier indexing.
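
As a quick sanity check from inside the Docker container (plain PyTorch, nothing specific to the TTS tool):

  python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"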

EDIT: So yeah, the actual error is as follows; I was just running into another issue due to incorrect Docker documentation - Not specifying `count` parameter for enabling GPU access breaks the feature, while docs claim it is not necessary · Issue #20588 · docker/docs · GitHub

Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown
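
For anyone else tripping over the first issue (the missing `count`), a Compose GPU reservation with `count` set explicitly looks roughly like this (service and image names are placeholders):

  services:
    tts:
      image: my-tts-image
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                count: all
                capabilities: [gpu]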