GPU in a docker instance

Hello,

I have LXD installed on an Arch Linux host with a container running Ubuntu 22.04. The container is set up to run Docker/Portainer, and I want to use a GPU that is installed on the host.
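For reference, the GPU was passed to the LXD container roughly like this (the container name docker1 and the vendor/product IDs match the config shown further down; the exact commands may have differed slightly):

lxc config device add docker1 gpu gpu vendorid=10de productid=1c30
lxc config set docker1 nvidia.runtime=true
lxc config set docker1 nvidia.driver.capabilities=all
lxc restart docker1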

The GPU shows up in the container,

ubuntu@docker1:~$ nvidia-smi 
Fri Sep  9 04:04:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000        Off  | 00000000:02:00.0 Off |                  N/A |
| 51%   43C    P8     5W /  75W |      2MiB /  5120MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

However, if I try to use it with Docker, it fails,

$ docker run --gpus all nvidia/cuda:11.4.0-devel-ubuntu20.04  nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown.
ERRO[0000] error waiting for container: context canceled 

The NVIDIA packages installed in the container are,

# apt search nvidia|grep installed

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-container-tools/bionic,now 1.10.0-1 amd64 [installed,automatic]
libnvidia-container1/bionic,now 1.10.0-1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,now 1.10.0-1 amd64 [installed]

Is it possible to nest a GPU from host>LXD>docker?

Thanks

Hmm, the error suggests a devices cgroup issue. I wonder if setting security.syscalls.intercept.bpf=true and security.syscalls.intercept.bpf.devices=true would help here.
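Applying that would look something like this (assuming the container is called docker1, as in the output above, and restarting it so the keys take effect):

lxc config set docker1 security.syscalls.intercept.bpf=true
lxc config set docker1 security.syscalls.intercept.bpf.devices=true
lxc restart docker1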

Unfortunately, that did not help. Here is what I defined,

$ lxc config show docker1
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu jammy amd64 (20220906_07:43)
  image.os: Ubuntu
  image.release: jammy
  image.serial: "20220906_07:43"
  image.type: squashfs
  image.variant: default
  nvidia.driver.capabilities: all
  nvidia.runtime: "true"
  security.nesting: "true"
  security.syscalls.intercept.bpf: "true"
  security.syscalls.intercept.bpf.devices: "true"
  security.syscalls.intercept.mknod: "true"
  security.syscalls.intercept.setxattr: "true"
  volatile.base_image: c2ecef7a7e1384d3004d6031086625dd863a2b183ed864cc0346e581567c34b6
  volatile.cloud-init.instance-id: 5cef635c-1e1b-4b97-85fd-c6ea76e949ee
  volatile.eth0.host_name: vethf0952407
  volatile.eth0.hwaddr: 00:16:3e:c7:b6:2a
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: fd0a1eb0-b542-4669-a513-6978c8aecd64
devices:
  dockervolume:
    path: /mnt/docker_volumes
    source: /tank/docker_volumes
    type: disk
  gpu:
    productid: 1c30
    type: gpu
    vendorid: 10de
  mydisk:
    path: /var/lib/docker
    source: /mnt/docker1
    type: disk
  root:
    path: /
    pool: ssdpool4
    type: disk
ephemeral: false
profiles:
- vlan300profile
stateful: false
description: ""

I get this,

$ docker run --gpus all nvidia/cuda:11.4.0-devel-ubuntu20.04  nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown.
ERRO[0000] error waiting for container: context canceled 

There is an old post about this here. It suggests sharing PCI buses, but doesn't explain how to do it.
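One avenue I may try next (not verified yet): libnvidia-container can apparently be told to skip cgroup device management, which looks like exactly what is failing in the error above. Inside the Ubuntu container that would mean editing /etc/nvidia-container-runtime/config.toml along these lines:

# /etc/nvidia-container-runtime/config.toml (inside the LXD container)
[nvidia-container-cli]
# Skip cgroup device rule handling; the NVIDIA devices are already
# exposed to the LXD container via nvidia.runtime and the gpu device.
no-cgroups = true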