NVidia CUDA inside a LXD container Ubuntu 20.04

I searched here and on Stack Overflow but have not gotten past this yet. I am running Ubuntu 20.04 LTS with CUDA 11.2 installed successfully on my host, but when I followed the instructions to get CUDA into the LXC container, it fails with this message:

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

I tried suggestions like rebooting the LXC container, which does nothing. I am now going to find out whether there is something else I need to do, like installing the NVIDIA drivers before CUDA within the LXC image, but if others have been here before, please chime in with suggestions. Thanks.

Just to add on here: I wanted to verify my version of CUDA on the host, only to be more confused, because if I follow the instructions and do apt install nvidia-cuda-toolkit, I get a different version from nvcc than what nvidia-smi reports:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
$ nvidia-smi 
Sat May 29 14:10:02 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:09:00.0  On |                  N/A |
|  0%   38C    P8    18W / 370W |    503MiB / 24234MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1103      G   /usr/lib/xorg/Xorg                102MiB |
|    0   N/A  N/A      6180      G   /usr/lib/xorg/Xorg                215MiB |
|    0   N/A  N/A     11489      G   /usr/bin/gnome-shell              146MiB |
|    0   N/A  N/A    121372      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A    140809      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A    141326      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A    654171      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A    691384      G   /usr/lib/firefox/firefox            4MiB |
+-----------------------------------------------------------------------------+

I am beginning to wonder if I should just ignore those instructions and install CUDA via Anaconda or something, but I would like to understand what is going on so I can keep installations in sync across my LXD containers.

Okay, I solved this, but let me leave this up here because it is kind of tricky.

Just because your host's nvidia-smi shows a CUDA version does not mean that version is installed; it only shows the highest CUDA version the driver supports. I had simply forgotten to install CUDA on the host and had skipped straight to the LXD container. I started over with the host, rebooted after installing CUDA 11.3 Update 1, repeated the installation in the LXD image, and voilà, it is working.
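A quick way to check which toolkit, if any, is actually installed on the host, as opposed to what the driver supports (the paths below assume the default NVIDIA installer locations):

$ ls -d /usr/local/cuda*                  # toolkit installs land here by default
$ /usr/local/cuda/bin/nvcc --version      # version of the toolkit that is actually installed
$ nvidia-smi | grep "CUDA Version"        # highest CUDA version the installed driver supports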

The tutorial at NVidia CUDA inside a LXD container | Ubuntu is old and has been superseded. NVidia provides a container runtime with the user-space libraries for their cards, and LXD can insert that runtime into the container automatically for you. That way, you do not need to manually install NVidia packages in your containers to match the version of the host's NVidia kernel driver.

You would add the following to your LXD profile to enable the NVidia runtime in your container. The all value is very permissive; if you only want to expose the compute capability to the container, replace all with compute.

  nvidia.driver.capabilities: all
  nvidia.runtime: "true"
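For reference, here is one way to apply those keys from the command line with a separate profile; the profile and container names below are just examples:

$ lxc profile create nvidia
$ lxc profile set nvidia nvidia.runtime true
$ lxc profile set nvidia nvidia.driver.capabilities all
$ lxc launch ubuntu:20.04 mycontainer --profile default --profile nvidia
$ lxc exec mycontainer -- nvidia-smi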

It might help you to have a look at Running X11 software in LXD containers – Mi blog lah!, which shows how to run GUI apps in an LXD container. It uses nvidia.runtime: "true" and puts that configuration in a separate LXD profile, so you can selectively launch containers that have access to the NVidia GPU.


I am coming back around to test things and wondered how this will work with Juju, my favored orchestration tool for LXD/LXC.

I'm fiddling again and not seeing where to set or edit LXC/LXD profiles. I looked up profiles and all I have found so far is this:

https://linuxcontainers.org/lxd/docs/master/profiles

It doesn't seem to show how to list profiles or how to edit one (I'm guessing it's a file).

I'll keep looking, but meanwhile I will try manually installing CUDA within the container, because we're using Juju and I can simply roll the installer into the install hook.
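For reference, profiles are not files you edit directly; they are managed through the lxc client. On a standard LXD install, commands like these list and edit them:

$ lxc profile list                              # show all profiles
$ lxc profile show default                      # print a profile's configuration as YAML
$ lxc profile edit default                      # open the profile in your editor
$ lxc profile set default nvidia.runtime true   # set a single key without opening the editor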