I searched here and on Stack Overflow but have not gotten past this yet. I am running Ubuntu 20.04 LTS with CUDA 11.2 installed successfully on my host, but when I followed the instructions to get CUDA into an LXC container, it fails to launch with this message:
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
I tried suggestions like rebooting the LXC container, which does nothing. I am now going to look into whether there is something else I need to do, like installing the NVIDIA drivers before CUDA within the LXC image. If others have been here before, please chime in with suggestions. Thanks.
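For reference, this is roughly how I have been comparing the kernel-side and user-space driver versions, since a disagreement between the two is what produces that NVML error. The check_driver_match helper is just my own illustration, not a standard tool:

```shell
# Version of the NVIDIA kernel module currently loaded (this path only
# exists while the module is loaded):
cat /proc/driver/nvidia/version 2>/dev/null || echo "nvidia module not loaded"

# Version the user-space driver reports (requires NVML to initialize):
nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null \
    || echo "nvidia-smi unavailable"

# Small helper of my own to compare the two version strings:
check_driver_match() {
    if [ "$1" = "$2" ]; then
        echo "MATCH"
    else
        echo "MISMATCH"
    fi
}
```

If those two versions disagree, that is exactly the "Driver/library version mismatch" situation.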
Just to add on here: I wanted to verify the version of CUDA on the host, only to be more confused, because if I follow the instructions and do apt install nvidia-cuda-toolkit, I get a different version than what nvidia-smi reports:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
$ nvidia-smi
Sat May 29 14:10:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:09:00.0  On |                  N/A |
|  0%   38C    P8    18W / 370W |    503MiB / 24234MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1103      G   /usr/lib/xorg/Xorg                102MiB |
|    0   N/A  N/A      6180      G   /usr/lib/xorg/Xorg                215MiB |
|    0   N/A  N/A     11489      G   /usr/bin/gnome-shell              146MiB |
|    0   N/A  N/A    121372      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A    140809      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A    141326      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A    654171      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A    691384      G   /usr/lib/firefox/firefox            4MiB |
+-----------------------------------------------------------------------------+
I am beginning to wonder if I should just ignore those instructions and install CUDA via Anaconda or something, but I'd like to understand what is going on so I can sync installations properly across my LXD containers.
Okay, I solved this, but let me leave this up here because it is kind of tricky.
Just because your host's nvidia-smi shows a CUDA version, that is not what is installed per se, only what is supported. I had simply forgotten to install CUDA on the host and had skipped straight into the LXD container. I started over on the host, rebooted after installing CUDA 11.3u1, redid the install in the LXD image, and voila, it's working.
The tutorial at NVidia CUDA inside a LXD container | Ubuntu is old and has been superseded. NVIDIA now provides a runtime with the user-space libraries for their cards, and LXD can insert that runtime into the container automatically for you. That way you do not need to manually install NVIDIA packages in your containers to match the version of the host's NVIDIA kernel driver.
You would add the following to your LXD profile to enable the NVIDIA runtime in your container. The value all is very permissive; if you just want to expose compute capability to the container, replace all with compute.
nvidia.driver.capabilities: all
nvidia.runtime: "true"
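If your container already exists, the same keys can also be set from the command line rather than by editing a profile; a sketch (the container name c1 is just an example):

```shell
# Apply the NVIDIA runtime directly to one existing container
# (c1 is an example name, substitute your own):
lxc config set c1 nvidia.runtime true
lxc config set c1 nvidia.driver.capabilities all

# Restart so LXD inserts the runtime into the container:
lxc restart c1
```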
It might help to have a look at Running X11 software in LXD containers – Mi blog lah!, which shows how to run GUI apps in a LXD container. It uses nvidia.runtime: "true" and puts that configuration in a separate LXD profile. By doing so, you can selectively launch containers that have access to the NVIDIA GPU.
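A minimal sketch of that separate-profile approach (the profile name gpu and container name c1 are just examples):

```shell
# Create a dedicated profile carrying only the NVIDIA runtime settings:
lxc profile create gpu
lxc profile set gpu nvidia.runtime true
lxc profile set gpu nvidia.driver.capabilities all

# Inspect the result:
lxc profile show gpu

# Launch a container with the default profile plus the gpu profile;
# containers launched without "--profile gpu" will not see the GPU:
lxc launch ubuntu:20.04 c1 --profile default --profile gpu
```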
It doesn't seem to show how to list profiles or how to edit one (I'm guessing it's a file).
I'll keep looking, but meanwhile I will be attempting to manually install CUDA within the container, because we're using Juju and I can simply roll the installer into the hook/install.
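In case it is useful to anyone, the rough shape of that install hook would be something like the sketch below. The package name cuda-toolkit-11-3 and the assumption that the NVIDIA apt repository is already configured are specific to our setup, so adjust both for your deployment:

```shell
#!/bin/sh
# Hedged sketch of a classic Juju charm hooks/install script that
# installs the CUDA toolkit inside the container. Assumes the NVIDIA
# apt repository has already been added, and that cuda-toolkit-11-3
# is the package we want.
set -e

# status-set is a Juju hook tool; "|| true" keeps this runnable when
# testing the script outside a hook context.
status-set maintenance "Installing CUDA toolkit" || true

apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y cuda-toolkit-11-3

# Sanity check: the compiler should now report the toolkit version
# (the /usr/local/cuda symlink is how it lands on our machines).
/usr/local/cuda/bin/nvcc --version

status-set active "CUDA toolkit installed" || true
```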