Installed the NVIDIA KVM driver on the host machine: how to use CUDA in a MIG instance?

Hi,

I installed the NVIDIA KVM driver on the host machine and am using the MIG and mdev features at the same time.
However, after using MIG to pass the graphics card into the container, the CUDA version shows as N/A, so CUDA apparently needs to be installed separately. And because nvidia.runtime is set to true, the GRID driver cannot be installed separately in the container.
My questions:

  1. When using MIG to pass the graphics card into the container, must nvidia.runtime be set to true?
  2. How can I install the CUDA driver in the container? Is there any way to meet my needs without installing CUDA separately?

Host graphics card status:

Container graphics card status:

Thank you very much!

Hi!

nvidia.runtime=true is a convenience that gets the NVIDIA runtime into your container. If you do not use it, you need to install the NVIDIA driver yourself in the container, which means it will probably pull in other things that are not needed.
Therefore, the usual practice is to try to make do with nvidia.runtime=true, and if that really cannot work, you end up installing the driver in the container yourself.

Having said that, adding the gpu device (of any gputype) to the container is a prerequisite for using the GPU in the container, and it does not require nvidia.runtime=true to be set.
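For reference, a minimal sketch of how that looks with the Incus CLI, assuming a container named c1; the device name gpu0, the MIG UUID, the PCI address and the gi/ci numbers are placeholders you would replace with values from `nvidia-smi -L` and your own setup:

```
# Pass a specific MIG instance into the container as a gpu device:
incus config device add c1 gpu0 gpu gputype=mig mig.uuid=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

# Or select it by GPU instance / compute instance IDs on a given card:
# incus config device add c1 gpu0 gpu gputype=mig pci=0000:65:00.0 mig.gi=1 mig.ci=0

# nvidia.runtime is a separate setting and is not required for the device itself:
incus config set c1 nvidia.runtime=true
```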

I do not understand the other NVIDIA details you are mentioning (GRID driver).

I do not know which minimal distro packages can effectively replace nvidia.runtime=true. It should be easy to figure them out with trial and error.
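One possible starting point, assuming libnvidia-container (nvidia-container-cli) is installed on the host: ask it what the runtime would map into the container, then look for the matching distro packages.

```
# On the host: list the binaries and libraries nvidia.runtime=true would inject
nvidia-container-cli list
# Typical entries include nvidia-smi, libcuda.so.*, libnvidia-ml.so.*, ...
```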

Hello,

As far as I know, and have experienced myself, nvidia.runtime=true has some limitations, in particular with default container settings. There are quite a few other posts around this topic.

For all my instances I added the gpu as a device and installed the driver into the container to get it all working smoothly. With a little trial and error you will find the minimum required packages for your application; an example is sketched below. Importantly, install exactly the same driver version as on your host, otherwise you will run into some weird situations.
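As an illustration only (an Ubuntu host and container are assumed; the 550 branch and the package names are examples, match whatever your host actually reports):

```
# On the host, check the driver version to match:
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# In the container, install only the matching userspace bits (no kernel modules):
apt-get update
apt-get install -y --no-install-recommends libnvidia-compute-550 nvidia-utils-550
```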

There is a small catch with the latest NVIDIA drivers and CUDA, starting with version 545 (I think): you need to run a small program on the host, something like the bandwidthTest sample, before Incus starts the instances. Otherwise there will be no NVIDIA support in the container. I haven't tried newer versions to check whether this issue is still present.
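For reference, the kind of one-off step meant here, assuming the CUDA demo suite is installed on the host (the path is an assumption; any small program that creates a CUDA context should do):

```
# Run once on the host before the instances are started:
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=0 > /dev/null
# deviceQuery from the same directory works as well.
```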

Hope it helps

THIS! It is happening in 550 too.

This fixed it for me: Docker into Incus and NVIDIA - #4 by friki67

Thank you very much!

So this is some dummy program that runs on the host once, and that somehow activates the otherwise unused GPU so that it then becomes available in the container?

There should be some background to this: either it affects all users of the NVIDIA container runtime, or it is something very specific to LXC. I would look closely into the NVIDIA container runtime support to figure out what is happening.

As mentioned, for some reason the latest NVIDIA drivers behave differently during system boot when initializing GPU support. I stumbled across this while researching why my card didn't fully work in Incus after a system reboot. Some people mentioned their X server didn't come up, or similar. Not sure if NVIDIA will fix it anytime soon. I don't have that link handy, but you will find it.

There are quite a few topics around this issue and so far I haven't found the golden solution that works out of the box. For example, Container with nvidia.runtime=true refuse to start after reboot of the host is more or less doing the same thing I mentioned ("run a small program to initialize"). I haven't tried the solution mentioned in that topic, nor whether nvidia.runtime=true will then work. In my case I ran into some permission issues following this path during testing, so I decided to install the drivers locally. I may revisit it again if time permits…
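If it helps, one way to automate that "small program on the host" step after a reboot is a one-shot systemd unit ordered before the Incus daemon. This is only a sketch; the unit name, the CUDA sample path and the incus.service dependency are assumptions:

```
# On the host, as root:
cat <<'EOF' > /etc/systemd/system/nvidia-touch-gpu.service
[Unit]
Description=Initialize the NVIDIA GPU before Incus starts instances
Before=incus.service

[Service]
Type=oneshot
ExecStart=/usr/local/cuda/extras/demo_suite/deviceQuery

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable nvidia-touch-gpu.service
```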

For now there is a solution / workaround that keeps my container applications happy.
