Containers with nvidia.runtime=true refuse to start after a reboot of the host

Hi, I have been fighting with my containers that use NVIDIA CUDA for a while.

These containers will refuse to start if nvidia.runtime=true is configured.

As a workaround, I have to run nvidia-smi on the host and then start the container.

The host is on Debian 12.1.

Incus is at 6.2; I had the same issue with Incus 6.0.

The NVIDIA driver is at 545.29.02 and CUDA at 12.3.

Any idea on the subject?

With this option, Incus tries to mount the NVIDIA container runtime into your container.
Apparently it fails, and there should be a relevant error message somewhere.

This method of installing the NVIDIA container runtime is a convenience and helps you avoid installing the NVIDIA libraries by hand.
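For reference, the option is set per container; a minimal sketch, assuming a container named c1:

incus config set c1 nvidia.runtime=true
incus config show c1 | grep nvidia    # confirm the option is set

The container then needs a restart for the change to take effect.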

Thanks Simos for the info.

I found out that the nvidia_uvm module was not loaded upon reboot.
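To check by hand on the host, something like:

lsmod | grep nvidia_uvm || echo "nvidia_uvm not loaded"   # check whether the module is present
sudo modprobe nvidia_uvm                                  # load it manually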

So I created an /etc/rc.local file that gets executed at startup, with nvidia-smi in it.

Enable the rc.local service (systemctl enable rc.local).

Start the rc.local service (systemctl start rc.local; systemctl status rc.local).

Verify the log file in /var/log/nvidia_boot.

Reboot, and voilà, the Incus containers all start normally.

Here is the small script (/etc/rc.local):

#!/bin/sh -e

# Auto-start commands go here; log each boot
date -R >> /var/log/sys-start.log

# Make sure nvidia_uvm is loaded and activated at boot
lsmod | grep nv > /var/log/nvidia_boot    # modules before
nvidia-smi >> /var/log/nvidia_boot        # running nvidia-smi loads nvidia_uvm
lsmod | grep nv >> /var/log/nvidia_boot   # modules after
exit 0

Hope this can help someone else.


It sounds like your NVIDIA CUDA containers won’t start unless you manually run nvidia-smi on the host first. This is likely because the nvidia_uvm module isn’t loading automatically at boot. Here’s how to fix it:

  1. Load the nvidia_uvm Module:

    sudo modprobe nvidia_uvm
    
  2. Ensure the Module Loads at Boot:
    Add nvidia_uvm to the /etc/modules file:

    echo "nvidia_uvm" | sudo tee -a /etc/modules
    
  3. Create and Configure /etc/rc.local:
    Make sure /etc/rc.local is executable:

    sudo chmod +x /etc/rc.local
    

    Add the following to /etc/rc.local to run nvidia-smi at startup:

    #!/bin/bash
    /usr/bin/nvidia-smi
    exit 0
    
  4. Enable and Start rc.local:

    sudo systemctl enable rc.local
    sudo systemctl start rc.local
    
  5. Reboot and Verify:
    Reboot your system:

    sudo reboot
    

    Check logs to ensure there are no errors:

    dmesg | grep nvidia
    journalctl -u rc.local
    
  6. Check Container Runtime Configuration:
    Make sure nvidia.runtime=true is set in your container configuration.

These steps should ensure the nvidia_uvm module loads at boot, allowing your containers to start properly.
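
After the reboot, a quick end-to-end check could look like this (a sketch; c1 stands in for your container's name):

    lsmod | grep nvidia_uvm       # the module should now be loaded
    incus start c1
    incus exec c1 -- nvidia-smi   # the GPU should be visible inside the container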


If that’s the issue, setting linux.kernel_modules=nvidia_uvm on the container should ensure the kernel module is loaded prior to the container starting up.
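For example, with a hypothetical container named c1:

incus config set c1 linux.kernel_modules=nvidia_uvm

Incus will then load the listed modules before starting the container.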


I did test this today and it still does not work without the nvidia-smi command in rc.local. 🙁

I wonder what else nvidia-smi is doing when first run, then. Maybe it’s loading some other kernel modules?

As you can see, no additional modules loaded!

I will retry with all modules in the log file at the next boot, in 2 minutes.
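Something like this in rc.local would capture the complete before/after module list (a sketch; the log paths are arbitrary):

lsmod | sort > /var/log/modules_before    # full module list before
nvidia-smi > /dev/null
lsmod | sort > /var/log/modules_after     # full module list after
diff /var/log/modules_before /var/log/modules_after || true   # diff exits non-zero on changes; don't abort under sh -e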

hostplex:~$ cat /var/log/nvidia_boot
nvidia_drm            102400  0
nvidia_modeset       1335296  1 nvidia_drm
nvidia              56221696  1 nvidia_modeset
znvpair               118784  2 zfs,zcommon
video                  65536  2 i915,nvidia_modeset
spl                   122880  6 zfs,icp,zzstd,znvpair,zcommon,zavl
drm_kms_helper        208896  3 drm_display_helper,nvidia_drm,i915
drm                   614400  8 drm_kms_helper,drm_display_helper,nvidia,drm_buddy,nvidia_drm,i915,ttm
Tue Jun 11 15:17:35 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02              Driver Version: 545.29.02    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050 Ti     Off | 00000000:01:00.0 Off |                  N/A |
| 41%   36C    P0              N/A /  75W |      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
nvidia_uvm           1536000  0
nvidia_drm            102400  0
nvidia_modeset       1335296  1 nvidia_drm
nvidia              56221696  2 nvidia_uvm,nvidia_modeset
znvpair               118784  2 zfs,zcommon
video                  65536  2 i915,nvidia_modeset
spl                   122880  6 zfs,icp,zzstd,znvpair,zcommon,zavl
drm_kms_helper        208896  3 drm_display_helper,nvidia_drm,i915
drm                   614400  8 drm_kms_helper,drm_display_helper,nvidia,drm_buddy,nvidia_drm,i915,ttm