Hi, I have been fighting for a while with my containers that use NVIDIA CUDA.
These containers refuse to start if nvidia.runtime=true is configured.
As a workaround, I have to run nvidia-smi on the host and then start the container.
The host runs Debian 12.1.
Incus is at 6.2; I had the same issue with Incus 6.0.
The NVIDIA driver is at 545.29.02 and CUDA at 12.3.
Any ideas on the subject?
simos
(Simos Xenitellis)
June 9, 2024, 4:47am
2
With this option, Incus tries to mount the NVIDIA container runtime into your container.
Apparently it fails, and there should be a relevant error message somewhere.
This method of installing the NVIDIA container runtime is a convenience and helps you avoid installing the NVIDIA libraries by hand.
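A minimal sketch of how one might surface that error message, assuming the `incus info --show-log` option is available in your Incus version. The function name `start_or_show_log`, the `INCUS` override variable, and the container name `c1` are made up for illustration:

```shell
# Command used to talk to Incus; overridable so the sketch can be dry-run.
INCUS="${INCUS:-incus}"

# Try to start the container; if that fails, print its log to stderr.
start_or_show_log() {
    name="$1"
    if ! "$INCUS" start "$name"; then
        "$INCUS" info "$name" --show-log >&2
        return 1
    fi
}

# Example (hypothetical container name):
#   start_or_show_log c1
```

Setting `INCUS=echo` before calling the function only prints the command that would be run, which can be handy for checking the arguments first.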
Thanks, Simos, for the info.
I found out that the nvidia_uvm module was not loaded after a reboot.
So I created an /etc/rc.local file, executed at startup, with nvidia-smi in it.
Enable the rc.local service ( systemctl enable rc.local )
Start the rc.local service ( systemctl start rc.local ; systemctl status rc.local )
Verify the log file in /var/log/nvidia_boot
Reboot, and voilà, the Incus containers all start normally.
Here is the small script ( /etc/rc.local ):
#!/bin/sh -e
# Put auto-start commands here
date -R >> /var/log/sys-start.log
# Make sure nvidia_uvm is loaded and activated at boot.
# Log the modules matching "nv" before and after running nvidia-smi.
# "|| true" keeps "sh -e" from aborting the script if grep finds nothing.
lsmod | grep nv > /var/log/nvidia_boot || true
nvidia-smi >> /var/log/nvidia_boot
lsmod | grep nv >> /var/log/nvidia_boot || true
exit 0
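Since `grep nv` also matches unrelated modules (znvpair shows up in the log further down), an exact-name check may be easier to read. This is only a sketch, assuming the usual /proc/modules layout with the module name in the first column; the helper name `module_loaded` is hypothetical:

```shell
# Return 0 if the exact module name appears in the first column of the
# module list (default: /proc/modules), and 1 otherwise.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Example use inside rc.local:
#   module_loaded nvidia_uvm || echo "nvidia_uvm is not loaded" >> /var/log/nvidia_boot
```

The optional second argument lets the check run against a saved copy of the module list instead of the live one.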
Hope that can help someone else
ashni
(Ashley)
June 10, 2024, 9:47am
4
It sounds like your NVIDIA CUDA containers won’t start unless you manually run nvidia-smi on the host first. This is likely because the nvidia_uvm module isn’t loading automatically at boot. Here’s how to fix it:
Load the nvidia_uvm Module:
sudo modprobe nvidia_uvm
Ensure the Module Loads at Boot:
Add nvidia_uvm to the /etc/modules file:
echo "nvidia_uvm" | sudo tee -a /etc/modules
Create and Configure /etc/rc.local:
Make sure /etc/rc.local is executable:
sudo chmod +x /etc/rc.local
Add the following to /etc/rc.local to run nvidia-smi at startup:
#!/bin/bash
/usr/bin/nvidia-smi
exit 0
Enable and Start rc.local:
sudo systemctl enable rc.local
sudo systemctl start rc.local
Reboot and Verify:
Reboot your system:
sudo reboot
Check logs to ensure there are no errors:
dmesg | grep nvidia
journalctl -u rc.local
Check Container Runtime Configuration:
Make sure nvidia.runtime=true is set in your container configuration.
These steps should ensure the nvidia_uvm module loads at boot, allowing your containers to start properly.
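One detail about the /etc/modules step: `tee -a` appends a new line every time it is run, so repeating the step leaves duplicate entries. A small sketch of an append that only happens once, where the helper name `ensure_line` is hypothetical and the function has to run with enough privileges to write the target file:

```shell
# Append a line to a file only if that exact line is not already present.
ensure_line() {
    line="$1"
    file="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example, from a root shell:
#   ensure_line nvidia_uvm /etc/modules
```

`grep -x` matches whole lines only and `-F` treats the text literally, so an entry such as nvidia_uvm_extra would not be mistaken for nvidia_uvm.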
stgraber
(Stéphane Graber)
June 11, 2024, 3:08am
5
If that’s the issue, setting linux.kernel_modules=nvidia_uvm on the container should ensure the kernel module is loaded prior to the container starting up.
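As a rough sketch, that option could be applied with `incus config set`. The container name `c1`, the wrapper name `set_kernel_modules`, and the `INCUS` override variable are assumptions made for illustration:

```shell
# Command used to talk to Incus; overridable so the sketch can be dry-run.
INCUS="${INCUS:-incus}"

# Ask Incus to load the given kernel module(s) before the container starts.
set_kernel_modules() {
    "$INCUS" config set "$1" linux.kernel_modules="$2"
}

# Example (hypothetical container name):
#   set_kernel_modules c1 nvidia_uvm
```

The setting is presumably picked up the next time the container is started, so a restart may be needed after changing it.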
I tested this today, and it still does not work without the nvidia-smi command in rc.local.
stgraber
(Stéphane Graber)
June 11, 2024, 7:11pm
7
I wonder what else nvidia-smi is doing when first run then, maybe it’s loading some other kernel modules?
As you can see, no additional module is loaded, apart from nvidia_uvm!
I will retry with all modules in the log file at the next boot, in 2 minutes.
hostplex:~$ cat /var/log/nvidia_boot
nvidia_drm 102400 0
nvidia_modeset 1335296 1 nvidia_drm
nvidia 56221696 1 nvidia_modeset
znvpair 118784 2 zfs,zcommon
video 65536 2 i915,nvidia_modeset
spl 122880 6 zfs,icp,zzstd,znvpair,zcommon,zavl
drm_kms_helper 208896 3 drm_display_helper,nvidia_drm,i915
drm 614400 8 drm_kms_helper,drm_display_helper,nvidia,drm_buddy,nvidia_drm,i915,ttm
Tue Jun 11 15:17:35 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02 Driver Version: 545.29.02 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1050 Ti Off | 00000000:01:00.0 Off | N/A |
| 41% 36C P0 N/A / 75W | 0MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
nvidia_uvm 1536000 0
nvidia_drm 102400 0
nvidia_modeset 1335296 1 nvidia_drm
nvidia 56221696 2 nvidia_uvm,nvidia_modeset
znvpair 118784 2 zfs,zcommon
video 65536 2 i915,nvidia_modeset
spl 122880 6 zfs,icp,zzstd,znvpair,zcommon,zavl
drm_kms_helper 208896 3 drm_display_helper,nvidia_drm,i915
drm 614400 8 drm_kms_helper,drm_display_helper,nvidia,drm_buddy,nvidia_drm,i915,ttm
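The before/after comparison in the log above can also be done mechanically. The following is only a sketch: the helper name `new_entries` and the /tmp file names are hypothetical. It prints the lines that appear in the second snapshot but not in the first:

```shell
# Print lines present in the "after" snapshot but missing from the "before" one.
new_entries() {
    before_sorted=$(mktemp)
    after_sorted=$(mktemp)
    sort -u "$1" > "$before_sorted"
    sort -u "$2" > "$after_sorted"
    # comm -13 keeps only the lines unique to the second (after) file.
    comm -13 "$before_sorted" "$after_sorted"
    rm -f "$before_sorted" "$after_sorted"
}

# Example around the first nvidia-smi run after boot:
#   cut -d' ' -f1 /proc/modules > /tmp/mods.before
#   nvidia-smi > /dev/null
#   cut -d' ' -f1 /proc/modules > /tmp/mods.after
#   new_entries /tmp/mods.before /tmp/mods.after
```

The same helper could be pointed at two listings of /dev taken before and after, in case the difference lies in device nodes rather than in modules; that is only a guess and not something the thread confirms.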