It sounds like your NVIDIA CUDA containers won't start unless you manually run `nvidia-smi` on the host first. This is likely because the `nvidia_uvm` module isn't loading automatically at boot. Here's how to fix it:
- **Load the `nvidia_uvm` module:**

  ```bash
  sudo modprobe nvidia_uvm
  ```
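  To confirm the module is actually loaded after this step, a quick check with standard tooling:

  ```bash
  lsmod | grep nvidia_uvm
  ```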
- **Ensure the module loads at boot:**
  Add `nvidia_uvm` to the `/etc/modules` file:

  ```bash
  echo "nvidia_uvm" | sudo tee -a /etc/modules
  ```
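  On systemd-based distributions (an assumption about your setup), a drop-in file under `/etc/modules-load.d/` works as an alternative; the file name here is just illustrative:

  ```bash
  # systemd-modules-load reads one module name per line from this directory
  echo "nvidia_uvm" | sudo tee /etc/modules-load.d/nvidia-uvm.conf
  ```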
- **Create and configure `/etc/rc.local`:**
  Add the following to `/etc/rc.local` so that `nvidia-smi` runs at startup:

  ```bash
  #!/bin/bash
  /usr/bin/nvidia-smi
  exit 0
  ```

  Then make sure `/etc/rc.local` is executable:

  ```bash
  sudo chmod +x /etc/rc.local
  ```
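  You can test the script by hand before relying on it at boot; this simply runs `nvidia-smi` once and reports the exit status:

  ```bash
  sudo /etc/rc.local
  echo "exit status: $?"
  ```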
- **Enable and start `rc.local`:**

  ```bash
  sudo systemctl enable rc.local
  sudo systemctl start rc.local
  ```
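  On some distributions `systemctl enable rc.local` fails because no `rc-local` unit with an `[Install]` section exists. If that happens on your system (an assumption), a compatibility unit along these lines is one way to provide it:

  ```bash
  # Write a systemd unit that runs /etc/rc.local at boot
  sudo tee /etc/systemd/system/rc-local.service > /dev/null <<'EOF'
  [Unit]
  Description=/etc/rc.local Compatibility
  ConditionPathExists=/etc/rc.local

  [Service]
  Type=forking
  ExecStart=/etc/rc.local start
  TimeoutSec=0
  RemainAfterExit=yes

  [Install]
  WantedBy=multi-user.target
  EOF
  sudo systemctl daemon-reload
  sudo systemctl enable --now rc-local.service
  ```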
- **Reboot and verify:**
  Reboot your system:

  ```bash
  sudo reboot
  ```

  Then check the logs to ensure there are no errors:

  ```bash
  dmesg | grep nvidia
  journalctl -u rc.local
  ```
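  If everything worked, the device nodes that `nvidia-smi` creates should now be present without any manual step (assuming a standard driver install; the exact node names can vary):

  ```bash
  ls -l /dev/nvidia*
  ```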
- **Check the container runtime configuration:**
  Make sure `nvidia.runtime=true` is set in your container's configuration.
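  If you are using LXD or Incus (an assumption; `mycontainer` is just a placeholder name), the option can be set and applied like this:

  ```bash
  lxc config set mycontainer nvidia.runtime true
  lxc restart mycontainer
  ```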
These steps should ensure the `nvidia_uvm` module loads at boot, allowing your containers to start properly.