It sounds like your NVIDIA CUDA containers won’t start unless you manually run `nvidia-smi` on the host first. This is likely because the `nvidia_uvm` kernel module isn’t loaded automatically at boot, so the driver is only initialized once something on the host touches it. Here’s how to fix it:
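You can confirm the diagnosis before changing anything: right after a reboot, and before running `nvidia-smi`, check whether the module is loaded:

```bash
# No output means nvidia_uvm is not currently loaded.
lsmod | grep nvidia_uvm
```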
- **Load the `nvidia_uvm` Module:**

  ```bash
  sudo modprobe nvidia_uvm
  ```
- **Ensure the Module Loads at Boot:**
  Add `nvidia_uvm` to the `/etc/modules` file:

  ```bash
  echo "nvidia_uvm" | sudo tee -a /etc/modules
  ```
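  On systemd-based systems, a drop-in under `/etc/modules-load.d/` is an equivalent way to do this (an alternative to editing `/etc/modules`, not an additional required step; the file name here is arbitrary):

  ```bash
  # systemd-modules-load reads *.conf files from this directory at boot.
  echo "nvidia_uvm" | sudo tee /etc/modules-load.d/nvidia-uvm.conf
  ```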
- **Create and Configure `/etc/rc.local`:**
  Add the following to `/etc/rc.local` so that `nvidia-smi` runs at startup:

  ```bash
  #!/bin/bash
  /usr/bin/nvidia-smi
  exit 0
  ```

  Then make sure `/etc/rc.local` is executable:

  ```bash
  sudo chmod +x /etc/rc.local
  ```
- **Enable and Start `rc.local`:**

  ```bash
  sudo systemctl enable rc.local
  sudo systemctl start rc.local
  ```
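  If `systemctl enable rc.local` reports that no such unit exists (not every distro ships an rc-local compatibility unit), you can create one yourself. The following is a minimal sketch, assuming a systemd-based distro; the unit name and contents may need adjusting for yours:

  ```bash
  # Hypothetical compatibility unit so systemd runs /etc/rc.local at boot.
  sudo tee /etc/systemd/system/rc-local.service > /dev/null <<'EOF'
  [Unit]
  Description=Run /etc/rc.local at boot
  ConditionFileIsExecutable=/etc/rc.local
  After=network.target

  [Service]
  Type=oneshot
  ExecStart=/etc/rc.local
  RemainAfterExit=yes

  [Install]
  WantedBy=multi-user.target
  EOF
  sudo systemctl daemon-reload
  ```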
- **Reboot and Verify:**
  Reboot your system:

  ```bash
  sudo reboot
  ```

  After the reboot, check the logs to ensure there are no errors:

  ```bash
  dmesg | grep nvidia
  journalctl -u rc.local
  ```
- **Check Container Runtime Configuration:**
  Make sure `nvidia.runtime=true` is set in your container configuration.
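  For example, if these are LXD containers (an assumption; other runtimes expose the NVIDIA integration differently), the option is set per container:

  ```bash
  # "cuda-ct" is a hypothetical container name; substitute your own.
  lxc config set cuda-ct nvidia.runtime true
  lxc restart cuda-ct
  ```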
These steps should ensure the `nvidia_uvm` module loads at boot, so your containers start without the manual `nvidia-smi` workaround.
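Once everything is in place, you can verify end to end (again assuming LXD and the hypothetical container name from above):

```bash
# Should print the GPU status table without any manual step on the host first.
lxc exec cuda-ct -- nvidia-smi
```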