Containers with nvidia.runtime=true refuse to start after a reboot of the host

It sounds like your NVIDIA CUDA containers won't start until you manually run nvidia-smi on the host. That is a strong hint that the nvidia_uvm kernel module is not being loaded automatically at boot: running nvidia-smi loads the module as a side effect, which is why the containers start afterwards.
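
You can confirm this before changing anything. Right after a reboot, and before touching nvidia-smi, check whether the module is loaded:

    # No output means nvidia_uvm is not loaded and the containers will fail
    lsmod | grep nvidia_uvm

If the module is missing, the following steps make it load at boot: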

  1. Load the nvidia_uvm Module:

    sudo modprobe nvidia_uvm
    
  2. Ensure the Module Loads at Boot:
    Add nvidia_uvm to the /etc/modules file (a systemd-native alternative using /etc/modules-load.d is sketched after this list):

    echo "nvidia_uvm" | sudo tee -a /etc/modules
    
  3. Create and Configure /etc/rc.local:
    Create /etc/rc.local with the following contents, so that nvidia-smi runs once at startup and initializes the devices:

    #!/bin/bash
    /usr/bin/nvidia-smi
    exit 0
    

    Then make the file executable (do this after creating it; chmod fails if the file does not exist yet):

    sudo chmod +x /etc/rc.local
    
  4. Enable and Start rc.local:

    sudo systemctl enable rc.local
    sudo systemctl start rc.local
    

    On systemd-based distributions the compatibility unit is usually named rc-local.service, and systemctl enable may report that the unit is static; that is harmless, since systemd starts it automatically at boot once /etc/rc.local is executable. A small systemd unit that replaces rc.local entirely is sketched after this list.
  5. Reboot and Verify:
    Reboot your system:

    sudo reboot
    

    Check that the module loaded and that the startup script ran without errors:

    dmesg | grep nvidia
    journalctl -u rc-local
    
  6. Check Container Runtime Configuration:
    Make sure nvidia.runtime is set to true in your container configuration (a quick way to check and set it is sketched below).
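
As a systemd-native alternative to the /etc/modules entry in step 2, you can drop a config file into /etc/modules-load.d instead; the file name nvidia-uvm.conf below is just a suggestion:

    # systemd-modules-load reads every *.conf file in this directory at boot
    echo "nvidia_uvm" | sudo tee /etc/modules-load.d/nvidia-uvm.conf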
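
If you prefer not to rely on rc.local at all (steps 3 and 4), a oneshot systemd unit can run nvidia-smi at boot instead; nvidia-smi-init.service is a made-up name for this sketch:

    # /etc/systemd/system/nvidia-smi-init.service (hypothetical unit name)
    [Unit]
    Description=Run nvidia-smi once at boot to initialize NVIDIA devices

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/nvidia-smi

    [Install]
    WantedBy=multi-user.target

Enable it with sudo systemctl enable nvidia-smi-init.service, and it takes the place of steps 3 and 4.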
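
And for step 6, assuming the containers run under LXD (where nvidia.runtime is an instance configuration key), you can check and set the option like this; mycontainer is a placeholder name:

    # Show the container's current NVIDIA-related configuration
    lxc config show mycontainer | grep nvidia

    # Set the runtime option if it is missing, then restart the container
    lxc config set mycontainer nvidia.runtime true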

These steps should ensure the nvidia_uvm module loads at boot, allowing your containers to start properly.
