Containers with nvidia.runtime=true refuse to start after a reboot of the host

It sounds like your NVIDIA CUDA containers won't start until you manually run nvidia-smi on the host. That is a strong hint that the nvidia_uvm kernel module is not being loaded automatically at boot: running nvidia-smi loads the module as a side effect, which is why the containers start afterwards.
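
You can confirm this before changing anything. Right after a reboot, and before touching nvidia-smi, check whether the module is loaded:

    # No output means nvidia_uvm is not loaded and the containers will fail
    lsmod | grep nvidia_uvm

If the module is missing, the following steps make it load at boot: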

  1. Load the nvidia_uvm Module:

    sudo modprobe nvidia_uvm
    
  2. Ensure the Module Loads at Boot:
    Add nvidia_uvm to the /etc/modules file (a systemd-native alternative using /etc/modules-load.d is sketched after this list):

    echo "nvidia_uvm" | sudo tee -a /etc/modules
    
  3. Create and Configure /etc/rc.local:
    Create /etc/rc.local with the following contents, so that nvidia-smi runs once at startup and initializes the devices:

    #!/bin/bash
    /usr/bin/nvidia-smi
    exit 0
    

    Then make the file executable (do this after creating it; chmod fails if the file does not exist yet):

    sudo chmod +x /etc/rc.local
    
  4. Enable and Start rc.local:

    sudo systemctl enable rc.local
    sudo systemctl start rc.local
    

    On systemd-based distributions the compatibility unit is usually named rc-local.service, and systemctl enable may report that the unit is static; that is harmless, since systemd starts it automatically at boot once /etc/rc.local is executable. A small systemd unit that replaces rc.local entirely is sketched after this list.
  5. Reboot and Verify:
    Reboot your system:

    sudo reboot
    

    Check that the module loaded and that the startup script ran without errors:

    dmesg | grep nvidia
    journalctl -u rc-local
    
  6. Check Container Runtime Configuration:
    Make sure nvidia.runtime is set to true in your container configuration (a quick way to check and set it is sketched below).
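
As a systemd-native alternative to the /etc/modules entry in step 2, you can drop a config file into /etc/modules-load.d instead; the file name nvidia-uvm.conf below is just a suggestion:

    # systemd-modules-load reads every *.conf file in this directory at boot
    echo "nvidia_uvm" | sudo tee /etc/modules-load.d/nvidia-uvm.conf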
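
If you prefer not to rely on rc.local at all (steps 3 and 4), a oneshot systemd unit can run nvidia-smi at boot instead; nvidia-smi-init.service is a made-up name for this sketch:

    # /etc/systemd/system/nvidia-smi-init.service (hypothetical unit name)
    [Unit]
    Description=Run nvidia-smi once at boot to initialize NVIDIA devices

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/nvidia-smi

    [Install]
    WantedBy=multi-user.target

Enable it with sudo systemctl enable nvidia-smi-init.service, and it takes the place of steps 3 and 4.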
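
And for step 6, assuming the containers run under LXD (where nvidia.runtime is an instance configuration key), you can check and set the option like this; mycontainer is a placeholder name:

    # Show the container's current NVIDIA-related configuration
    lxc config show mycontainer | grep nvidia

    # Set the runtime option if it is missing, then restart the container
    lxc config set mycontainer nvidia.runtime true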

These steps should ensure the nvidia_uvm module loads at boot, allowing your containers to start properly.
