Hello,
I’m just starting out with Incus and GPU passthrough, and I’m having a bit of trouble getting SR-IOV to work. I’ll be as brief as possible.
- I’ve installed Ubuntu Server 24.04.2 on a Dell PowerEdge R730
- An NVIDIA Tesla M60 card has been installed in the server; the card is definitely in Graphics mode, tested with the gpumodeswitch utility
- I’ve enabled Global SR-IOV in the BIOS of the server; I’ve disabled Secure Boot to prevent kernel lockdown
- Incus 6.11 is installed from Zabbly stable APT repo
- The latest NVIDIA-GRID-Ubuntu-KVM host drivers have been installed and nouveau blacklisted so that they can take effect (blacklist config sketched below, after the nvidia-smi output)
- I have not as yet installed any GRID licensing server
- The card is present and recognised as two GPUs:
$ nvidia-smi
Tue Apr 15 08:40:49 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.03 Driver Version: 570.124.03 CUDA Version: N/A |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla M60 On | 00000000:84:00.0 Off | Off |
| N/A 37C P8 24W / 150W | 19MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla M60 On | 00000000:85:00.0 Off | Off |
| N/A 33C P8 24W / 150W | 19MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
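In case the blacklisting step matters, this is roughly the shape of it; the file name blacklist-nouveau.conf is just what I happened to use:

$ cat /etc/modprobe.d/blacklist-nouveau.conf
# stop nouveau binding so the NVIDIA GRID host driver can claim the card
blacklist nouveau
options nouveau modeset=0
$ sudo update-initramfs -u   # rebuild the initramfs so the blacklist takes effect at boot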
Unfortunately, when I attempt to add the card to a VM instance with
incus config device add <instance> gpu-1 gpu gputype=sriov pci=0000:84:00.0
and then start the VM, I get the message:
Failed to start device "gpu-1": Couldn't find a matching GPU with available VFs
Indeed, I read that in the directory /sys/bus/pci/devices/0000:84:00.0 I should expect to see a number of driver special files such as max_vfs, but there’s nothing there to indicate that SR-IOV is an option for this card.
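In case I’m looking in the wrong place, these are roughly the checks I’ve been running (the grep patterns are just my guesses at what to look for):

# any SR-IOV related files the kernel exposes for this function
$ ls /sys/bus/pci/devices/0000:84:00.0/ | grep -i sriov
# an SR-IOV capable function should also advertise it in its extended capabilities
$ sudo lspci -vvv -s 84:00.0 | grep -i "single root"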
Any ideas what additional steps I need to take to get SR-IOV working? If I attach one of the Tesla M60 GPUs to the VM as a physical passthrough, the VM boots, and once the guest NVIDIA drivers are installed the card is usable. I just need to see whether I can get SR-IOV working for this proof-of-concept test.
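For reference, the physical passthrough that does work was added with something along these lines:

# pass the whole PCI function 84:00.0 straight through to the VM
$ incus config device add <instance> gpu-1 gpu gputype=physical pci=0000:84:00.0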
Edit: Have I made the rookie mistake of believing the M60 to be SR-IOV compatible when it isn’t? I’ve just checked capabilities with lspci and nothing about SR-IOV shows up:
$ sudo lspci -v -s 84:00.0
84:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation GM204GL [Tesla M60]
Flags: bus master, fast devsel, latency 0, IRQ 104, NUMA node 1, IOMMU group 9
Memory at c9000000 (32-bit, non-prefetchable) [size=16M]
Memory at 3ffe0000000 (64-bit, prefetchable) [size=256M]
Memory at 3fff0000000 (64-bit, prefetchable) [size=32M]
I/O ports at 9000 [size=128]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
I could have sworn I read that this card was SR-IOV capable; I thought it was the whole point of the card, i.e. for vGPU on a host with lots of workstation guests. Am I wrong?
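One thing I notice is that nvidia_vgpu_vfio shows up in the kernel modules above, so maybe vGPU on this generation of card is exposed as mediated devices rather than SR-IOV VFs? If so, I’m guessing the check and the Incus device would look more like the following, though the <profile> is a placeholder as I haven’t confirmed which profiles the M60 offers:

# list the vGPU (mdev) profiles the host driver exposes for this card, if any
$ ls /sys/bus/pci/devices/0000:84:00.0/mdev_supported_types/
# attach one of those profiles to the VM instead of an SR-IOV VF
$ incus config device add <instance> gpu-1 gpu gputype=mdev pci=0000:84:00.0 mdev=<profile>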
Thanks.