I am attempting to pass an NVIDIA HGX A100 GPU through to an LXD VM, without much success. I have successfully passed one through to a container, both as a physical GPU and as a MIG instance.
I’ve found and watched a couple of videos on the topic that walk through the setup for a VM:
https://www.youtube.com/watch?v=T0aV2LsMpoA
https://www.youtube.com/watch?v=1i45zTu42i0
I have a couple of questions I'm hoping someone can answer:
1) It appears that creating the SR-IOV devices shown in the video requires the vGPU / GRID driver. Can someone confirm this?
1a) Understanding that it is a licensed product, does anyone know of a vGPU / GRID driver built for a vanilla Ubuntu 20.04 installation? I'm having trouble with the .run file NVIDIA provided, which appears to expect Xen.
2) It appears (and I assume) that the A100 in the video is not the HGX variant. Can someone confirm this?
2a) The HGX differs from the PCIe version of the A100 in that its GPUs are connected to one another via NVLink/NVSwitch. I’ve read in a few places that, because of this, pure PCIe passthrough (without the vGPU or Datacenter driver) would require passing through every NVSwitch connected to a given GPU. Since all the GPUs and NVSwitches are interconnected, that would in turn mean passing in all of them. (I'm guessing this is why the vGPU abstraction is required for VMs.) Does anyone know whether it's possible to pass a single GPU on an HGX platform through to a VM, either without host drivers at all (pure PCIe passthrough) or with the Datacenter driver (physical or MIG)?
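Related to 2a: before trying pure PCIe passthrough, one rough host-side check is to list which PCI devices share a GPU's IOMMU group, since VFIO assigns devices to a guest at the granularity of an IOMMU group. Below is a minimal Python sketch, assuming the standard Linux sysfs layout under /sys/kernel/iommu_groups. The function names are hypothetical helpers, not part of LXD or any NVIDIA tool, and the script only reports IOMMU grouping; it says nothing about NVLink/NVSwitch dependencies.

```python
from pathlib import Path

SYSFS_ROOT = "/sys/kernel/iommu_groups"  # standard Linux location (assumed present)


def iommu_groups(root=SYSFS_ROOT):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # No IOMMU groups exposed (IOMMU disabled or not a Linux host).
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if not devices_dir.is_dir():
            continue
        # Each entry in devices/ is named after a PCI address, e.g. 0000:07:00.0.
        groups[int(group_dir.name)] = sorted(d.name for d in devices_dir.iterdir())
    return groups


def group_peers(pci_addr, root=SYSFS_ROOT):
    """Return the other devices sharing pci_addr's IOMMU group (empty list if isolated)."""
    for devices in iommu_groups(root).values():
        if pci_addr in devices:
            return [d for d in devices if d != pci_addr]
    raise KeyError(f"{pci_addr} not found in any IOMMU group")


if __name__ == "__main__":
    # Print every group and its member devices.
    for gid, devs in sorted(iommu_groups().items()):
        print(gid, " ".join(devs))
```

Running it on the HGX host and looking up a GPU's address (for example a hypothetical `group_peers("0000:07:00.0")`) would show whether that GPU shares its group with other devices. An empty list only means the IOMMU side allows isolation; it does not answer whether the GPU works in a VM without its NVSwitch peers.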
I would appreciate any assistance or insight.