LXD and NVIDIA HGX A100

cterror · March 28, 2023, 6:15pm

I am attempting to get an NVIDIA HGX A100 gpu passed through into an LXD VM without much success. I have had success with getting one passed through (both via physical and mig) to a container.

I’ve found and watched a couple videos on the topic which provide instruction for a VM
https://www.youtube.com/watch?v=T0aV2LsMpoA
https://www.youtube.com/watch?v=1i45zTu42i0

Had a couple questions I am hoping someone can answer

It appears in order to create the SRIOV devices in the video, it requires the vGPU / GRID driver. Can someone confirm this?
1a) Understanding it is a licensed product, does anyone know of a vGPU / GRID specific to a vanilla Ubuntu20.04 installation? Having some issue with the run file NVIDIA provided appearing to be expecting XEN.
It appears (and i assume) the A100 in the video is not the HGX variant. Can someone confirm this?
2a) The HGX is different than the PCIe version of the A100 in that the GPUs are connected together via NVLink/NVSwitch. I’ve read some places that due to this, using the pure PCIe pass thru method (not using the vGPU or Datacenter driver) would require passing through any connected NVSwitch’s for a given GPU thus requiring all GPU’s and NVSwitch’s to be passed in since they are all connected (I am guessing this is why the vGPU abstraction is required for VM’s). Does anyone know if its possible to pass a single GPU on an HGX platform through to a VM either not using host drivers at all (pure PCIe passthru) or with the datacenter driver (physical or mig)?

Would appreciate any assistance/insight that can be provided.

stgraber · March 29, 2023, 3:08am

Yeah, SR-IOV on A100 is only there to add memory protection (through separate PCIe addresses) around normal mdev (vGPU) functions. It can’t be used on its own the way you would on say AMD server GPUs.

As for the driver, there are NVIDIA .deb version of the vGPU host and guest drivers which work just fine. Our CI runs on 22.04 these days but I believe we had 20.04 versions of that too (though maybe that was pre-release stuff).

Right, the card we use in our dev, CI and demos is a PCIe version of the A100.
I don’t know that it being PCIe vs HGX really matters there as the A100 also has NVLINK fingers, so if I somehow had more than one A100, I could connect it through NVLINK to another card. I believe cards can also be connected to ConnectX network cards.

In such a setup, I can definitely see things getting quite messy pretty quickly.
From a PCIe IOMMU standpoint, you’re probably fine as the controller will still see each card in its own group. But the driver may have a very bad day as you may have the other card actively sharing memory with the card now being moved, causing all kind of weird issues or crashes.

Again, I just have the one card so can’t test this, but I’d hope that the NVIDIA driver has some kind of checks for this and will refuse to load if it detects just one card when it knows that card is connected to another over NVLINK.

As for why the vGPU stuff exists. The reason for it is more about flexibility.
Normal full card passthrough or SR-IOV isn’t very flexible about slicing and resource allocation. Using mdev as is used by NVIDIA vGPU allows for exposing a very large number of individual profiles that you can pick from, every time updating the list of the remaining profiles which can still be used. Another benefit is that you’re not limited to the maximum number of PCIe devices on a single switch, so in theory could run hundreds of VMs through vGPU (though I don’t know of a card that can do that, the max number of the smallest profile usually isn’t all that high).

Hope that helps a bit.

cterror · March 29, 2023, 4:18am

Appreciate the info. That does help to clarify a few things.

Ironically right after I posted this, NVIDIA got me pointed to the appropriate driver which I was able to get installed successfully.

My goal is to functionally get a cards worth of resource into a VM (whether that be through passthru or abstraction). The reason for a VM vs container is so that a given driver/cuda version could be defined for a given use case (differ between instances). The current solution for containers requires all containers to inherit the driver/cuda version installed on the host (as I know you know since you developed it).

Thus far I have tried the following things.

I have successfully been able to pass in a full GPU via physical passthru and MIG to a container while having the NVIDIA datacenter drivers installed and NVIDIA runtime param set. As mentioned, this does not meet the objective.

I have not been successful in passing a full GPU via physical passthru to a VM while having the NVIDIA datacenter drivers installed. The VM hangs on start for what I am guessing is because the NVIDIA driver has attached itself to the card (nothing ever makes it to the KVM log).

I have not been successful in passing a full GPU via physical passthru to a VM while having no drivers installed (No NVIDIA drivers, noveau blacklisted). The VM will start and I am able to see the device passed through via lspci. After driver install in the VM and reboot, the driver never attaches to the device (nvidia-smi complains of no /dev/nvidia existing).

I have not been successful in passing a full GPU via physical passthru to a VM while having vGPU host drivers install. The VM will start and I am able to see the device passed through via lspci. After vGPU guest driver install in the VM and reboot, the driver never attaches to the device (nvidia-smi complains of no /dev/nvidia existing).

My next avenue was to enable SR-IOV on the cards (as you did in the video) which didn’t work.

./sriov-manage -e ALL
Enabling VFs on 0000:07:00.0
./sriov-manage: line 184: echo: write error: Input/output error

dmesg
[  132.228583] NVRM: GPU 0000:07:00.0: UnbindLock acquired
[  132.520450] pci-pf-stub 0000:07:00.0: claimed by pci-pf-stub
[  133.130411] pci 0000:07:00.4: [10de:20b2] type 7f class 0xffffff
[  133.135325] pci 0000:07:00.4: unknown header type 7f, ignoring device
[  134.178116] NVRM: GPU at 0000:07:00.0 has software scheduler DISABLED with policy BEST_EFFORT.

I am working with NVIDIA to try and sort things out but they have told me that they do not support LXD via vGPU. I would really like to get this using utilizing LXD.

I was hoping someone might have already fought this battle and won but have not been able to find anything (hence me posting here).

I would be more than happy to give you access to my node if there is any value on your side to further developing LXD to support this use case (assuming anything could or would need to be done on the LXD side and its not on NVIDIA’s end).

stgraber · March 29, 2023, 4:36am

What happens if you use a simple pci device to pass through the full GPU with no driver installed on the host? That would avoid the issue of having the host driver mess with it and it should let you do it without having to run the GRID stack.

If using GRID, the question is whether they have an mdev profile which is an entire card, if they do, then passing that to a guest using gputype=mdev should work fine but you’ll need the DC guest drivers in that VM.

cterror · March 29, 2023, 7:54pm

Just tested using a pci device with same results as a gpu type physical device. The device shows up in the VM via lspci but the driver wont load.

I just found the blurb in the vGPU docs that state that when using pass thru, any connected GPU’s via NVSwitch/NVLink have to be passed into the same VM. In the case of the HGX, all 8 GPU’s are connected.

It seems with the A100, in order to create an mdev, it requires enablement of SR-IOV which is of course where I am running into issues. I have an inquiry out to NVIDIA about said issue but I suspect its related to the NVSwitch/NVLink stuff associated with this platform.

Will update once I have more info.

stgraber · March 29, 2023, 8:07pm

Okay, definitely sounds like them all being connected over NVLink is causing more issues than good here.

Is there any way to disable that stuff so they’re effectively all standalone rather than inter-connected? With the PCIe SKUs, you could just not connect the NVLINK bridges, but I guess with HGX it’s not quite that simple, though maybe there’s some kind of firmware config to turn off that part?

cterror · April 17, 2023, 8:54pm

Apologize for the delay in responding as I have been busy trying to get things working.

We found out we had a bit of a red herring with everything we had been doing. We starting going the vGPU route due to the hang issues we were seeing using physical pass through with or without the datacenter drivers assuming it was related to the capability of the GPU to be passed through.

Through a lot of testing we figured out the following.

As long as no drivers/modules are bound to the GPU’s (or you force binding to vfio-pci), ACPI is enabled, and each GPU is in an isolated IOMMU group, the GPU will physically pass into the VM

We identified the source of the hang we were experiencing…

Starting a VM with a passed through GPU and large amount of ram (128GB+) takes about 30 seconds to 2 minutes for the lxc start command to return
- Does not happen if no limit for RAM is defined OR no GPU passed through
- Only occurs when the combination of a large amount of RAM and a GPU is passed through
We originally thought that the issue was with the VM (QEMU) starting due to not being able to acquire the GPU resource
- As long as fabric manager is not installed (or any NVIDIA drivers for that matter), the NVLink/NVSwitch fabric will not be initialized
- QEMU does actually start albeit after quite a delay (30 seconds to 2 minutes as mentioned above)
The hang appear to occur with the LXD daemon
- As long as nothing else is hitting the LXD daemon API while the VM is starting, start command will return and things will function as expected
- If the API is hit during this start window, it appears to cause the LXD daemon to hang
Our original deployment involved juju pushing out lxd, prometheus and grafana charms
- We discovered that prometheus would scape at intervals smaller than the window above resulting in triggering the daemon hang

After removing the prometheus charm, things worked albeit with the caveats mentioned above.

The issue can be reproduced by

Creating a VM with a large amount of RAM and passing in a GPU
Starting the VM
Spamming lxc list a few times after the start command

cterror · April 17, 2023, 8:55pm

For anyone else who is running up against this, below is how we were able to get the A100-SXM4 80G working via LXD on Ubuntu 20.04

Ensure there is nothing that is regularly polling the LXD API (like prometheus)
Enable IOMMU in BIOS
Ensure GPU’s are isolated in their own IOMMU group
Blacklist nouveau
(Optional) Force GPU’s to bind to vfio-pci
Ensure NVIDIA drivers/fabric manager are not installed
Create VM via LXD (confirmed working with Ubuntu 20.04 and Ubuntu 22.04)
- Add GPU device (gputype=physical)
- Add the following config option
  - raw.qemu="-global q35-pcihost.pci-hole64-size=256G"
  - Credit to https://forums.developer.nvidia.com/t/hgx-a100-vm-passthrough-issues-on-ubuntu-20-04/183099/5 for this component
  - Note that the memory size is calculated at 256G per GPU so this would scale out with each VM added into a container
    - i.e. 512GB if two GPU’s are being passed in
Start VM
- Do not run any commands until the lxc start command finishes and returns to prompt
Log into VM
- update grub cmdline default to include pci=realloc=on
- run update-grub and restart VM
Install NVIDIA Datacenter drivers

Note, it is possible to pass in multiple GPU’s and NVSwitches as well. Installing the fabric manager within the VM will initialize the NVLink fabric based upon what is passed in. We have tested this will passing in all 8 GPU’s and all 6 NVSwitches successfully. We are currently researching the various permutations of things in between (as there is a physical topology items to consider). Will update here once we have the results of that.