For anyone else who is running up against this, below is how we were able to get the A100-SXM4 80G working via LXD on Ubuntu 20.04
- Ensure there is nothing regularly polling the LXD API (like Prometheus)
- Enable IOMMU in BIOS
- Ensure the GPUs are isolated in their own IOMMU groups (see the IOMMU check sketched after this list)
- Blacklist nouveau
- (Optional) Force the GPUs to bind to vfio-pci (both steps are sketched after this list)
- Ensure NVIDIA drivers/fabric manager are not installed
- Create VM via LXD (confirmed working with Ubuntu 20.04 and Ubuntu 22.04); example `lxc` commands for this and the next two steps are sketched after this list
- Add GPU device (gputype=physical)
- Add the following config option
- raw.qemu="-global q35-pcihost.pci-hole64-size=256G"
- Credit to https://forums.developer.nvidia.com/t/hgx-a100-vm-passthrough-issues-on-ubuntu-20-04/183099/5 for this component
- Note that the memory size is calculated at 256G per GPU, so this scales with each GPU passed into the VM
- i.e. 512G if two GPUs are being passed in
- Start VM
- Do not run any commands until the `lxc start` command finishes and returns to the prompt
- Log into VM
- Update the GRUB cmdline default to include `pci=realloc=on` (sketched after this list)
- Run `update-grub` and restart the VM
- Install NVIDIA Datacenter drivers
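
For anyone who wants a bit more detail on the steps above, here are rough sketches; PCI addresses, device IDs, and names are placeholders, so adjust them to your system. First, checking on the host that IOMMU is active and that each GPU lands in its own group:

```bash
# On the host: confirm IOMMU is enabled and list the devices per group.
# Each A100 should end up in its own IOMMU group.
sudo dmesg | grep -i -e DMAR -e IOMMU
for d in /sys/kernel/iommu_groups/*/devices/*; do
    group=$(basename "$(dirname "$(dirname "$d")")")
    echo "IOMMU group ${group}: $(lspci -nns "$(basename "$d")")"
done | sort -V
```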
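Blacklisting nouveau and (optionally) pre-binding the GPUs to vfio-pci can be done via modprobe.d; the `10de:20b2` vendor:device ID below is only an example, use whatever `lspci -nn` reports for your cards:

```bash
# On the host: keep nouveau off the GPUs.
cat <<'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

# Optional: bind the GPUs to vfio-pci at boot (example device ID).
cat <<'EOF' | sudo tee /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:20b2
softdep nouveau pre: vfio-pci
EOF

sudo update-initramfs -u   # then reboot the host
```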
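Creating the VM, adding the GPU device, and setting the raw.qemu option looks roughly like this (VM name, limits, and PCI address are made up for illustration):

```bash
# On the host: create the VM, pass one GPU through by PCI address,
# and size the 64-bit PCI hole at 256G per GPU being passed in.
lxc init ubuntu:20.04 a100-vm --vm -c limits.cpu=16 -c limits.memory=64GiB
lxc config device add a100-vm gpu0 gpu gputype=physical pci=0000:07:00.0
lxc config set a100-vm raw.qemu "-global q35-pcihost.pci-hole64-size=256G"
lxc start a100-vm   # wait for this to return before doing anything else
```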
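And inside the VM, the `pci=realloc=on` change boils down to:

```bash
# Inside the VM: append pci=realloc=on to the default kernel cmdline,
# regenerate the GRUB config, and reboot.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 pci=realloc=on"/' /etc/default/grub
sudo update-grub
sudo reboot
```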
Note, it is possible to pass in multiple GPUs and NVSwitches as well. Installing the fabric manager within the VM will initialize the NVLink fabric based upon what is passed in. We have tested this by passing in all 8 GPUs and all 6 NVSwitches successfully. We are currently researching the various permutations in between (as there are physical topology items to consider). Will update here once we have the results of that.
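
For the in-VM install, something along these lines should cover the datacenter driver plus fabric manager; the driver branch and package names here are only examples, pick whichever datacenter driver you are actually deploying:

```bash
# Inside the VM: install a datacenter driver and the matching fabric
# manager, then enable the service so the NVLink fabric initializes.
sudo apt update
sudo apt install -y nvidia-driver-535-server nvidia-fabricmanager-535
sudo systemctl enable --now nvidia-fabricmanager
sudo reboot
```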