LXD and NVIDIA HGX A100

For anyone else running into this, below is how we got the A100-SXM4 80G working via LXD on Ubuntu 20.04:

  • Ensure nothing is regularly polling the LXD API (such as Prometheus)
  • Enable IOMMU in BIOS
  • Ensure each GPU is isolated in its own IOMMU group
  • Blacklist nouveau
  • (Optional) Force GPUs to bind to vfio-pci
  • Ensure NVIDIA drivers/fabric manager are not installed
  • Create VM via LXD (confirmed working with Ubuntu 20.04 and Ubuntu 22.04)
  • Start VM
    • Do not run any commands until the lxc start command finishes and returns to the prompt
  • Log into VM
    • Update the default GRUB kernel command line (GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub) to include pci=realloc=on
    • Run update-grub and restart the VM
  • Install NVIDIA Datacenter drivers
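
For the IOMMU isolation step on the host, here is a rough sketch (not what we actually ran) of how the check could be scripted in Python. It assumes the kernel exposes groups under /sys/kernel/iommu_groups and treats any group that holds an NVIDIA device (PCI vendor ID 0x10de) plus at least one other device as "shared". The function names are made up for illustration.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def read_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group name to {pci_address: vendor_id}."""
    groups = {}
    root_path = Path(root)
    if not root_path.is_dir():
        # IOMMU is disabled, or the kernel does not expose groups here.
        return groups
    for group_dir in root_path.iterdir():
        devices_dir = group_dir / "devices"
        if not devices_dir.is_dir():
            continue
        devices = {}
        for dev in devices_dir.iterdir():
            vendor_file = dev / "vendor"
            if vendor_file.is_file():
                devices[dev.name] = vendor_file.read_text().strip()
            else:
                devices[dev.name] = "unknown"
        groups[group_dir.name] = devices
    return groups


def shared_nvidia_groups(groups):
    """Return groups where an NVIDIA device shares the group with any other device."""
    return {
        name: devices
        for name, devices in groups.items()
        if NVIDIA_VENDOR_ID in devices.values() and len(devices) > 1
    }
```

Calling shared_nvidia_groups(read_iommu_groups()) on the host should return an empty dict if every GPU sits alone in its group. An empty result from read_iommu_groups() itself suggests IOMMU is not enabled.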

Note that it is also possible to pass in multiple GPUs and NVSwitches. Installing the fabric manager within the VM will initialize the NVLink fabric based on what is passed in. We have tested this successfully while passing in all 8 GPUs and all 6 NVSwitches. We are currently researching the various permutations in between (as there are physical topology items to consider). Will update here once we have the results of that.
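
To sanity-check what the VM actually received before installing the fabric manager, something like the following Python sketch could count the NVIDIA PCI devices visible in the guest. It is only an illustration under two assumptions that should be verified on your hardware: that the GPUs report a display-controller PCI class (0x03xxxx) and that the NVSwitches report a bridge class (0x06xxxx). The function name is hypothetical.

```python
from collections import Counter
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA

# Assumption: GPUs show up as display controllers (class 0x03xxxx) and
# NVSwitches as bridge devices (class 0x06xxxx). Verify on your system.
KIND_BY_CLASS_PREFIX = {"0x03": "gpu", "0x06": "nvswitch"}


def count_nvidia_devices(root="/sys/bus/pci/devices"):
    """Count NVIDIA PCI devices under `root`, grouped by assumed device kind."""
    counts = Counter()
    root_path = Path(root)
    if not root_path.is_dir():
        return counts
    for dev in root_path.iterdir():
        vendor_file = dev / "vendor"
        class_file = dev / "class"
        if not (vendor_file.is_file() and class_file.is_file()):
            continue
        if vendor_file.read_text().strip() != NVIDIA_VENDOR_ID:
            continue
        pci_class = class_file.read_text().strip()
        # Anything NVIDIA that matches neither prefix is counted as "other".
        counts[KIND_BY_CLASS_PREFIX.get(pci_class[:4], "other")] += 1
    return counts
```

With the full set passed in, count_nvidia_devices() inside the VM would be expected to report 8 under "gpu" and 6 under "nvswitch", provided the class assumptions above hold.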
