I have a test setup with a Tesla A100 GPU on which I created several MIG devices. Each MIG device has been assigned a device ID.
So for example my new MIG device ID is 0:0 (GPUDeviceIndex:MIGDeviceIndex), and I pass this to the LXD container with the following command: lxc config device add test gpu0 gpu id=0:0
But LXD fails to detect this ID and shows this error: Error: Failed to start device "gpu0": Failed to detect requested GPU device
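For reference, this is roughly how the MIG devices show up on my side (the UUIDs and profile name below are placeholders, shortened from my actual output):

```
# List the parent GPU and its MIG devices
nvidia-smi -L
# GPU 0: A100-PCIE-40GB (UUID: GPU-xxxxxxxx-xxxx-...)
#   MIG 3g.20gb Device 0: (UUID: MIG-GPU-xxxxxxxx-xxxx-.../1/0)

# List the GPU instances backing those devices
nvidia-smi mig -lgi
```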
@genesis96839 Could you please provide the output of lxc query /1.0/resources | jq '.gpu'? My guess is that id=0:0 is incorrect as this is usually just an integer.
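For comparison, on a system where LXD detects the card, that query returns something shaped roughly like this (trimmed, with placeholder values; exact fields vary by LXD version):

```
{
  "cards": [
    {
      "driver": "nvidia",
      "driver_version": "450.80.02",
      "numa_node": 0,
      "pci_address": "0000:07:00.0",
      "product": "GA100 [A100 PCIe 40GB]",
      "product_id": "20f1",
      "vendor": "NVIDIA Corporation",
      "vendor_id": "10de"
    }
  ],
  "total": 1
}
```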
NVIDIA tells me I should be receiving one soon, but there’s no exact ETA.
Without having spent some time with one of those cards, it’s hard to guess what would be needed to support this feature.
It will also depend on whether libnvidia-container supports it already.
Yeah, sadly the only compatible system I have around is an arm64 box, and NVIDIA doesn’t seem to have vGPU drivers for that. I’m waiting for more hardware to arrive over the coming days, hoping I can put together a test system here that’s compatible with this card (none of our Xeon based systems seem to like it somehow…).
Oh ok. The last time I got my hands on a physical system, it had an AMD EPYC series CPU with A100s. But recently when I tried LXD for MIG on the A100s, it was on a VM with Intel Xeons, and I’m not sure which generation it was. Let me see if I can get the name of that Xeon CPU.
Looks like the system in question had an issue with above-4G memory mappings for GPUs.
Instead I now have the A100 in a consumer-grade machine I quickly assembled: a Ryzen 3600 on an Asus Pro-WS-X570-ACE. It’s properly detected, nvidia-smi looks happy, and I’m now just waiting for access to the vGPU drivers to validate all the features.
Well I’m also waiting for a bunch of extra fans and a fan controller, as I’ve got nowhere near enough airflow in that test system to keep both that A100 and the Mellanox ConnectX-6 dual-100Gb card cool.
Awesome! Looking forward to the updates on MIG instances.
Oh wow. I’ve never got my hands on a Mellanox card, especially the 100Gb ones.
I believe latency would be minimal when using such powerful interconnects for HPC clustering. I’m pretty interested in the clustering aspect, as I’m working on a remote-node clustering solution and such speedy interconnects are only a dream in that case lol.
I’ve fixed our support for mdev+SR-IOV on the A100 for the VM side of things.
I’ve got a few minor UX fixes I want to sort out for that side of the story before I investigate MIG for containers.
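For anyone following along, passing one of those SR-IOV-backed vGPUs to a VM now looks roughly like this (a sketch; the mdev profile name is just an example, the real ones are listed per-card in lxc query /1.0/resources):

```
# Pass an NVIDIA vGPU (mdev) through to a LXD virtual machine
lxc config device add v1 gpu0 gpu gputype=mdev mdev=nvidia-468 pci=0000:07:00.0
lxc start v1
```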