Is there any way to add a MIG Tesla A100 GPU device to an LXD container?

For now it relies on pre-created GPU and compute instances.
In theory we could have LXD handle the creation too, by having it create and destroy the instances with you providing the profile names, but it can get a bit messy as one GPU instance can have several compute instances.
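For reference, the pre-creation itself is done with nvidia-smi. A rough sketch, run as root (the profile ID is an example; IDs vary per GPU model, so list them first):

nvidia-smi -i 0 -mig 1          # enable MIG mode on GPU 0 (may need a GPU reset or reboot to take effect)
nvidia-smi mig -lgip            # list the GPU instance profiles this GPU supports
nvidia-smi mig -i 0 -cgi 9 -C   # create a GPU instance from profile 9 plus its default compute instance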

Thanks a ton @stgraber for the quick response on this. :raised_hands:t2: This looks like the expected behaviour as the MIG GPUs are visible inside the container. Looking great.

I had also created the MIG compute instances initially and then tried adding them, but that didn’t help. Could you help me understand what I should specify in the lxc config device add command?

I tried it out like this last time:
lxc config device add test gpu0 gpu id=0:0
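For reference, the general form of the command is:

lxc config device add INSTANCE-NAME DEVICE-NAME TYPE [KEY=VALUE...]

so gpu0 above is the device name and gpu the device type.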

Currently you can’t; we need a new LXD feature for this, which will be in 4.13.

I see. Sure, that’d still be great. Is there an expected timeline for the 4.13 release?

Currently still scheduled for today, but that won’t happen; looking at next Thursday at this point.


Awesome! Looking forward to it. :slight_smile:

Am I correct in understanding from the thread above that I can now pass a MIG GPU through to my LXC container?

lxc config device add equipped-sole gpu gpu id=0:0

This command does not seem to work, and I also tried the MIG-ID to no avail.

Any pointers on how to pass through only a single MIG GPU?

Thanks

MIG relies on gputype=mig and then passing the MIG CI UUID through uuid=UUID.

Hi Stéphane,

Thanks for your reply. I am unfortunately still unable to achieve MIG GPU passthrough to my Linux container.

I launched a Linux container on Ubuntu as follows:

lxc launch -v ubuntu:22.04 -c nvidia.runtime=true

I can see that it is launched and running.

root@****10de001:~# lxc list
+----------------+---------+----------------------+------+-----------+-----------+
|      NAME      |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS |
+----------------+---------+----------------------+------+-----------+-----------+
| relaxed-possum | RUNNING | 172.19.129.46 (eth0) |      | CONTAINER | 0         |
+----------------+---------+----------------------+------+-----------+-----------+

I can also connect to it.

Next I look up my GPUs and can see MIG is successfully enabled on GPU 0.

root@*****10de001:~# nvidia-smi
Tue Dec 13 17:37:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:17:00.0 Off |                   On |
| N/A   39C    P0    47W / 300W |     38MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:65:00.0 Off |                    0 |
| N/A   41C    P0    47W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   39C    P0    55W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  Off  | 00000000:E3:00.0 Off |                    0 |
| N/A   43C    P0    56W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    1   0   0  |     19MiB / 40192MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    2   0   1  |     19MiB / 40192MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I then look up the UUIDs to pass them to the container.

root@****10de001:~# nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-db63bb91-8329-7702-ddb3-a95f8a91c9ea)
  MIG 3g.40gb     Device  0: (UUID: MIG-3c752391-2e63-5240-a203-f905e007b241)
  MIG 3g.40gb     Device  1: (UUID: MIG-761d5795-2aad-5c33-b6e1-e2cd13e9e31d)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-2176dc29-3c20-900f-fc05-fbde542a7966)
GPU 2: NVIDIA A100 80GB PCIe (UUID: GPU-ada7f79e-6d7e-b1cf-3f6b-24f0270b5ea4)
GPU 3: NVIDIA A100 80GB PCIe (UUID: GPU-26759b5e-33fa-e909-55a2-9486ec030e34)

The last step is where I get stuck. I have tried a variety of formats here and keep getting error messages.

root@*****10de001:~# lxc config device add relaxed-possum gpu gputype=mig uuid=MIG-3c752391-2e63-5240-a203-f905e007b241
Error: Invalid devices: Device validation failed for "gpu": Failed loading device "gpu": Unsupported device type

root@*****10de001:~# lxc config device add relaxed-possum gputype=mig uuid=3c752391-2e63-5240-a203-f905e007b241
Error: Invalid devices: Device validation failed for "gputype=mig": Name can only contain alphanumeric, forward slash, hyphen, colon, underscore and full stop characters
root@*****10de001:~# lxc config device add relaxed-possum gpu gpu gputype=mig uuid=3c752391-2e63-5240-a203-f905e007b241
Error: Invalid devices: Device validation failed for "gpu": Invalid device option "uuid"
root@*****10de001:~# lxc config device add relaxed-possum gpu gpu gputype=mig uuid=MIG-3c752391-2e63-5240-a203-f905e007b241
Error: Invalid devices: Device validation failed for "gpu": Invalid device option "uuid"

Can you please clarify the format and argument order for lxc config device add?

Thanks

So:

lxc config device add INSTANCE-NAME DEVICE-NAME gpu gputype=mig mig.uuid=UUID
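For example, with the first MIG device from your nvidia-smi -L output above, that would be something like:

lxc config device add relaxed-possum mig0 gpu gputype=mig mig.uuid=3c752391-2e63-5240-a203-f905e007b241

(mig0 is just an arbitrary device name, and the MIG- prefix shown by nvidia-smi -L is dropped here.)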

Hi Stéphane,

Thanks for the document reference and the config device clarification; that let me add the device to my container instance.

root@*****10de001:~# lxc config device add noble-pug dev-gpu gpu gputype=mig mig.uuid=3c752391-2e63-5240-a203-f905e007b241
Error: Failed to add device "dev-gpu": Device cannot be added when instance is running

So attempting to add the MIG GPU while the instance was running failed with a meaningful message that prompted me to stop the instance first. This is not the case with a standard GPU.

root@*****10de001:~# lxc stop noble-pug
root@*****10de001:~# lxc config device add noble-pug dev-gpu gpu gputype=mig mig.uuid=3c752391-2e63-5240-a203-f905e007b241
Device dev-gpu added to noble-pug

The command to add the device now works as expected, and I get a positive feedback message that my device was successfully added.
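If you want to double-check what was recorded, this should print the device and its options back as YAML:

lxc config device show noble-pug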

My issue now is that, once I have added the device successfully, I cannot start my container instance; I get the following error message:

root@*****10de001:~# lxc start noble-pug
Error: Failed to start device "dev-gpu": Card isn't a NVIDIA GPU or driver isn't properly setup
Try `lxc info --show-log noble-pug` for more info

Looking at the log as suggested is not helping much; it seems to be empty, and searching online for solutions turns up nothing meaningful.

root@*****10de001:~# lxc info --show-log noble-pug
Name: noble-pug
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/12/14 11:08 UTC
Last Used: 2022/12/14 11:13 UTC

Log:


root@*****10de001:~#

If I add a standard GPU in this way, I do not encounter driver issues.

Are there any other pointers you can provide to help solve this issue, please?

Thanks

Jattie

Hi Stéphane,

I managed to get the problem resolved after I found your video tutorial in the documentation.

The short version of the solution: a server reboot was required.

When I enabled MIG, it showed Enabled* instead of Enabled as in your video. I am not sure what causes the difference, and in order to get that resolved I needed to reboot my server to get it to Enabled without the *.
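In case it helps anyone else: the asterisk appears to mean the mode change is still pending a GPU reset. The current versus pending MIG mode can be queried with something like:

nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv

If the two mode columns differ, the change hasn’t taken effect yet, which the reboot resolves.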

I now have exactly what I was looking for to set up containers for developers, with GPU resources exposed and the four GPUs sliced into smaller chunks.

I did notice that a server reboot loses my MIG instances and I need to re-create them. This is probably outside the scope of LXC/LXD, but are there recommended best practices to achieve MIG persistence across server reboots?

Yeah, it’s a bit odd that some bits of MIG are persistent and some others are not.

It’s also not helping that nvidia-smi isn’t the easiest thing to script.
Initially I was hoping to have LXD be able to pick up a CI/GI based on something a bit more flexible than a UUID, but I couldn’t find a good way to do this without having to do a lot of very fragile parsing.
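For what it’s worth, the MIG UUIDs at least can be pulled out of nvidia-smi -L fairly reliably, e.g.:

nvidia-smi -L | grep -o 'MIG-[0-9a-f-]*'

And for recreating the instances at boot, one rough sketch (untested; the script path, unit name, and profile IDs are illustrative assumptions) is a oneshot systemd unit:

#!/bin/sh
# /usr/local/sbin/mig-setup.sh (hypothetical path)
set -e
nvidia-smi -i 0 -mig 1           # make sure MIG mode is enabled on GPU 0
nvidia-smi mig -i 0 -cgi 9,9 -C  # recreate two 3g.40gb GPU instances plus their default compute instances

# /etc/systemd/system/mig-setup.service (hypothetical unit)
[Unit]
Description=Recreate MIG GPU/compute instances at boot
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/mig-setup.sh

[Install]
WantedBy=multi-user.target

One caveat: recreated MIG devices may come back with new UUIDs, so any mig.uuid entries on LXD instances would likely need updating after a reboot as well.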

Hi @stgraber, I need some help with this. I need to pass MIG GPUs into an LXD container with security.privileged set to true, but MIG requires nvidia.runtime, and nvidia.runtime doesn’t work with a privileged container.

This is the output I am getting when trying to add MIG GPUs to a privileged container:
Error: Failed to start device "gpu0": nvidia.runtime must be set to true for MIG GPUs to work

And when I try to add nvidia.runtime to the same privileged container, I get this:
Error: Invalid config: nvidia.runtime is incompatible with privileged containers