GPU resources monitoring for LXD containers

tt-leader · July 30, 2019, 3:12am

I plan to use LXD containers to share the resources of my 8-GPU server. Each user will have a container with all GPU cards mounted. As the server administrator, I want to monitor who is using the most GPU resources. I have found the following approaches, but each has a problem.

Use nvidia-smi on the host machine. I can find the PID of each process, but the mapping between host PIDs and container PIDs is not direct; I cannot convert between them with a formula.
Find the username of the process. But all of these processes show the same user, 296608: processes from different containers have the same owner on the host machine.
Use lxc info container_name. The output lists CPU, memory, and network usage, and in LXD 3.12 GPU resources are also included. However, the GPU entry only has some basic information, with no GPU memory, temperature, or GPU-Util.

Card 0:
    Vendor: NVIDIA Corporation (10de)
    Product: GK208B [GeForce GT 730] (1287)
    PCI address: 0000:00:07.0
    Driver: nvidia (418.56)
    NUMA node: 0
    NVIDIA information:
      Architecture: 3.5
      Brand: GeForce
      Model: GeForce GT 730
      CUDA Version: 10.1
      NVRM Version: 418.56
      UUID: GPU-6ddadebd-dafe-2db9-f10f-125719770fd3

Any suggestions? Thanks in advance.

stgraber · July 30, 2019, 3:16am

Indeed, because of the way NVIDIA integrates with the Linux kernel, they don’t have access to much information that’s restricted to GPL code, including information about namespaces.

You mention being able to look things up by uid, so a solution may be to use security.idmap.isolated=true to have a separate range of uid/gid per container, making it possible to track down the container.
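As a rough illustration of the uid lookup, here is a minimal Python sketch that reads the owning host uid of a process from /proc/<pid>/status on the host. It assumes a Linux host and that the PID comes from the nvidia-smi process list; the function names are hypothetical and not part of LXD or the NVIDIA tools.

```python
def uid_from_status(status_text):
    """Extract the real uid from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # The line looks like: "Uid:\t<real>\t<effective>\t<saved>\t<fs>"
        if line.startswith("Uid:"):
            return int(line.split()[1])
    raise ValueError("no Uid line found")


def host_uid_of_pid(pid):
    """Return the host uid that owns the given host PID (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return uid_from_status(f.read())
```

With isolated idmaps, the uid returned for a GPU process should fall inside the range of exactly one container.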

tt-leader · July 30, 2019, 3:52am

Thanks for the quick response.

I set security.idmap.isolated=true and created two containers. Now they have different uids: 362144 = 296608 + 65536 and 427680 = 362144 + 65536.

But how can I map the uid ranges to containers? My only guess is that the container created first has the smallest uid, and containers created later have larger uids.

stgraber · July 30, 2019, 12:35pm

Using lxc config show, you can get the idmap information for the container. This will tell you what the base uid is on the host and how many uids after that are part of the container.
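A minimal Python sketch of that lookup follows. It assumes the idmap appears in the lxc config show output as a JSON string (for example under a key such as volatile.last_state.idmap) whose entries carry Isuid, Hostid, and Maprange fields; check the actual output on your LXD version, since that shape is an assumption here. The container names user-a and user-b are hypothetical, and the base uids are the ones reported earlier in this thread.

```python
import json


def parse_uid_range(idmap_json):
    """Return (host_base_uid, range_size) from an idmap JSON string.

    Assumes entries shaped like
    {"Isuid": true, "Isgid": false, "Hostid": ..., "Nsid": 0, "Maprange": ...}.
    """
    for entry in json.loads(idmap_json):
        if entry.get("Isuid"):
            return entry["Hostid"], entry["Maprange"]
    raise ValueError("no uid entry in idmap")


def container_for_uid(host_uid, ranges):
    """Return the container whose uid range contains host_uid, or None."""
    for name, (base, size) in ranges.items():
        if base <= host_uid < base + size:
            return name
    return None


# Hypothetical container names with the base uids from this thread.
ranges = {
    "user-a": (362144, 65536),
    "user-b": (427680, 65536),
}
print(container_for_uid(362144 + 1000, ranges))  # uid 1000 inside user-a
```

Building the ranges table once per container and then checking the host uid of each process listed by nvidia-smi should be enough to attribute GPU usage to a container.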