The problem occurs after running "sudo lxc config set container-name nvidia.runtime=true"; "sudo lxc config device add container-name gpu gpu gputype=physical" on its own seems to work just fine.
After I set nvidia.runtime=true, every container I try (I have tried Arch, Debian, Ubuntu and OpenRC Gentoo containers, to get a bit of variety of distros) fails to start with the following in its logs:
Name: Gentoo-Nvidia
Location: none
Remote: unix://
Architecture: x86_64
Created: 2021/12/31 02:41 AEDT
Status: Stopped
Type: container
Profiles: default
Log:
lxc Gentoo-Nvidia 20211230170057.516 WARN cgfsng - cgroups/cgfsng.c:fchowmodat:1251 - No such file or directory - Failed to fchownat(44, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc Gentoo-Nvidia 20211230170057.670 ERROR conf - conf.c:run_buffer:321 - Script exited with status 1
lxc Gentoo-Nvidia 20211230170057.670 ERROR conf - conf.c:lxc_setup:4381 - Failed to run mount hooks
lxc Gentoo-Nvidia 20211230170057.670 ERROR start - start.c:do_start:1275 - Failed to setup container "Gentoo-Nvidia"
lxc Gentoo-Nvidia 20211230170057.670 ERROR sync - sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc Gentoo-Nvidia 20211230170057.673 WARN network - network.c:lxc_delete_network_priv:3617 - Failed to rename interface with index 0 from "eth0" to its initial name "vethdc487900"
lxc Gentoo-Nvidia 20211230170057.673 ERROR lxccontainer - lxccontainer.c:wait_on_daemonized_start:867 - Received container state "ABORTING" instead of "RUNNING"
lxc Gentoo-Nvidia 20211230170057.673 ERROR start - start.c:__lxc_start:2069 - Failed to spawn container "Gentoo-Nvidia"
lxc Gentoo-Nvidia 20211230170057.673 WARN start - start.c:lxc_abort:1039 - No such process - Failed to send SIGKILL via pidfd 45 for process 22526
lxc Gentoo-Nvidia 20211230170102.775 WARN cgfsng - cgroups/cgfsng.c:cgroup_tree_remove:483 - No such file or directory - Failed to destroy 23(lxc.payload.Gentoo-Nvidia)
lxc 20211230170102.803 ERROR af_unix - af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20211230170102.803 ERROR commands - commands.c:lxc_cmd_rsp_recv_fds:127 - Failed to receive file descriptors
Given the problem has occurred with every distro, the problem is obviously with my host somehow, but I'm unsure how to troubleshoot this. I have libnvidia-container and nvidia-container-toolkit installed from the guru overlay. What else should I be doing? I can't find any Gentoo documentation on this.
Make sure that your container isn't privileged; you can't currently combine security.privileged=true with nvidia.runtime=true. If that's good, then make sure that nvidia-smi works properly on your host system. If it does, then check whether nvidia-container-cli info also works properly.
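Those checks, run in order, might look something like this (a sketch; the container name Gentoo-Nvidia is taken from the log output above):

```shell
# Confirm the container is unprivileged (empty output or "false" is what you want)
lxc config get Gentoo-Nvidia security.privileged

# Verify the host driver stack is healthy
nvidia-smi

# Verify libnvidia-container can enumerate the GPUs
nvidia-container-cli info
```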
I tried starting the containers both privileged and unprivileged. Starting privileged removed the memory.oom.group fchownat WARN line from the logs, so I figured that was progress, but since privileged can't be combined with nvidia.runtime, I'll switch the containers back to unprivileged.
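For reference, switching a container back to unprivileged is just unsetting the key and starting it again (a sketch, using the container name from this thread):

```shell
lxc config unset Gentoo-Nvidia security.privileged
lxc start Gentoo-Nvidia
```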
nvidia-smi works fine with the following output:
Sat Jan 1 03:34:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.94 Driver Version: 470.94 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:09:00.0 On | N/A |
| 0% 57C P0 67W / 250W | 258MiB / 11175MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:0C:00.0 N/A | N/A |
| 30% 34C P8 N/A / N/A | 10MiB / 2000MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... Off | 00000000:0D:00.0 N/A | N/A |
| 30% 31C P8 N/A / N/A | 10MiB / 2000MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3048 G /usr/bin/X 28MiB |
| 0 N/A N/A 3160 G X 193MiB |
| 0 N/A N/A 3431 G picom 29MiB |
| 0 N/A N/A 3499 G kitty 3MiB |
+-----------------------------------------------------------------------------+
I'm trying to pass the second 690 in to the container; it's one of those old Nvidia GPUs that had two GPUs on the same card, meant to be used for SLI.
nvidia-container-cli also appears to be working fine, although, strangely, it has permission errors when run with sudo but runs fine as my user.
Ah yeah, we had someone else hit the same issue where it wouldn't run as root, and since LXD does run it as root, that was causing the problem.
You’ll have to do a bit of digging to figure out why root cannot interact with the GPU. Once you get nvidia-container-cli info to run as root, LXD should similarly work fine.
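Some starting points for that digging might be (a sketch; the device paths are the usual NVIDIA ones, not taken from this thread):

```shell
# Check ownership/permissions on the NVIDIA device nodes root would need
ls -l /dev/nvidia*

# Compare the failing and working invocations directly
sudo nvidia-container-cli info
nvidia-container-cli info

# Kernel messages sometimes show why a device open failed
sudo dmesg | tail -n 20
```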
Thank you for the help. Now that I know that, it should give me enough of a direction to look into to figure out a solution. I'll try to remember to add the solution here if/when I figure it out.