The problem occurs after running "sudo lxc config set container-name nvidia.runtime=true"; "sudo lxc config device add container-name gpu gpu gputype=physical" on its own seems to work just fine.
After I set nvidia.runtime=true, every container I try (I have tried Arch, Debian, Ubuntu and OpenRC Gentoo containers, to get a bit of variety of distros) fails to start with the following in its logs:
Name: Gentoo-Nvidia
Location: none
Remote: unix://
Architecture: x86_64
Created: 2021/12/31 02:41 AEDT
Status: Stopped
Type: container
Profiles: default
Log:
lxc Gentoo-Nvidia 20211230170057.516 WARN cgfsng - cgroups/cgfsng.c:fchowmodat:1251 - No such file or directory - Failed to fchownat(44, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc Gentoo-Nvidia 20211230170057.670 ERROR conf - conf.c:run_buffer:321 - Script exited with status 1
lxc Gentoo-Nvidia 20211230170057.670 ERROR conf - conf.c:lxc_setup:4381 - Failed to run mount hooks
lxc Gentoo-Nvidia 20211230170057.670 ERROR start - start.c:do_start:1275 - Failed to setup container "Gentoo-Nvidia"
lxc Gentoo-Nvidia 20211230170057.670 ERROR sync - sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc Gentoo-Nvidia 20211230170057.673 WARN network - network.c:lxc_delete_network_priv:3617 - Failed to rename interface with index 0 from "eth0" to its initial name "vethdc487900"
lxc Gentoo-Nvidia 20211230170057.673 ERROR lxccontainer - lxccontainer.c:wait_on_daemonized_start:867 - Received container state "ABORTING" instead of "RUNNING"
lxc Gentoo-Nvidia 20211230170057.673 ERROR start - start.c:__lxc_start:2069 - Failed to spawn container "Gentoo-Nvidia"
lxc Gentoo-Nvidia 20211230170057.673 WARN start - start.c:lxc_abort:1039 - No such process - Failed to send SIGKILL via pidfd 45 for process 22526
lxc Gentoo-Nvidia 20211230170102.775 WARN cgfsng - cgroups/cgfsng.c:cgroup_tree_remove:483 - No such file or directory - Failed to destroy 23(lxc.payload.Gentoo-Nvidia)
lxc 20211230170102.803 ERROR af_unix - af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20211230170102.803 ERROR commands - commands.c:lxc_cmd_rsp_recv_fds:127 - Failed to receive file descriptors
Given the problem has occurred with every distro, the problem is obviously with my host somehow, but I'm unsure how to troubleshoot this. I have libnvidia-container and nvidia-container-toolkit installed from the guru overlay. What else should I be doing? I can't find any Gentoo documentation on this.
Make sure that your container isn't privileged; you can't currently combine security.privileged=true with nvidia.runtime=true. If that's good, then make sure that nvidia-smi works properly on your host system. If it does, then check whether nvidia-container-cli info also works properly.
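Those checks, run in order, might look something like this (a sketch; the container name Gentoo-Nvidia is taken from the log output above):

```shell
# Confirm the container is unprivileged (empty output or "false" is what you want)
lxc config get Gentoo-Nvidia security.privileged

# Verify the host driver stack is healthy
nvidia-smi

# Verify libnvidia-container can enumerate the GPUs
nvidia-container-cli info
```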
I tried starting the containers both privileged and unprivileged. Starting privileged removed the memory.oom.group fchownat WARN line from the logs, so I figured that was progress, but since privileged can't be combined with nvidia.runtime, I'll switch the containers back to unprivileged.
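For reference, switching a container back to unprivileged is just unsetting the key and starting it again (a sketch, using the container name from this thread):

```shell
lxc config unset Gentoo-Nvidia security.privileged
lxc start Gentoo-Nvidia
```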
nvidia-smi works fine with the following output:
Sat Jan 1 03:34:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.94 Driver Version: 470.94 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:09:00.0 On | N/A |
| 0% 57C P0 67W / 250W | 258MiB / 11175MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:0C:00.0 N/A | N/A |
| 30% 34C P8 N/A / N/A | 10MiB / 2000MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... Off | 00000000:0D:00.0 N/A | N/A |
| 30% 31C P8 N/A / N/A | 10MiB / 2000MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3048 G /usr/bin/X 28MiB |
| 0 N/A N/A 3160 G X 193MiB |
| 0 N/A N/A 3431 G picom 29MiB |
| 0 N/A N/A 3499 G kitty 3MiB |
+-----------------------------------------------------------------------------+
I'm trying to pass the second 690 in to the container; it's one of those old Nvidia GPUs that had two GPUs on the same card, meant to be used for SLI.
nvidia-container-cli also appears to be working fine, although, strangely, it has permission errors when run with sudo but runs fine as my user.
Ah yeah, we had someone else hit the same issue where it wouldn't run as root, and since LXD does run it as root, that was causing the problem.
You’ll have to do a bit of digging to figure out why root cannot interact with the GPU. Once you get nvidia-container-cli info to run as root, LXD should similarly work fine.
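Some starting points for that digging might be (a sketch; the device paths are the usual NVIDIA ones, not taken from this thread):

```shell
# Check ownership/permissions on the NVIDIA device nodes root would need
ls -l /dev/nvidia*

# Compare the failing and working invocations directly
sudo nvidia-container-cli info
nvidia-container-cli info

# Kernel messages sometimes show why a device open failed
sudo dmesg | tail -n 20
```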
Thank you for the help. Now that I know that, it should give me enough of a direction to look into to figure out a solution. I'll try to remember to add the solution here if/when I figure it out.