Hello,
I was trying to get the NVIDIA runtime to work with Incus 6.5 (built from source on openSUSE Leap 15.6). However, I ran into a few issues and couldn't figure out what's wrong:
First, Incus looks for the NVIDIA hook that ships with the LXC package, so an additional prerequisite for using the NVIDIA runtime on openSUSE is to install lxc. Otherwise you get "The NVIDIA LXC hook couldn't be found" when trying to set nvidia.runtime=true.
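For reference, the install plus a quick sanity check of the hook path (the package name is assumed to be the standard openSUSE lxc package; the hook path is the one that later shows up in the LXC logs):
zypper install lxc
ls -l /usr/share/lxc/hooks/nvidia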
After creating an instance (using the same OS image for the container), I added the GPU:
incus init images:opensuse/15.6/cloud --profile default test
incus config device add test nvidia-gpu gpu pci="01:00.0" gputype=physical
incus start test
So far so good; I can see my NVIDIA GPU in the container:
# ls -lh /dev
crw-rw---- 1 root root 195, 254 Sep 24 03:52 nvidia-modeset
crw-rw-rw- 1 root root 510, 0 Sep 24 03:52 nvidia-uvm
crw-rw-rw- 1 root root 510, 1 Sep 24 03:52 nvidia-uvm-tools
crw-rw---- 1 root root 195, 0 Sep 24 03:52 nvidia0
crw-rw---- 1 root root 195, 255 Sep 24 03:52 nvidiactl
However, when I try to enable the NVIDIA runtime, I start getting errors:
incus stop test
incus config set test nvidia.runtime=true
incus start test
Then I get the following error:
Error: Failed to run: /opt/go/bin/incusd forkstart test /var/lib/incus/containers /run/incus/test/lxc.conf: exit status 1
Try `incus info --show-log test` for more info
When I check the logs for the container, I get:
lxc test 20240924035525.945 INFO utils - ../src/lxc/utils.c:run_script_argv:590 - Executing script "/usr/share/lxcfs/lxc.mount.hook" for container "test"
lxc test 20240924035525.955 INFO utils - ../src/lxc/utils.c:run_script_argv:590 - Executing script "/usr/share/lxc/hooks/nvidia" for container "test"
lxc test 20240924035526.439 ERROR utils - ../src/lxc/utils.c:run_buffer:571 - Script exited with status 1
lxc test 20240924035526.444 ERROR conf - ../src/lxc/conf.c:lxc_setup:3940 - Failed to run mount hooks
lxc test 20240924035526.445 ERROR start - ../src/lxc/start.c:do_start:1273 - Failed to setup container "test"
lxc test 20240924035526.449 ERROR sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc test 20240924035526.118 WARN network - ../src/lxc/network.c:lxc_delete_network_priv:3673 - Failed to rename interface with index 0 from "eth0" to its initial name "vethe882e371"
lxc test 20240924035526.120 ERROR start - ../src/lxc/start.c:__lxc_start:2114 - Failed to spawn container "test"
lxc test 20240924035526.120 ERROR lxccontainer - ../src/lxc/lxccontainer.c:wait_on_daemonized_start:838 - Received container state "ABORTING" instead of "RUNNING"
lxc test 20240924035526.120 WARN start - ../src/lxc/start.c:lxc_abort:1038 - No such process - Failed to send SIGKILL via pidfd 17 for process 1325
Based on the logs, it appears that the NVIDIA hook exited with status 1. I modified the hook to write debug output to /tmp/nvidia.log and got a rather cryptic message without much information:
-- WARNING, the following logs are for debugging purposes only --
I0924 03:55:25.997466 4 nvc.c:393] initializing library context (version=1.16.1, build=4c2494f16573b585788a42e9c7bee76ecd48c73d)
I0924 03:55:25.997530 4 nvc.c:364] using root /
I0924 03:55:25.997544 4 nvc.c:365] using ldcache /etc/ld.so.cache
I0924 03:55:25.997557 4 nvc.c:366] using unprivileged user 0:0
I0924 03:55:25.997586 4 nvc.c:410] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0924 03:55:25.997669 4 nvc.c:412] dxcore initialization failed, continuing assuming a non-WSL environment
I0924 03:55:25.998002 21 rpc.c:71] starting driver rpc service
I0924 03:55:26.004017 4 rpc.c:135] driver rpc service terminated with signal 15
I0924 03:55:26.004082 4 nvc.c:452] shutting down library context
It seems that something goes wrong when the hook invokes nvidia-container-cli, but it's hard to tell from the error what the issue could be. I tried switching drivers, and so far the hook doesn't work with closed-source drivers, open-source drivers, different driver versions (production branch 550 or feature branch 560), different versions of the toolkit (1.11, 1.16), or different CUDA drivers.
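If it helps to compare, nvidia-container-cli can also be run directly on the host with debug logging, which should exercise roughly the same code path as the hook (the exact flags the hook passes may differ, so this is only an approximation):
nvidia-container-cli --load-kmods --debug=/tmp/nvc-host.log info
nvidia-container-cli list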
The nvidia-smi command works fine on the host:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02 Driver Version: 550.107.02 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4080 On | 00000000:01:00.0 Off | N/A |
| 0% 51C P8 13W / 320W | 84MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
And running hwinfo --gfxcard shows that the NVIDIA drivers are loaded:
16: PCI 100.0: 0300 VGA compatible controller (VGA)
[Created at pci.386]
Unique ID: VCu0.zw1o7zH5Op4
Parent ID: vSkL.zN_Ib5B+QN9
SysFS ID: /devices/pci0000:00/0000:00:01.0/0000:01:00.0
SysFS BusID: 0000:01:00.0
Hardware Class: graphics card
Model: "nVidia VGA compatible controller"
Vendor: pci 0x10de "nVidia Corporation"
Device: pci 0x2704
SubVendor: pci 0x1043 "ASUSTeK Computer Inc."
SubDevice: pci 0x8900
Revision: 0xa1
Driver: "nvidia"
Driver Modules: "nvidia"
Memory Range: 0x41000000-0x41ffffff (rw,non-prefetchable)
Memory Range: 0x6000000000-0x63ffffffff (ro,non-prefetchable)
Memory Range: 0x6400000000-0x6401ffffff (ro,non-prefetchable)
I/O Ports: 0x3000-0x3fff (rw)
Memory Range: 0x42000000-0x4207ffff (ro,non-prefetchable,disabled)
IRQ: 191 (136713 events)
Module Alias: "pci:v000010DEd00002704sv00001043sd00008900bc03sc00i00"
Driver Info #0:
Driver Status: nouveau is not active
Driver Activation Cmd: "modprobe nouveau"
Driver Info #1:
Driver Status: nvidia_drm is active
Driver Activation Cmd: "modprobe nvidia_drm"
Driver Info #2:
Driver Status: nvidia is active
Driver Activation Cmd: "modprobe nvidia"
Config Status: cfg=no, avail=yes, need=no, active=unknown
Attached to: #13 (PCI bridge)
The only additional relevant detail about my openSUSE install is that I selected SELinux instead of AppArmor in the installer.
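If SELinux turns out to be relevant, these would be the standard commands to check for denials around the failed start (setenforce 0 only as a temporary test before retrying incus start test):
getenforce
ausearch -m avc -ts recent
setenforce 0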
Thinking it could be a driver issue or something wrong with my card/install, I tried using the NVIDIA Container Toolkit with Podman, and also used CUDA directly by running ollama. I had no trouble using my GPU for LLMs either directly or through the Container Device Interface (CDI) with Podman (see the CDI sketch below). This makes me wonder whether it's:
- A bug in the nvidia-container-toolkit that fails with Incus but succeeds with Podman
- A misconfiguration/bug in how Incus invokes the NVIDIA hook on openSUSE
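Since CDI works for me with Podman, I also wonder whether Incus's CDI-based GPU passthrough would sidestep the LXC hook entirely. If I read the gpu device docs correctly, it would look something like the following, assuming the card is exposed under the usual nvidia.com/gpu CDI name (untested sketch on my side):
incus config device remove test nvidia-gpu
incus config device add test nvidia-gpu gpu gputype=physical id=nvidia.com/gpu=0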
Any ideas would be greatly appreciated.
Edit: I created an issue in the nvidia-container-toolkit project, but I still wonder whether there's something to be considered on the Incus side: LXC hook doesn't seem to work on openSUSE Leap · Issue #711 · NVIDIA/nvidia-container-toolkit · GitHub