NVIDIA hook not working with openSUSE Leap 15.6

Hello,

I was trying to get the NVIDIA runtime to work with Incus 6.5 (built from source on openSUSE Leap 15.6). However, I ran into a few issues and couldn't figure out what's wrong:

First, Incus searches for the hooks installed by the LXC package, so an additional prerequisite for using the NVIDIA runtime on openSUSE is to install lxc. Otherwise you get "The NVIDIA LXC hook couldn't be found" when trying to set the nvidia.runtime=true config.
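
For reference, this is roughly what I did to satisfy that prerequisite (the package comes from the standard openSUSE repos, and the hook path is the one Incus looks for):

# install LXC so its NVIDIA mount hook becomes available
zypper install lxc
# confirm the hook Incus expects is present
ls -l /usr/share/lxc/hooks/nvidia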

After creating an instance (using the same OS image for the container), I added the GPU:

incus init images:opensuse/15.6/cloud --profile default test
incus config device add test nvidia-gpu gpu pci="01:00.0" gputype=physical
incus start test
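
For completeness, the resulting device entry can be inspected with the standard config command (not strictly needed, just a sanity check):

incus config device show test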

So far so good; I can see my NVIDIA GPU in the container:

# ls -lh /dev

crw-rw---- 1 root  root  195, 254 Sep 24 03:52 nvidia-modeset
crw-rw-rw- 1 root  root  510,   0 Sep 24 03:52 nvidia-uvm
crw-rw-rw- 1 root  root  510,   1 Sep 24 03:52 nvidia-uvm-tools
crw-rw---- 1 root  root  195,   0 Sep 24 03:52 nvidia0
crw-rw---- 1 root  root  195, 255 Sep 24 03:52 nvidiactl

However, when I try to enable the NVIDIA runtime, I start getting errors:

incus stop test
incus config set test nvidia.runtime=true
incus start test

Then I get the following error:

Error: Failed to run: /opt/go/bin/incusd forkstart test /var/lib/incus/containers /run/incus/test/lxc.conf: exit status 1
Try `incus info --show-log test` for more info

When I check the container logs, I see:

lxc test 20240924035525.945 INFO     utils - ../src/lxc/utils.c:run_script_argv:590 - Executing script "/usr/share/lxcfs/lxc.mount.hook" for container "test"
lxc test 20240924035525.955 INFO     utils - ../src/lxc/utils.c:run_script_argv:590 - Executing script "/usr/share/lxc/hooks/nvidia" for container "test"
lxc test 20240924035526.439 ERROR    utils - ../src/lxc/utils.c:run_buffer:571 - Script exited with status 1
lxc test 20240924035526.444 ERROR    conf - ../src/lxc/conf.c:lxc_setup:3940 - Failed to run mount hooks
lxc test 20240924035526.445 ERROR    start - ../src/lxc/start.c:do_start:1273 - Failed to setup container "test"
lxc test 20240924035526.449 ERROR    sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc test 20240924035526.118 WARN     network - ../src/lxc/network.c:lxc_delete_network_priv:3673 - Failed to rename interface with index 0 from "eth0" to its initial name "vethe882e371"
lxc test 20240924035526.120 ERROR    start - ../src/lxc/start.c:__lxc_start:2114 - Failed to spawn container "test"
lxc test 20240924035526.120 ERROR    lxccontainer - ../src/lxc/lxccontainer.c:wait_on_daemonized_start:838 - Received container state "ABORTING" instead of "RUNNING"
lxc test 20240924035526.120 WARN     start - ../src/lxc/start.c:lxc_abort:1038 - No such process - Failed to send SIGKILL via pidfd 17 for process 1325

Based on the logs, it appears that the NVIDIA hook exited with status 1. I modified the hook to write debug output to /tmp/nvidia.log, and I got a cryptic message without much information:

-- WARNING, the following logs are for debugging purposes only --

I0924 03:55:25.997466 4 nvc.c:393] initializing library context (version=1.16.1, build=4c2494f16573b585788a42e9c7bee76ecd48c73d)
I0924 03:55:25.997530 4 nvc.c:364] using root /
I0924 03:55:25.997544 4 nvc.c:365] using ldcache /etc/ld.so.cache
I0924 03:55:25.997557 4 nvc.c:366] using unprivileged user 0:0
I0924 03:55:25.997586 4 nvc.c:410] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0924 03:55:25.997669 4 nvc.c:412] dxcore initialization failed, continuing assuming a non-WSL environment
I0924 03:55:25.998002 21 rpc.c:71] starting driver rpc service
I0924 03:55:26.004017 4 rpc.c:135] driver rpc service terminated with signal 15
I0924 03:55:26.004082 4 nvc.c:452] shutting down library context

It seems that something goes wrong when the hook invokes nvidia-container-cli, but it's hard to tell from the error what the issue could be. I tried switching drivers, and so far the hook doesn't work with the closed-source drivers, the open-source drivers, different driver versions (production branch 550 or feature branch 560), different versions of the toolkit (1.11, 1.16), or different CUDA drivers.
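
For what it's worth, nvidia-container-cli can also be invoked by hand on the host to separate the CLI from the hook; a small sketch using its standard subcommands and global --debug option (which writes the same kind of log file as above):

# query the driver and devices the CLI sees, sending its debug log to stderr
nvidia-container-cli --debug=/dev/stderr info
# list the driver components the CLI would mount into a container
nvidia-container-cli list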

The nvidia-smi command works fine on the host:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   51C    P8             13W /  320W |      84MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

And running hwinfo --gfxcard shows that the NVIDIA drivers are loaded:

16: PCI 100.0: 0300 VGA compatible controller (VGA)             
  [Created at pci.386]
  Unique ID: VCu0.zw1o7zH5Op4
  Parent ID: vSkL.zN_Ib5B+QN9
  SysFS ID: /devices/pci0000:00/0000:00:01.0/0000:01:00.0
  SysFS BusID: 0000:01:00.0
  Hardware Class: graphics card
  Model: "nVidia VGA compatible controller"
  Vendor: pci 0x10de "nVidia Corporation"
  Device: pci 0x2704 
  SubVendor: pci 0x1043 "ASUSTeK Computer Inc."
  SubDevice: pci 0x8900 
  Revision: 0xa1
  Driver: "nvidia"
  Driver Modules: "nvidia"
  Memory Range: 0x41000000-0x41ffffff (rw,non-prefetchable)
  Memory Range: 0x6000000000-0x63ffffffff (ro,non-prefetchable)
  Memory Range: 0x6400000000-0x6401ffffff (ro,non-prefetchable)
  I/O Ports: 0x3000-0x3fff (rw)
  Memory Range: 0x42000000-0x4207ffff (ro,non-prefetchable,disabled)
  IRQ: 191 (136713 events)
  Module Alias: "pci:v000010DEd00002704sv00001043sd00008900bc03sc00i00"
  Driver Info #0:
    Driver Status: nouveau is not active
    Driver Activation Cmd: "modprobe nouveau"
  Driver Info #1:
    Driver Status: nvidia_drm is active
    Driver Activation Cmd: "modprobe nvidia_drm"
  Driver Info #2:
    Driver Status: nvidia is active
    Driver Activation Cmd: "modprobe nvidia"
  Config Status: cfg=no, avail=yes, need=no, active=unknown
  Attached to: #13 (PCI bridge)

The only additional relevant detail about my openSUSE install is that I selected SELinux instead of AppArmor in the installer.
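
In case SELinux is relevant here, a quick check is to look for denials around the failed start; a sketch, assuming the audit tools from the audit package are installed:

# confirm the SELinux mode
getenforce
# look for recent AVC denials that line up with the failed container start
ausearch -m avc -ts recent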

Thinking it could be a driver issue or something wrong with my card/install, I tried the NVIDIA Container Toolkit with Podman and also used CUDA directly by running Ollama. I had no trouble using my GPU for LLMs, either directly or through the Container Device Interface (CDI) with Podman (the CDI commands I used are sketched after the list below). This makes me wonder whether it's:

  • A bug in nvidia-container-toolkit that fails with Incus but succeeds with Podman
  • A misconfiguration or bug in how Incus invokes the NVIDIA hook on openSUSE
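
For completeness, the CDI path that worked with Podman looked roughly like this (the image and the SELinux label option are just what I happened to use; nvidia-ctk and the --device syntax come from the NVIDIA Container Toolkit docs):

# generate the CDI spec on the host
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# run a throwaway container against the GPUs via CDI
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable docker.io/library/ubuntu nvidia-smi -L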

Any ideas would be greatly appreciated.

Edit: I created an issue in the nvidia-container-toolkit project, but I still wonder if there's something to be considered on the Incus side: LXC hook doesn't seem to work on OpenSUSE Leap · Issue #711 · NVIDIA/nvidia-container-toolkit · GitHub
