nvidia.runtime error on NixOS host

Hello,
I am trying to use an NVIDIA GPU in an incus container with NixOS as the host. If I try to set nvidia.runtime: true I get an error:

Config parsing error: Initialize LXC: The NVIDIA LXC hook couldn't be found
Press enter to open the editor again or ctrl+c to abort change

I have enabled hardware.nvidia-container-toolkit.enable in NixOS.

My NVIDIA Nix config is as follows:

{ config, pkgs, ... }:
{
  # Nvidia specific
  nixpkgs.config.allowUnfree = true;
  environment.systemPackages = with pkgs; [
    # cudaPackages_12.cudatoolkit
  ];
  # Some programs need SUID wrappers, can be configured further or are
  # started in user sessions.

  # REGION NVIDIA / CUDA

  # Enable OpenGL
  hardware.graphics = {
    enable = true;
    enable32Bit = true;
  };

  hardware.nvidia-container-toolkit.enable = true;


  # Load nvidia driver for Xorg and Wayland
  services.xserver.videoDrivers = [ "nvidia" ];

  # see https://nixos.wiki/wiki/Nvidia#CUDA_and_using_your_GPU_for_compute
  hardware.nvidia = {
    # Modesetting is required.
    modesetting.enable = true;

    # Nvidia power management. Experimental, and can cause sleep/suspend to fail.
    powerManagement.enable = true;
    powerManagement.finegrained = false;

    open = false;

    # Enable the Nvidia settings menu,
    # accessible via `nvidia-settings`.
    nvidiaSettings = true;

    package = config.boot.kernelPackages.nvidiaPackages.production;
  };
  # ENDREGION
}

The configuration.nix has the following for incus,

  virtualisation.incus.package = pkgs.incus;
  virtualisation.incus.enable = true;
  networking.nftables.enable = true;

  systemd.services.incus.path = [ pkgs.libnvidia-container ];

Any idea how to fix this?

@adamcstephens may know

Is the guest NixOS, or what distro?

The guest is Ubuntu 24.04 LTS and the host is NixOS 24.11.

Looks like we may be missing something for supporting this on non-NixOS guests. This seems to avoid the initial error. Can you try it and report back if everything works as expected? I’ll add to the incus module if so.

systemd.services.incus.environment.INCUS_LXC_HOOK = "${config.virtualisation.incus.lxcPackage}/share/lxc/hooks";
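After a rebuild the incus service should pick up the new environment (restart it manually if it doesn't). Roughly, and from memory:

$ sudo nixos-rebuild switch
$ systemctl show incus.service --property=Environment | grep INCUS_LXC_HOOK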

Adding libnvidia-container to the path is unnecessary as it is already in the service path.

I’d not seen hardware.nvidia-container-toolkit before. I’d be curious to know if this is required for the incus nvidia integration to work, or not. Mind trying both?

The hook change got me past the error. Can that be made the default?

The following failed though,

$ incus launch images:ubuntu/24.04 c1
Launching c1

$ incus config device add c1 gpu gpu id=0
Device gpu added to c1

$ incus config set c1 nvidia.driver.capabilities=all nvidia.runtime="true"

$ incus exec c1 -- nvidia-smi
Error: Command not found

I installed nvidia-utils-550 inside the container and then nvidia-smi started to work.
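For reference, that was roughly the following from the host (adjust the package version to whatever matches the host driver):

$ incus exec c1 -- apt-get update
$ incus exec c1 -- apt-get install -y nvidia-utils-550
$ incus exec c1 -- nvidia-smi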

My plan is to run Docker inside the incus container. For Docker to pick up the GPU I had to install nvidia-container-toolkit following this.

In addition, the following things were also required:

  1. fix-gpu-passthrough.service
# cat /etc/systemd/system/fix-gpu-passthrough.service 
[Unit]
Description=Creates Symlink required for LXC/Nvidia to Docker passthrough
Before=docker.service

[Service]
User=root
Group=root
ExecStart=/bin/bash -c 'mkdir -p /proc/driver/nvidia/gpus && ln -s /dev/nvidia0 /proc/driver/nvidia/gpus/0000:02:00.0'
Type=oneshot

[Install]
WantedBy=multi-user.target
  2. Fix /etc/nvidia-container-runtime/config.toml
# cat /etc/nvidia-container-runtime/config.toml
disable-require = false

[nvidia-container-cli]
environment = []
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
no-cgroups = true

[nvidia-container-runtime]
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

With these changes I am able to use the GPU in a Docker container inside an incus container on a NixOS host.
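For anyone following along, a rough way to verify from inside the incus container; the CUDA image tag is only an example, so substitute one that exists on Docker Hub:

root@c1:~# nvidia-ctk runtime configure --runtime=docker
root@c1:~# systemctl restart docker
root@c1:~# docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi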

Yeah, I’ll get the hook env var added as default.

The other two files are inside the incus container?

Yes, steps 1 and 2 were inside the container.

What surprised me was that I had to install nvidia-utils-550 inside the container to get nvidia-smi working. I thought nvidia.runtime would expose it automatically, but that wasn't the case here.

@adamcstephens I came across this snippet on the internet for using NVIDIA with LXD. Can it be adapted for incus?

let
  libnvidia-container = pkgs.callPackage "${inputs.nixpkgs}/pkgs/by-name/li/libnvidia-container/package.nix" {};
in {
  systemd.services.lxd = {
    environment = let
      path =
        pkgs.lib.makeBinPath
        (with pkgs; [which libnvidia-container util-linux]);
      hook =
        (
          pkgs.srcOnly {
            name = "lxc-hooks";
            src = "${pkgs.lxc}/share/lxc/hooks";
            nativeBuildInputs = [pkgs.makeWrapper];
          }
        )
        .overrideAttrs (
          oldAttrs: {
            installPhase = ''
              ${oldAttrs.installPhase}
              wrapProgram $out/nvidia --prefix PATH : ${path}
            '';
          }
        );
    in {
      LXD_LXC_HOOK = "${hook}";
    };
  };

  virtualisation.lxd = {
    enable = true;
    ui.enable = true;
    # This turns on a few sysctl settings that the LXD documentation recommends
    # for running in production.
    recommendedSysctlSettings = true;

    package = pkgs.lxd-lts.override {
      lxd-unwrapped-lts = pkgs.lxd-unwrapped-lts.overrideAttrs (
        oldAttrs: {
          postPatch = ''
            ${oldAttrs.postPatch}
            substituteInPlace lxd/instance/drivers/driver_lxc.go \
              --replace "nvidia-container-cli" "${libnvidia-container}/bin/nvidia-container-cli"
          '';
        }
      );
    };
  };
}

I’m not seeing how that will improve anything. We’re already putting libnvidia-container in the path for incus.

I thought you got it working, is that not the case?

Yes, I did get it working.

However, I had to install nvidia-utils-550 inside the container to get nvidia-smi. This video from @stgraber shows that with just nvidia.runtime set we should have nvidia-smi inside the container. So I guess incus on NixOS is not setting it up correctly?

Not sure if this is the reason.

I’m not sure where stgraber got nvidia-smi from, but using the NixOS host’s nvidia-smi isn’t going to work as desired even if you copied it:

✗ ldd $(which nvidia-smi)
	linux-vdso.so.1 (0x00007f394ddcc000)
	libpthread.so.0 => /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/libpthread.so.0 (0x00007f394ddc1000)
	libm.so.6 => /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/libm.so.6 (0x00007f394dcda000)
	libdl.so.2 => /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/libdl.so.2 (0x00007f394dcd5000)
	libc.so.6 => /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/libc.so.6 (0x00007f394dada000)
	librt.so.1 => /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/librt.so.1 (0x00007f394dad5000)
	/nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/ld-linux-x86-64.so.2 => /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib64/ld-linux-x86-64.so.2 (0x00007f394ddce000)

@stgraber How does nvidia-smi get exposed in the container?

It’s bind-mounted into place by the nvidia LXC hook.
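(The hook relies on nvidia-container-cli, which is why libnvidia-container needs to be on the incus service's PATH. On the host, something like the following should show roughly which binaries and libraries it can mount; exact flags may vary by libnvidia-container version.)

$ nix-shell -p libnvidia-container --run "nvidia-container-cli list --binaries --libraries"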

@adamcstephens Based on what @stgraber said above, how can we make the NVIDIA LXC hook expose nvidia-smi on NixOS?

I copied nvidia-smi from the host to the container but it doesn't work even though the dependencies seem to be satisfied.

$ incus file push $(which nvidia-smi) c1/root/
$ incus exec c1 -- ldd /root/nvidia-smi
        linux-vdso.so.1 (0x00007fc327ffe000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc327ff0000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc327f07000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc327f02000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc327cee000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fc327ce9000)
        /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2 (0x00007fc328000000)

$ incus exec c1 -- /root/nvidia-smi
Error: Command not found

Ahh, I only looked at incus and didn’t see anything about nvidia-smi. We can try patching or wrapping the LXC hook.
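A rough, untested sketch of what wrapping could look like, borrowing from the LXD snippet above; the runCommand name, the tool list, and whether this actually gets nvidia-smi mounted are all assumptions:

{ config, pkgs, ... }:
let
  # Copy the hooks shipped with incus' LXC package and wrap the nvidia hook so
  # the tools it shells out to are on its PATH. (Sketch only, untested.)
  wrappedHooks = pkgs.runCommand "incus-lxc-hooks" {
    nativeBuildInputs = [ pkgs.makeWrapper ];
  } ''
    cp -r ${config.virtualisation.incus.lxcPackage}/share/lxc/hooks $out
    chmod -R u+w $out
    # The LXD snippet only adds which/libnvidia-container/util-linux; for
    # nvidia-smi itself the driver's bin dir would probably also be needed
    # (untested assumption).
    wrapProgram $out/nvidia --prefix PATH : \
      ${pkgs.lib.makeBinPath (with pkgs; [ which libnvidia-container util-linux ])}
  '';
in {
  systemd.services.incus.environment.INCUS_LXC_HOOK = "${wrappedHooks}";
}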

Command not found can have multiple meanings, one of which is missing dynamic libraries. If you get a shell with exec bash, do you get the same error when trying to run the manually copied nvidia-smi?

This is what I get:

$ incus exec c1 bash
root@c1:~# /root/nvidia-smi 
bash: /root/nvidia-smi: cannot execute: required file not found

root@c1:~# ldd /root/nvidia-smi 
        linux-vdso.so.1 (0x00007f029438c000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f029437e000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f0294295000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f0294290000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f029407c000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f0294077000)
        /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2 (0x00007f029438e000)

root@c1:~# file /root/nvidia-smi 
/root/nvidia-smi: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=ee25fbd45a994e3ca42bf4186574808865915235, stripped

Then I strongly suspect that even if we fix bind-mounting nvidia-smi into the container, it still won't work. Could be a glibc incompatibility or something; the downside of dynamically linked libraries.

Is this unique to NixOS? I had no such issue with Arch Linux as the host and Ubuntu as the container.

As you can see from your ldd output, where it's looking for /nix/store..., we do linking differently on NixOS, yes. I'll still try to fix the hook so it can find nvidia-smi anyway, but I'm not hopeful it'll do what you want.
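If you want to experiment in the meantime, something like this inside the container would at least point the binary at Ubuntu's ELF interpreter (patchelf is in the Ubuntu repos); it may still fail with glibc symbol-version errors, since the binary was built against NixOS' newer glibc:

root@c1:~# apt-get install -y patchelf
root@c1:~# patchelf --set-interpreter /lib64/ld-linux-x86-64.so.2 /root/nvidia-smi
root@c1:~# /root/nvidia-smi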

But this is only one minor part of the integration. Does the GPU work as intended, besides the missing nvidia-smi CLI tool?