nvidia.runtime error on NixOS host

Great. The GPU works fine after following the steps mentioned previously.


libnvidia-container hardcodes an expectation that nvidia-smi is in /usr/bin, which doesn’t hold on NixOS. There have been some tweaks to our libnvidia-container package recently, but I’m not sure yet whether they’ll fix the problem of libnvidia-container failing to find the binaries. I’ll check back in when I can confirm either way, but the changes are unlikely to be backported to stable 24.11.

@stgraber one thing I noticed when debugging this is that the nvidia hook was failing to create /var/lib/incus/storage-pools/default/containers/noble-molly/hook. When I created it and made the permissions wide open, I noticed that the hook runs as the container’s root UID and not the host’s. This prevents libnvidia-container from writing its log file.

Have you seen this before?

I suspect that’s normal; the hook was written for LXC and so expects a path like /var/lib/lxc/NAME where it has write access.

Under Incus we’ve tightened permissions a fair bit more, so that’s causing this issue.
Is that fatal, though, or does it just prevent logging?

It only prevents logging from what I’ve seen. The logging was helpful for some of the troubleshooting I’m doing, but I can just mkdir/chown the directory while debugging.
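For reference, a minimal sketch of that debugging workaround (the pool path is the one from the error above; the wide-open mode is just what I used while troubleshooting, not a recommended permanent setting):

```shell
# prepare_hook_dir BASE: create BASE/hook with wide-open permissions so
# the nvidia hook, which runs as the container's root UID, can write its
# log file there.
prepare_hook_dir() {
  mkdir -p "$1/hook"
  chmod 0777 "$1/hook"
}

# On the affected host this would be (path from the error above, needs root):
#   prepare_hook_dir /var/lib/incus/storage-pools/default/containers/noble-molly
```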

@stgraber @adamcstephens I am trying to set up a container on another host. If I specify nvidia.runtime: "true", the container doesn’t start.

$ incus start dockerblr
Error: Failed to run: /nix/store/2ypj6mwrs14wzwf18avqx0nm5n8r41vg-incus-6.11.0/bin/incusd forkstart dockerblr /var/lib/incus/containers /run/incus/dockerblr/lxc.conf: exit status 1
Try `incus info --show-log dockerblr` for more info

$ incus info --show-log dockerblr
Error: Invalid PID '�'

My Incus is set up as follows:

  #incus
  virtualisation.incus.package = pkgs.incus;
  virtualisation.incus.enable = true;
  systemd.services.incus.environment.INCUS_LXC_HOOK =
    "${config.virtualisation.incus.lxcPackage}/share/lxc/hooks";

Once I remove nvidia.runtime, the container starts up fine.

Sorry, I don’t have the bandwidth to look into this further right now. I don’t use this feature, and it’s difficult or impossible for us to write NixOS tests for it given the hardware requirement.

I’d invite you to file an issue on the nixpkgs repo to track the problem, preferably with any more detail you can provide. Unfortunately, unless you’re willing/able to do the deep investigation yourself, I suspect little progress will be made.

@adamcstephens This issue is back in unstable.

I defined the following container:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu noble amd64 (20250829_07:42)
  image.os: Ubuntu
  image.release: noble
  image.requirements.cgroup: v2
  image.serial: "20250829_07:42"
  image.type: squashfs
  image.variant: default
  nvidia.driver.capabilities: all
  nvidia.runtime: "true"
  volatile.base_image: 9e6510296ae2a03601e0eeffaaab0bf990ff13cb15fb328d216fb87b4910936c
  volatile.cloud-init.instance-id: b25968d3-af72-4242-8fa9-f0df01fec03e
  volatile.eth0.hwaddr: 10:66:6a:81:c6:73
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: 062a171d-a647-4d6e-b81c-28c099a8f506
  volatile.uuid.generation: 062a171d-a647-4d6e-b81c-28c099a8f506
devices:
  gpu:
    id: "0"
    type: gpu
ephemeral: false
profiles:
- default
stateful: false

$ incus start c2
Error: Failed to run: /nix/store/vjn8j2smqpib6g6bfdyvk0dvcqqsc2al-incus-6.15.0/bin/incusd forkstart c2 /var/lib/incus/containers /run/incus/c2/lxc.conf: exit status 1
Try `incus info --show-log c2` for more info
$ incus info --show-log c2
Name: c2
Description: 
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2025/08/30 05:38 IST
Last Used: 2025/08/30 05:38 IST

Log:

lxc c2 20250830000831.469 ERROR    utils - ../src/lxc/utils.c:run_buffer:571 - Script exited with status 1
lxc c2 20250830000831.469 ERROR    conf - ../src/lxc/conf.c:lxc_setup:3933 - Failed to run mount hooks
lxc c2 20250830000831.469 ERROR    start - ../src/lxc/start.c:do_start:1273 - Failed to setup container "c2"
lxc c2 20250830000831.469 ERROR    sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc c2 20250830000831.478 WARN     network - ../src/lxc/network.c:lxc_delete_network_priv:3674 - Failed to rename interface with index 0 from "eth0" to its initial name "veth350a2ff1"
lxc c2 20250830000831.478 ERROR    lxccontainer - ../src/lxc/lxccontainer.c:wait_on_daemonized_start:832 - Received container state "ABORTING" instead of "RUNNING"
lxc c2 20250830000831.478 ERROR    start - ../src/lxc/start.c:__lxc_start:2119 - Failed to spawn container "c2"
lxc c2 20250830000831.478 WARN     start - ../src/lxc/start.c:lxc_abort:1037 - No such process - Failed to send SIGKILL via pidfd 17 for process 553206
lxc 20250830000831.577 ERROR    af_unix - ../src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20250830000831.577 ERROR    commands - ../src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_init_pid"


Incus is configured as follows:

  virtualisation.incus.package = pkgs.incus;
  virtualisation.incus.enable = true;
  systemd.services.incus.environment.INCUS_LXC_HOOK =
    "${config.virtualisation.incus.lxcPackage}/share/lxc/hooks";

I noticed that generation 649 works with nvidia but 650 does not. Both generations have Incus 6.15:

Generation  Build-date           NixOS version           Kernel   Configuration Revision  Specialisation  Current
650         2025-08-30 05:11:29  25.11.20250828.dfb2f12  6.12.43  Unknown                 []              True
649         2025-08-29 04:05:10  25.11.20250819.2007595  6.12.42  Unknown                 []              False

lxc may have been upgraded to 6.0.5 in that timeframe, but you’ll need to look at the commits and see if this is in that range: lxc: 6.0.4 -> 6.0.5 · NixOS/nixpkgs@db7c9dc · GitHub

I removed the patch because it no longer applied, assuming it had landed in 6.0.5. If it’s not in 6.0.5, it will need to be rebased; if it is, then perhaps there is another regression. Pull requests are accepted, but I don’t have the bandwidth to continually fix these nvidia runtime issues.

I looked at the lxc package version on the generation where nvidia IS working, and it is 6.0.5:

$ nix-store --query --requisites /run/current-system | cut -d- -f2- | sort -u | grep lxc
lxc-6.0.5
lxcfs-6.0.5
unit-lxcfs.service

So maybe it is another regression.

Manually linking the binaries into /usr/bin makes it work… I’m not really sure where the issue should be opened or addressed…
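For anyone else hitting this, here is a sketch of that manual workaround, assuming the driver tools are already on $PATH. The exact set of binaries libnvidia-container needs may vary; nvidia-smi is the one named earlier in the thread, and nvidia-container-cli is an example, so adjust the list for your setup:

```shell
# link_into DEST TOOL...: symlink each TOOL found on $PATH into DEST.
# Used here to satisfy libnvidia-container's hardcoded /usr/bin lookup,
# which fails on NixOS because the binaries live in the Nix store.
link_into() {
  dest="$1"; shift
  mkdir -p "$dest"
  for tool in "$@"; do
    # Skip tools that aren't installed rather than creating dead links.
    src="$(command -v "$tool" 2>/dev/null)" || continue
    ln -sf "$src" "$dest/$tool"
  done
}

# On the affected host (needs root):
#   link_into /usr/bin nvidia-smi nvidia-container-cli
```

Note the links will point into a specific /nix/store path, so they may need refreshing after a system upgrade.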

This seems related: I have issues starting up a container on Void Linux, and the error also appears to come from the nvidia hook. After a normal restart it’s fine, but booting from cold it can’t start the container.

Here are the startup logs:

lxc llama 20260112134425.985 INFO     lxccontainer - ../src/lxc/lxccontainer.c:do_lxcapi_start:959 - Set process title to [lxc monitor] /var/lib/incus/containers llama
lxc llama 20260112134425.986 INFO     start - ../src/lxc/start.c:lxc_check_inherited:326 - Closed inherited fd 4
lxc llama 20260112134425.986 INFO     start - ../src/lxc/start.c:lxc_check_inherited:326 - Closed inherited fd 5
lxc llama 20260112134425.986 INFO     start - ../src/lxc/start.c:lxc_check_inherited:326 - Closed inherited fd 9
lxc llama 20260112134425.986 INFO     lsm - ../src/lxc/lsm/lsm.c:lsm_init_static:38 - Initialized LSM security driver nop
lxc llama 20260112134425.986 INFO     utils - ../src/lxc/utils.c:run_script_argv:590 - Executing script "/proc/1563/exe callhook /var/lib/incus "default" "llama" start" for container "llama"
lxc llama 20260112134426.125 INFO     cgfsng - ../src/lxc/cgroups/cgfsng.c:unpriv_systemd_create_scope:1498 - Running privileged, not using a systemd unit
lxc llama 20260112134426.127 INFO     seccomp - ../src/lxc/seccomp.c:parse_config_v2:815 - Processing "[all]"
lxc llama 20260112134426.127 INFO     seccomp - ../src/lxc/seccomp.c:parse_config_v2:815 - Processing "reject_force_umount  # comment this to allow umount -f;  not recommended"
lxc llama 20260112134426.127 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:532 - Set seccomp rule to reject force umounts
lxc llama 20260112134426.127 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:532 - Set seccomp rule to reject force umounts
lxc llama 20260112134426.127 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:532 - Set seccomp rule to reject force umounts
lxc llama 20260112134426.127 INFO     seccomp - ../src/lxc/seccomp.c:parse_config_v2:815 - Processing "[all]"
lxc llama 20260112134426.127 INFO     seccomp - ../src/lxc/seccomp.c:parse_config_v2:815 - Processing "kexec_load errno 38"
lxc llama 20260112134426.127 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding native rule for syscall[246:kexec_load] action[327718:errno] arch[0]
lxc llama 20260112134426.127 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding compat rule for syscall[246:kexec_load] action[327718:errno] arch[1073741827]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding compat rule for syscall[246:kexec_load] action[327718:errno] arch[1073741886]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:parse_config_v2:815 - Processing "open_by_handle_at errno 38"
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding native rule for syscall[304:open_by_handle_at] action[327718:errno] arch[0]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding compat rule for syscall[304:open_by_handle_at] action[327718:errno] arch[1073741827]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding compat rule for syscall[304:open_by_handle_at] action[327718:errno] arch[1073741886]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:parse_config_v2:815 - Processing "init_module errno 38"
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding native rule for syscall[175:init_module] action[327718:errno] arch[0]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding compat rule for syscall[175:init_module] action[327718:errno] arch[1073741827]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding compat rule for syscall[175:init_module] action[327718:errno] arch[1073741886]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:parse_config_v2:815 - Processing "finit_module errno 38"
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding native rule for syscall[313:finit_module] action[327718:errno] arch[0]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding compat rule for syscall[313:finit_module] action[327718:errno] arch[1073741827]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding compat rule for syscall[313:finit_module] action[327718:errno] arch[1073741886]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:parse_config_v2:815 - Processing "delete_module errno 38"
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding native rule for syscall[176:delete_module] action[327718:errno] arch[0]
lxc llama 20260112134426.128 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding compat rule for syscall[176:delete_module] action[327718:errno] arch[1073741827]
lxc llama 20260112134426.129 INFO     seccomp - ../src/lxc/seccomp.c:do_resolve_add_rule:572 - Adding compat rule for syscall[176:delete_module] action[327718:errno] arch[1073741886]
lxc llama 20260112134426.129 INFO     seccomp - ../src/lxc/seccomp.c:parse_config_v2:1036 - Merging compat seccomp contexts into main context
lxc llama 20260112134426.129 INFO     start - ../src/lxc/start.c:lxc_init:882 - Container "llama" is initialized
lxc llama 20260112134426.130 INFO     cgfsng - ../src/lxc/cgroups/cgfsng.c:cgfsng_monitor_create:1669 - The monitor process uses "lxc.monitor.llama" as cgroup
lxc llama 20260112134426.325 INFO     cgfsng - ../src/lxc/cgroups/cgfsng.c:cgfsng_payload_create:1777 - The container process uses "lxc.payload.llama" as inner and "lxc.payload.llama" as limit cgroup
lxc llama 20260112134426.443 INFO     start - ../src/lxc/start.c:lxc_spawn:1769 - Cloned CLONE_NEWUSER
lxc llama 20260112134426.443 INFO     start - ../src/lxc/start.c:lxc_spawn:1769 - Cloned CLONE_NEWNS
lxc llama 20260112134426.443 INFO     start - ../src/lxc/start.c:lxc_spawn:1769 - Cloned CLONE_NEWPID
lxc llama 20260112134426.443 INFO     start - ../src/lxc/start.c:lxc_spawn:1769 - Cloned CLONE_NEWUTS
lxc llama 20260112134426.443 INFO     start - ../src/lxc/start.c:lxc_spawn:1769 - Cloned CLONE_NEWIPC
lxc llama 20260112134426.443 INFO     start - ../src/lxc/start.c:lxc_spawn:1769 - Cloned CLONE_NEWCGROUP
lxc llama 20260112134426.499 INFO     idmap_utils - ../src/lxc/idmap_utils.c:lxc_map_ids:176 - Caller maps host root. Writing mapping directly
lxc llama 20260112134426.500 NOTICE   utils - ../src/lxc/utils.c:lxc_drop_groups:1477 - Dropped supplimentary groups
lxc llama 20260112134426.512 INFO     start - ../src/lxc/start.c:do_start:1105 - Unshared CLONE_NEWNET
lxc llama 20260112134426.512 NOTICE   utils - ../src/lxc/utils.c:lxc_drop_groups:1477 - Dropped supplimentary groups
lxc llama 20260112134426.512 NOTICE   utils - ../src/lxc/utils.c:lxc_switch_uid_gid:1453 - Switched to gid 0
lxc llama 20260112134426.512 NOTICE   utils - ../src/lxc/utils.c:lxc_switch_uid_gid:1462 - Switched to uid 0
lxc llama 20260112134426.737 INFO     conf - ../src/lxc/conf.c:setup_utsname:683 - Set hostname to "llama"
lxc llama 20260112134426.740 INFO     network - ../src/lxc/network.c:lxc_setup_network_in_child_namespaces:4064 - Finished setting up network devices with caller assigned names
lxc llama 20260112134426.741 INFO     conf - ../src/lxc/conf.c:mount_autodev:1027 - Preparing "/dev"
lxc llama 20260112134426.742 INFO     conf - ../src/lxc/conf.c:mount_autodev:1088 - Prepared "/dev"
lxc llama 20260112134426.173 INFO     utils - ../src/lxc/utils.c:run_script_argv:590 - Executing script "/usr/share/lxc/hooks/nvidia" for container "llama"
lxc llama 20260112134426.196 ERROR    utils - ../src/lxc/utils.c:run_buffer:571 - Script exited with status 1
lxc llama 20260112134426.196 ERROR    conf - ../src/lxc/conf.c:lxc_setup:3944 - Failed to run mount hooks
lxc llama 20260112134426.196 ERROR    start - ../src/lxc/start.c:do_start:1273 - Failed to setup container "llama"
lxc llama 20260112134426.196 ERROR    sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc llama 20260112134426.199 WARN     network - ../src/lxc/network.c:lxc_delete_network_priv:3674 - Failed to rename interface with index 0 from "eth0" to its initial name "veth0223f602"
lxc llama 20260112134426.199 ERROR    lxccontainer - ../src/lxc/lxccontainer.c:wait_on_daemonized_start:837 - Received container state "ABORTING" instead of "RUNNING"
lxc llama 20260112134426.199 ERROR    start - ../src/lxc/start.c:__lxc_start:2114 - Failed to spawn container "llama"
lxc llama 20260112134426.199 WARN     start - ../src/lxc/start.c:lxc_abort:1037 - No such process - Failed to send SIGKILL via pidfd 17 for process 2203
lxc llama 20260112134426.199 INFO     utils - ../src/lxc/utils.c:run_script_argv:590 - Executing script "/usr/libexec/incus/incusd callhook /var/lib/incus "default" "llama" stopns" for container "llama"
lxc 20260112134426.269 ERROR    af_unix - ../src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20260112134426.269 ERROR    af_unix - ../src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20260112134426.269 ERROR    commands - ../src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_init_pid"
lxc 20260112134426.269 ERROR    commands - ../src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_init_pid"
lxc llama 20260112134426.269 INFO     utils - ../src/lxc/utils.c:run_script_argv:590 - Executing script "/usr/libexec/incus/incusd callhook /var/lib/incus "default" "llama" stop" for container "llama"

We’ve seen reports that on some distros, a run of nvidia-smi on the host is needed to get the driver initialized.
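If that is the cause of the cold-boot failure here, one possible workaround on Void would be to poke the driver once at boot before Incus starts, for example via /etc/rc.local (which Void executes during boot). This is a sketch, assuming the proprietary driver is installed:

```shell
# /etc/rc.local fragment (hypothetical workaround): run nvidia-smi once
# at boot so the NVIDIA driver creates its device nodes before Incus
# starts any containers. "|| true" keeps boot going if the GPU or the
# tool is absent.
nvidia-smi >/dev/null 2>&1 || true
```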


Thank you. That does indeed seem to have solved my issue.