Incus 6.11 Update: Containers with (Nvidia) GPU passthrough will not start

We use Incus (LXC) containers for all our server release testing (some 20+ containers).

Under Incus 6.10, with the Nvidia GPU and the host processor (Quick Sync Video) GPU passed into the container, along with the Nvidia runtime libraries, everything worked flawlessly.

In preparation for updating our server hosts to Incus 6.11, I updated my workstation (Ubuntu 22.04).

Now, any container with the Nvidia runtime library passed into it refuses to launch.
If I remove the nvidia.* configuration statements, the container will launch, but the Nvidia GPU will not function.
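
("Remove" here just means unsetting the nvidia.* keys shown in the config further down, e.g.:)

incus config unset plex nvidia.runtime
incus config unset plex nvidia.driver.capabilities
incus config unset plex nvidia.require.cuda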

Reading the documentation implies that Nvidia-related changes in 6.11 are responsible, but I can't figure out what's actually needed.

I would appreciate some help working out which changes are required.

Steps to reproduce:

  1. Create a generic images:ubuntu/22.04 container
  2. Add the host processor GPU (QSV) and the Nvidia GPU (discrete)
  3. Attempt to launch (see the sketch and output below)
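
A rough sketch of what those steps amount to (the incus-gpu helper used below is shown further down):

# Step 1: create a stock Ubuntu 22.04 container named "plex"
incus launch images:ubuntu/22.04 plex

# Steps 2-3: add the GPUs, set the nvidia.* keys, and restart (done by the helper)
incus-gpu plex

The actual attempt: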
[chuck@lizum ~.2012]$ incus-gpu plex
Device GPUs added to plex
GPU configuration added to 'plex'
Restarting plex
Error: Failed to run: /opt/incus/bin/incusd forkstart plex /var/lib/incus/containers /run/incus/plex/lxc.conf: exit status 1
Try `incus info --show-log plex` for more info
[chuck@lizum ~.2013]$ incus info --show-log plex
Name: plex
Description: 
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2025/04/04 15:56 EDT
Last Used: 2025/04/04 15:57 EDT

Log:

lxc plex 20250404195744.224 ERROR    utils - ../src/lxc/utils.c:run_buffer:571 - Script exited with status 1
lxc plex 20250404195744.224 ERROR    conf - ../src/lxc/conf.c:lxc_setup:3948 - Failed to run mount hooks
lxc plex 20250404195744.224 ERROR    start - ../src/lxc/start.c:do_start:1268 - Failed to setup container "plex"
lxc plex 20250404195744.224 ERROR    sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc plex 20250404195744.228 WARN     network - ../src/lxc/network.c:lxc_delete_network_priv:3674 - Failed to rename interface with index 0 from "eth0" to its initial name "veth53c800f8"
lxc plex 20250404195744.228 ERROR    lxccontainer - ../src/lxc/lxccontainer.c:wait_on_daemonized_start:837 - Received container state "ABORTING" instead of "RUNNING"
lxc plex 20250404195744.228 ERROR    start - ../src/lxc/start.c:__lxc_start:2114 - Failed to spawn container "plex"
lxc plex 20250404195744.228 WARN     start - ../src/lxc/start.c:lxc_abort:1037 - No such process - Failed to send SIGKILL via pidfd 17 for process 2815984
lxc 20250404195744.290 ERROR    af_unix - ../src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20250404195744.290 ERROR    commands - ../src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_init_pid"

[chuck@lizum ~.2014]$ 

This is the entire container config:

[chuck@lizum ~.2014]$ incus config show plex
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu jammy amd64 (20250402_07:42)
  image.os: Ubuntu
  image.release: jammy
  image.serial: "20250402_07:42"
  image.type: squashfs
  image.variant: default
  nvidia.driver.capabilities: all
  nvidia.require.cuda: "true"
  nvidia.runtime: "true"
  volatile.base_image: 6367014c9f0e92a6508be10660abb609e23fd0e9e03497d23f4f57679ede93ac
  volatile.cloud-init.instance-id: a3da4afe-f7fa-4cb4-ac9b-fe29f2846e92
  volatile.eth0.hwaddr: 10:66:6a:52:b3:05
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: c8c51e09-9c4c-48fe-a8ac-c42d33a08e1c
  volatile.uuid.generation: c8c51e09-9c4c-48fe-a8ac-c42d33a08e1c
devices:
  GPUs:
    gid: "110"
    type: gpu
ephemeral: false
profiles:
- default
stateful: false
description: ""
[chuck@lizum ~.2015]$ 

The setup script for our containers is very basic (kept that way for uniformity).

This is the part of the script that adds both the Nvidia and host GPUs to the container:


Gid="$(stat -c %g /dev/dri/renderD128)"

# Add it (Both Intel and Nvidia)
incus config device add "$1" GPUs gpu gid=$Gid

# Add Nvidia runtime 
incus config set "$1" nvidia.driver.capabilities all
incus config set "$1" nvidia.require.cuda true
incus config set "$1" nvidia.runtime true

Additional info:

Existing containers which use the Nvidia GPU through a profile work as expected.
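
Those containers simply have a shared nvidia profile attached (its contents are shown after the config dump below); attaching it is just:

incus profile add plexdev nvidia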

[chuck@lizum ~.2038]$ incus config show plexdev
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu jammy amd64 (20250222_07:42)
  image.os: Ubuntu
  image.release: jammy
  image.serial: "20250222_07:42"
  image.type: squashfs
  image.variant: default
  volatile.base_image: 9aab5e0a0a792348c8e8dc60e4d61aedd4360e0f89de9969c081721100f6fdcc
  volatile.cloud-init.instance-id: 6e8f49f9-c81e-4786-b9f6-3421ee1e4ad5
  volatile.eth0.host_name: veth5571e4e7
  volatile.eth0.hwaddr: 00:16:3e:c0:72:21
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 93ec99db-15bb-42e8-91cc-c7f7dad6e95b
  volatile.uuid.generation: 93ec99db-15bb-42e8-91cc-c7f7dad6e95b
devices:
  git:
    path: /git
    source: /glock/git/plex-media-server/postproc/
    type: disk
  postproc:
    path: /postproc
    source: /glock/git/plex-media-server/postproc
    type: disk
ephemeral: false
profiles:
- default
- nvidia
stateful: false
description: ""
[chuck@lizum ~.2039]$ incus shell plexdev
root@plexdev:~# nvidia-smi
Fri Apr  4 21:13:38 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77                 Driver Version: 565.77         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 2000 Ada Gene...    Off |   00000000:01:00.0  On |                  Off |
| 30%   43C    P8              8W /   70W |    1815MiB /  16380MiB |     16%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@plexdev:~# 
[chuck@lizum ~.2016]$ incus profile show nvidia
config:
  nvidia.driver.capabilities: all
  nvidia.require.cuda: "true"
  nvidia.runtime: "true"
description: Nvidia GPU profile
devices:
  gpus:
    gid: "110"
    type: gpu
name: nvidia
used_by:
- /1.0/instances/plexdev
project: default
[chuck@lizum ~.2017]$ 
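
For anyone wanting to replicate that profile, it can be recreated with the standard profile commands, roughly (the gid matches the render group on this host):

incus profile create nvidia
incus profile set nvidia nvidia.driver.capabilities=all
incus profile set nvidia nvidia.require.cuda=true
incus profile set nvidia nvidia.runtime=true
incus profile device add nvidia gpus gpu gid=110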

Greetings, ChuckY. I’m in a similar situation due to the 6.11 update.
My containers with nvidia.runtime: "true" refuse to start; if I change the value to "false" they start, but obviously without using the graphics card. I've also tried the profile-based approach from your recent post, but it doesn't work for me. I'd appreciate any guidance.

Hey, I ran into this issue yesterday, and I'm still experiencing it after updating to incus/noble,now 1:6.11-ubuntu24.04-202504052021 amd64 and rebooting. It appears to be the same issue on this setup.

Configuration:

gage@r730:~$ incus config show openwebui1 --expanded
architecture: x86_64
config:
  environment.ANONYMIZED_TELEMETRY: "false"
  environment.DO_NOT_TRACK: "true"
  environment.DOCKER: "true"
  environment.ENV: prod
  environment.GPG_KEY: A035C8C19219BA821ECEA86B64E628F8D684696D
  environment.HF_HOME: /app/backend/data/cache/embedding/models
  environment.HOME: /root
  environment.LANG: C.UTF-8
  environment.OLLAMA_BASE_URL: /ollama
  environment.PATH: /usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
  environment.PORT: "8080"
  environment.PYTHON_SHA256: 2a9920c7a0cd236de33644ed980a13cbbc21058bfdc528febb6081575ed73be3
  environment.PYTHON_VERSION: 3.11.11
  environment.RAG_EMBEDDING_MODEL: sentence-transformers/all-MiniLM-L6-v2
  environment.SCARF_NO_ANALYTICS: "true"
  environment.SENTENCE_TRANSFORMERS_HOME: /app/backend/data/cache/embedding/models
  environment.TERM: xterm
  environment.TIKTOKEN_CACHE_DIR: /app/backend/data/cache/tiktoken
  environment.TIKTOKEN_ENCODING_NAME: cl100k_base
  environment.USE_CUDA_DOCKER: "false"
  environment.USE_CUDA_DOCKER_VER: cu121
  environment.USE_EMBEDDING_MODEL_DOCKER: sentence-transformers/all-MiniLM-L6-v2
  environment.USE_OLLAMA_DOCKER: "true"
  environment.WEBUI_BUILD_VERSION: 04799f1f95f958674d35ba4854ef62754a4d332e
  environment.WHISPER_MODEL: base
  environment.WHISPER_MODEL_DIR: /app/backend/data/cache/whisper/models
  image.architecture: x86_64
  image.description: ghcr.io/open-webui/open-webui (OCI)
  image.id: open-webui/open-webui:ollama
  image.type: oci
  limits.cpu: "30"
  limits.memory: 762GiB
  nvidia.driver.capabilities: all
  nvidia.require.cuda: "true"
  nvidia.runtime: "true"
  oci.cwd: /app/backend
  oci.entrypoint: bash start.sh
  oci.gid: "0"
  oci.uid: "0"
  volatile.base_image: db1cbb159ee2074b3beb00bd5f958660c4e1cc27dce66b355c7b1dfaf076956b
  volatile.cloud-init.instance-id: 8a40c341-5d07-4da1-af1e-b074ef20dae3
  volatile.container.oci: "true"
  volatile.eth30.hwaddr: 10:66:6a:8d:d5:38
  volatile.eth30.name: eth0
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: c06f3e08-b9c0-490e-8145-3f753e14c8c8
  volatile.uuid.generation: c06f3e08-b9c0-490e-8145-3f753e14c8c8
devices:
  data:
    path: /app/backend/data
    pool: default
    source: openwebui1-data
    type: disk
  eth30:
    mtu: "1500"
    nictype: bridged
    parent: br10
    type: nic
    vlan: "30"
  gpu:
    type: gpu
  ollama:
    path: /root/.ollama
    pool: default
    source: openwebui1-ollama
    type: disk
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- defaultContainer
- vlan30
stateful: false
description: ""

Output of incus info --show-log

gage@r730:~$ incus info --show-log openwebui1
Name: openwebui1
Description: 
Status: STOPPED
Type: container (application)
Architecture: x86_64
Created: 2025/04/05 05:39 UTC
Last Used: 2025/04/06 01:23 UTC

Log:

lxc openwebui1 20250406012341.173 ERROR    utils - ../src/lxc/utils.c:run_buffer:571 - Script exited with status 1
lxc openwebui1 20250406012341.173 ERROR    conf - ../src/lxc/conf.c:lxc_setup:3948 - Failed to run mount hooks
lxc openwebui1 20250406012341.173 ERROR    start - ../src/lxc/start.c:do_start:1273 - Failed to setup container "openwebui1"
lxc openwebui1 20250406012341.173 ERROR    sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc openwebui1 20250406012341.184 WARN     network - ../src/lxc/network.c:lxc_delete_network_priv:3674 - Failed to rename interface with index 0 from "eth0" to its initial name "veth28643b19"
lxc openwebui1 20250406012341.184 ERROR    lxccontainer - ../src/lxc/lxccontainer.c:wait_on_daemonized_start:837 - Received container state "ABORTING" instead of "RUNNING"
lxc openwebui1 20250406012341.184 ERROR    start - ../src/lxc/start.c:__lxc_start:2119 - Failed to spawn container "openwebui1"
lxc openwebui1 20250406012341.184 WARN     start - ../src/lxc/start.c:lxc_abort:1037 - No such process - Failed to send SIGKILL via pidfd 17 for process 5455
lxc 20250406012341.286 ERROR    af_unix - ../src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20250406012341.286 ERROR    commands - ../src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_init_pid"

^ That’s the problem.

In the past we were basically overriding the PATH variable for you; now we respect it.
Because that variable doesn't include /opt/incus/bin/, where the nvidia tools are located, your container doesn't see them and startup fails.

Either add :/opt/incus/bin to the variable or unset it altogether; either way, that should fix it.
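
Something along these lines should do it for your openwebui1 container (a sketch; the PATH below is just the value from your config with /opt/incus/bin appended):

# Option 1: append the Incus tools directory to the PATH override
incus config set openwebui1 environment.PATH="/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/incus/bin"

# Option 2: drop the PATH override entirely
incus config unset openwebui1 environment.PATH

# Then start the container again
incus start openwebui1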

That did it, thank you!

@stgraber Is there going to be a new release with this patch? Or is it possible to downgrade to a previous working version (on NixOS)?

We're not currently planning to do a release just for that single commit, as we normally coordinate point releases across all our projects.

We've been encouraging distribution maintainers to pull that one extra commit into their packages. Maybe @adamcstephens can help get that into the Nix one?

It will be a couple of days before I can get to adding a patch to the NixOS package, and then another few days after that for it to make it into a release branch. I'd be happy to review a PR if someone else has the time. Otherwise, it's relatively simple to use overrideAttrs to add the patch to the package locally in your config.

@adamcstephens any luck with this?