LXD container with GPU: Failed to initialize NVML

Hello, I would like to pass my GPU through to one or more containers.

host/container: ubuntu:20.04

I have carried out the following:

host:

apt install nvidia-headless-450-server nvidia-utils-450-server nvidia-cuda-toolkit
lxc launch ubuntu:20.04 gpu-test
lxc config device add gpu-test gpu gpu

Both nvidia-smi and nvcc -V work on the host.
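To double-check that the device was attached, the container's devices can be listed (gpu-test is the container created above):

lxc config device show gpu-test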

container:

apt install nvidia-headless-450-server nvidia-utils-450-server nvidia-cuda-toolkit

nvcc -V shows the same version.
nvidia-smi says:

Failed to initialize NVML: Driver/library version mismatch

They should be exactly the same versions on host and container.

Any idea what could be the reason for this?
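For reference, one way to compare the versions is to check the loaded kernel module against the installed userspace packages (a quick sketch; package names may differ):

# version of the loaded kernel module (host and container see the same module)
cat /proc/driver/nvidia/version

# versions of the installed userspace packages
dpkg -l | grep nvidia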

Hi!

LXD now has an nvidia.runtime key that makes LXD inject the matching runtime into the container, so you do not have to install it yourself. This way you avoid the version mismatch issues.
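For an existing container, something like this should do (gpu-test stands for your container name):

lxc config set gpu-test nvidia.runtime true
lxc restart gpu-test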

Thank you for the hint, I tried it out.

lxc profile:

config:
  limits.cpu: "3"
  limits.memory: 4GB
  nvidia.runtime: "true"
  security.nesting: "true"
  security.privileged: "true"
description: default gpu
devices:
  eth0:
    nictype: bridged
    parent: br1
    type: nic
  root:
    path: /
    pool: LXDpro
    size: 64GB
    type: disk
name: gpu-pass
used_by:
- /1.0/instances/cuda

lxc launch ubuntu: cuda -p gpu-pass

Creating cuda
Starting cuda
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart cuda /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/cuda/lxc.conf: 
Try `lxc info --show-log local:cuda` for more info

lxc info --show-log cuda

Name: cuda
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/10/13 11:09 UTC
Status: Stopped
Type: container
Profiles: gpu-pass

Log:

lxc cuda 20201013110904.104 WARN     cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.monitor.cuda"
lxc cuda 20201013110904.104 WARN     cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.payload.cuda"
lxc cuda 20201013110904.189 ERROR    conf - conf.c:run_buffer:324 - Script exited with status 1
lxc cuda 20201013110904.190 ERROR    conf - conf.c:lxc_setup:3292 - Failed to run mount hooks
lxc cuda 20201013110904.190 ERROR    start - start.c:do_start:1224 - Failed to setup container "cuda"
lxc cuda 20201013110904.190 ERROR    sync - sync.c:__sync_wait:41 - An error occurred in another process (expected sequence number 5)
lxc cuda 20201013110904.192 WARN     network - network.c:lxc_delete_network_priv:3185 - Failed to rename interface with index 0 from "eth0" to its initial name "vethbbb99d3b"
lxc cuda 20201013110904.192 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:850 - Received container state "ABORTING" instead of "RUNNING"
lxc cuda 20201013110904.192 ERROR    start - start.c:__lxc_start:1999 - Failed to spawn container "cuda"
lxc cuda 20201013110904.192 WARN     start - start.c:lxc_abort:1019 - No such process - Failed to send SIGKILL via pidfd 31 for process 35741
lxc 20201013110904.316 WARN     commands - commands.c:lxc_cmd_rsp_recv:124 - Connection reset by peer - Failed to receive response for command "get_state"

I am on mobile and cannot give full examples.

You need the two nvidia.* keys as shown at https://blog.simos.info/running-x11-software-in-lxd-containers/. The other key enables components of the runtime. For CUDA, you probably need only one component. If unsure, start with “all”.
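If I remember correctly, the second key is nvidia.driver.capabilities; a sketch for your profile would be:

lxc profile set gpu-pass nvidia.runtime true
lxc profile set gpu-pass nvidia.driver.capabilities all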

Do not use the security.* keys unless you need their additional functionality. You can do CUDA without them.
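In your case that would mean dropping them from the profile:

lxc profile unset gpu-pass security.nesting
lxc profile unset gpu-pass security.privileged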

Thank you, the security flag seems to have been the problem; now it works.

I often use Docker containers, and for those I sometimes need the security flag.
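For Docker inside LXD, security.nesting alone is usually enough, without security.privileged; e.g. for a hypothetical container named docker-host:

lxc config set docker-host security.nesting true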