LXD container with GPU: Failed to initialize NVML

Hello, I would like to pass my GPU through to one or more containers.

host/container: ubuntu:20.04

I have carried out the following:

host:

apt install nvidia-headless-450-server nvidia-utils-450-server nvidia-cuda-toolkit
lxc launch ubuntu:20.04 gpu-test
lxc config device add gpu-test gpu gpu

Both nvidia-smi and nvcc -V work on the host.
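To double-check that the device was attached, the container's devices can be listed (gpu-test is the container created above):

lxc config device show gpu-test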

container:

apt install nvidia-headless-450-server nvidia-utils-450-server nvidia-cuda-toolkit

nvcc -V shows the same version.
nvidia-smi says:

Failed to initialize NVML: Driver/library version mismatch

They should be exactly the same versions on host and container.

Any idea what could be the reason for this?
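For reference, one way to compare the versions is to check the loaded kernel module against the installed userspace packages (a quick sketch; package names may differ):

# version of the loaded kernel module (host and container see the same module)
cat /proc/driver/nvidia/version

# versions of the installed userspace packages
dpkg -l | grep nvidia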

Hi!

LXD now has an nvidia.runtime key that makes LXD inject the matching runtime into the container, so you do not have to install it yourself. This way you avoid the version mismatch issues.
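For an existing container, something like this should do (gpu-test stands for your container name):

lxc config set gpu-test nvidia.runtime true
lxc restart gpu-test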

Thank you for the hint, I tried it out.

lxc profile:

config:
  limits.cpu: "3"
  limits.memory: 4GB
  nvidia.runtime: "true"
  security.nesting: "true"
  security.privileged: "true"
description: default gpu
devices:
  eth0:
    nictype: bridged
    parent: br1
    type: nic
  root:
    path: /
    pool: LXDpro
    size: 64GB
    type: disk
name: gpu-pass
used_by:
- /1.0/instances/cuda

lxc launch ubuntu: cuda -p gpu-pass

Creating cuda
Starting cuda
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart cuda /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/cuda/lxc.conf: 
Try `lxc info --show-log local:cuda` for more info

lxc info --show-log cuda

Name: cuda
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/10/13 11:09 UTC
Status: Stopped
Type: container
Profiles: gpu-pass

Log:

lxc cuda 20201013110904.104 WARN     cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.monitor.cuda"
lxc cuda 20201013110904.104 WARN     cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.payload.cuda"
lxc cuda 20201013110904.189 ERROR    conf - conf.c:run_buffer:324 - Script exited with status 1
lxc cuda 20201013110904.190 ERROR    conf - conf.c:lxc_setup:3292 - Failed to run mount hooks
lxc cuda 20201013110904.190 ERROR    start - start.c:do_start:1224 - Failed to setup container "cuda"
lxc cuda 20201013110904.190 ERROR    sync - sync.c:__sync_wait:41 - An error occurred in another process (expected sequence number 5)
lxc cuda 20201013110904.192 WARN     network - network.c:lxc_delete_network_priv:3185 - Failed to rename interface with index 0 from "eth0" to its initial name "vethbbb99d3b"
lxc cuda 20201013110904.192 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:850 - Received container state "ABORTING" instead of "RUNNING"
lxc cuda 20201013110904.192 ERROR    start - start.c:__lxc_start:1999 - Failed to spawn container "cuda"
lxc cuda 20201013110904.192 WARN     start - start.c:lxc_abort:1019 - No such process - Failed to send SIGKILL via pidfd 31 for process 35741
lxc 20201013110904.316 WARN     commands - commands.c:lxc_cmd_rsp_recv:124 - Connection reset by peer - Failed to receive response for command "get_state"

I am on mobile and cannot give full examples.

You need the two nvidia.* keys as shown at https://blog.simos.info/running-x11-software-in-lxd-containers/. The other key enables components of the runtime. For CUDA, you probably need only one component. If unsure, start with “all”.
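If I remember correctly, the second key is nvidia.driver.capabilities; a sketch for your profile would be:

lxc profile set gpu-pass nvidia.runtime true
lxc profile set gpu-pass nvidia.driver.capabilities all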

Do not use the security.* keys unless you need their additional functionality. You can do CUDA without them.
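In your case that would mean dropping them from the profile:

lxc profile unset gpu-pass security.nesting
lxc profile unset gpu-pass security.privileged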

Thank you, the security flag seems to have been the problem; now it works.

I often use Docker containers, and for those I sometimes need the security flag.
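For Docker inside LXD, security.nesting alone is usually enough, without security.privileged; e.g. for a hypothetical container named docker-host:

lxc config set docker-host security.nesting true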