Hi there!
I wanted to reproduce the ASCIInema example for the new nvidia.runtime feature from the 3.0.0 announcement post on my server, but ran into an issue. After adding the GPU device and setting nvidia.runtime to true, the container refuses to start…
(Note on the host: it runs Ubuntu 16.04 LTS with the latest updates, and LXD is installed as a snap package.)
me@host:~$ snap info lxd
name: lxd
summary: System container manager and API
publisher: canonical
contact: https://github.com/lxc/lxd/issues
license: unknown
description: <omitted>
services:
  lxd.daemon: simple, enabled, active
snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: 3.0/stable
refreshed: 2018-05-03T22:59:39+08:00
installed: 3.0.0 (6960) 56MB -
channels:
  stable: 3.0.0 (6954) 56MB -
  candidate: 3.0.0 (7018) 56MB -
  beta: ↑
  edge: git-a81aac8 (7013) 56MB -
  2.0/stable: 2.0.11 (6627) 27MB -
  2.0/candidate: 2.0.11 (7028) 28MB -
  2.0/beta: ↑
  2.0/edge: git-e48b686 (7029) 26MB -
  3.0/stable: 3.0.0 (6960) 56MB -
  3.0/candidate: 3.0.0 (7021) 56MB -
  3.0/beta: ↑
  3.0/edge: git-8669276 (7009) 56MB -
On the host:
me@host:~$ nvidia-smi
Wed May 9 10:26:14 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K4200 Off | 00000000:03:00.0 On | N/A |
| 30% 39C P8 14W / 110W | 94MiB / 4028MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 8689 G /usr/lib/xorg/Xorg 91MiB |
+-----------------------------------------------------------------------------+
me@host:~$ lxc list
+---------+---------+---------------------+----------------------------------------------+------------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+---------+---------+---------------------+----------------------------------------------+------------+-----------+
| ubuntu1 | RUNNING | 10.70.242.96 (eth0) | fd42:835b:2b82:ec76:216:3eff:feb3:125 (eth0) | PERSISTENT | 0 |
+---------+---------+---------------------+----------------------------------------------+------------+-----------+
me@host:~$ lxc config device add ubuntu1 k4200 gpu
Device k4200 added to ubuntu1
me@host:~$ lxc config set ubuntu1 nvidia.runtime true
me@host:~$ lxc restart ubuntu1
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart ubuntu1 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/ubuntu1/lxc.conf:
Try `lxc info --show-log ubuntu1` for more info
me@host:~$ lxc info --show-log ubuntu1
Name: ubuntu1
Remote: unix://
Architecture: x86_64
Created: 2018/05/01 07:07 UTC
Status: Stopped
Type: persistent
Profiles: default
Log:
lxc 20180509022728.929 WARN lxc_conf - conf.c:lxc_map_ids:2831 - newuidmap binary is missing
lxc 20180509022728.929 WARN lxc_conf - conf.c:lxc_map_ids:2837 - newgidmap binary is missing
lxc 20180509022728.932 WARN lxc_conf - conf.c:lxc_map_ids:2831 - newuidmap binary is missing
lxc 20180509022728.932 WARN lxc_conf - conf.c:lxc_map_ids:2837 - newgidmap binary is missing
lxc 20180509022729.184 ERROR lxc_conf - conf.c:run_buffer:347 - Script exited with status 127
lxc 20180509022729.184 ERROR lxc_conf - conf.c:lxc_setup:3391 - Failed to run mount hooks
lxc 20180509022729.184 ERROR lxc_start - start.c:do_start:1198 - Failed to setup container "ubuntu1"
lxc 20180509022729.184 ERROR lxc_sync - sync.c:__sync_wait:57 - An error occurred in another process (expected sequence number 5)
lxc 20180509022729.255 ERROR lxc_container - lxccontainer.c:wait_on_daemonized_start:824 - Received container state "ABORTING" instead of "RUNNING"
lxc 20180509022729.255 ERROR lxc_start - start.c:__lxc_start:1866 - Failed to spawn container "ubuntu1"
lxc 20180509022729.256 WARN lxc_conf - conf.c:lxc_map_ids:2831 - newuidmap binary is missing
lxc 20180509022729.256 WARN lxc_conf - conf.c:lxc_map_ids:2837 - newgidmap binary is missing
lxc 20180509022729.259 WARN lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"
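As far as I understand, exit status 127 is what a POSIX shell reports when a command cannot be found, so my guess (and it is only a guess) is that the mount hook tries to run a binary that is missing or not on its PATH. A minimal sketch of that convention, using a made-up command name that is assumed not to exist on the system:

```shell
# The command name below is hypothetical and chosen so that it should not
# exist on PATH; the shell then exits with 127 ("command not found").
sh -c 'definitely-not-a-real-command-xyz' 2>/dev/null
status=$?
echo "exit status: $status"
```

This only illustrates the meaning of the status code. It does not show which command the hook actually failed to find.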
I’ve tried lxc start --debug ubuntu1, but the output only covered the API exchanges with the LXD daemon and didn’t contain anything useful.
However, if I unset the nvidia.runtime config setting, then I can start the container. But nvidia-smi is not found in the container…
me@host:~$ lxc config unset ubuntu1 nvidia.runtime
me@host:~$ lxc start ubuntu1
me@host:~$ lxc exec ubuntu1 bash
root@ubuntu1:~# nvidia-smi
nvidia-smi: command not found
Do you have any idea what could be wrong and where to look for clues?