Problems trying to use nvidia.runtime with snap LXD 3.0.0

Hi there!

I wanted to reproduce the asciinema example for the new nvidia.runtime feature from the 3.0.0 announcement post on my server, but I’m running into an issue. After adding the GPU device and setting nvidia.runtime to true, the container refuses to start…

(Note: the host is running Ubuntu 16.04 LTS with the latest updates. LXD is installed as a snap package.)

me@host:~$ snap info lxd
name:      lxd
summary:   System container manager and API
publisher: canonical
contact:   https://github.com/lxc/lxd/issues
license:   unknown
description: <omitted>
services:
  lxd.daemon: simple, enabled, active
snap-id:   J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:  3.0/stable
refreshed: 2018-05-03T22:59:39+08:00
installed:       3.0.0       (6960) 56MB -
channels:                           
  stable:        3.0.0       (6954) 56MB -
  candidate:     3.0.0       (7018) 56MB -
  beta:          ↑                       
  edge:          git-a81aac8 (7013) 56MB -
  2.0/stable:    2.0.11      (6627) 27MB -
  2.0/candidate: 2.0.11      (7028) 28MB -
  2.0/beta:      ↑                       
  2.0/edge:      git-e48b686 (7029) 26MB -
  3.0/stable:    3.0.0       (6960) 56MB -
  3.0/candidate: 3.0.0       (7021) 56MB -
  3.0/beta:      ↑                       
  3.0/edge:      git-8669276 (7009) 56MB -

On the host:

me@host:~$ nvidia-smi 
Wed May  9 10:26:14 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K4200        Off  | 00000000:03:00.0  On |                  N/A |
| 30%   39C    P8    14W / 110W |     94MiB /  4028MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8689      G   /usr/lib/xorg/Xorg                            91MiB |
+-----------------------------------------------------------------------------+
me@host:~$ lxc list
+---------+---------+---------------------+----------------------------------------------+------------+-----------+
|  NAME   |  STATE  |        IPV4         |                     IPV6                     |    TYPE    | SNAPSHOTS |
+---------+---------+---------------------+----------------------------------------------+------------+-----------+
| ubuntu1 | RUNNING | 10.70.242.96 (eth0) | fd42:835b:2b82:ec76:216:3eff:feb3:125 (eth0) | PERSISTENT | 0         |
+---------+---------+---------------------+----------------------------------------------+------------+-----------+
me@host:~$ lxc config device add ubuntu1 k4200 gpu     
Device k4200 added to ubuntu1
me@host:~$ lxc config set ubuntu1 nvidia.runtime true
me@host:~$ lxc restart ubuntu1
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart ubuntu1 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/ubuntu1/lxc.conf: 
Try `lxc info --show-log ubuntu1` for more info
me@host:~$ lxc info --show-log ubuntu1
Name: ubuntu1
Remote: unix://
Architecture: x86_64
Created: 2018/05/01 07:07 UTC
Status: Stopped
Type: persistent
Profiles: default

Log:

lxc 20180509022728.929 WARN     lxc_conf - conf.c:lxc_map_ids:2831 - newuidmap binary is missing
lxc 20180509022728.929 WARN     lxc_conf - conf.c:lxc_map_ids:2837 - newgidmap binary is missing
lxc 20180509022728.932 WARN     lxc_conf - conf.c:lxc_map_ids:2831 - newuidmap binary is missing
lxc 20180509022728.932 WARN     lxc_conf - conf.c:lxc_map_ids:2837 - newgidmap binary is missing
lxc 20180509022729.184 ERROR    lxc_conf - conf.c:run_buffer:347 - Script exited with status 127
lxc 20180509022729.184 ERROR    lxc_conf - conf.c:lxc_setup:3391 - Failed to run mount hooks
lxc 20180509022729.184 ERROR    lxc_start - start.c:do_start:1198 - Failed to setup container "ubuntu1"
lxc 20180509022729.184 ERROR    lxc_sync - sync.c:__sync_wait:57 - An error occurred in another process (expected sequence number 5)
lxc 20180509022729.255 ERROR    lxc_container - lxccontainer.c:wait_on_daemonized_start:824 - Received container state "ABORTING" instead of "RUNNING"
lxc 20180509022729.255 ERROR    lxc_start - start.c:__lxc_start:1866 - Failed to spawn container "ubuntu1"
lxc 20180509022729.256 WARN     lxc_conf - conf.c:lxc_map_ids:2831 - newuidmap binary is missing
lxc 20180509022729.256 WARN     lxc_conf - conf.c:lxc_map_ids:2837 - newgidmap binary is missing
lxc 20180509022729.259 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"

I’ve tried lxc start --debug ubuntu1, but the output was only related to the API exchanges with the LXD daemon and didn’t have anything useful.

However, if I unset the nvidia.runtime config key, the container starts again. But then nvidia-smi is not found inside the container…

me@host:~$ lxc config unset ubuntu1 nvidia.runtime
me@host:~$ lxc start ubuntu1
me@host:~$ lxc exec ubuntu1 bash
root@ubuntu1:~# nvidia-smi
nvidia-smi: command not found

Do you have any idea what could be wrong and where to look for clues?

What does nvidia-container-cli info show?

The error above shows that the nvidia LXC hook has failed, but unfortunately it doesn’t really elaborate on what failed exactly…

Where is it supposed to be? On the host or inside the container?
Currently the command is not found on the host, and in the container (without the nvidia.runtime option set) it is not found either… And enabling nvidia.runtime prevents the container from starting…
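
(Nothing fancy, but for the record this is roughly the kind of check I used in both places:)

# sketch of the check; neither location has the binary at this point
command -v nvidia-container-cli || echo "not found on the host"
lxc exec ubuntu1 -- sh -c 'command -v nvidia-container-cli || echo "not found in the container"'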

[2 minutes later]

Ok this is what I found:

me@host:~$ locate nvidia-container
/snap/lxd/6864/bin/nvidia-container-cli
/snap/lxd/6882/bin/nvidia-container-cli
/snap/lxd/6960/bin/nvidia-container-cli
me@host:~$ /snap/lxd/6960/bin/nvidia-container-cli info
basename: missing operand
Try 'basename --help' for more information.
/snap/lxd/6960/bin/nvidia-container-cli: 8: exec: /var/lib/snapd/hostfs/usr/bin/nvidia-container-cli: not found

Looking at /snap/lxd/6960/bin/nvidia-container-cli reveals that it is a wrapper:

#!/bin/sh

# Set environment to run nvidia-container-cli from the host system
export SNAP_CURRENT="$(realpath "${SNAP}/..")/current"
export ARCH="$(basename $(readlink -f ${SNAP_CURRENT}/lib/*-linux-gnu/))"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}:/var/lib/snapd/hostfs/usr/lib/${ARCH}"

exec /var/lib/snapd/hostfs/usr/bin/nvidia-container-cli -r /var/lib/snapd/hostfs/ "$@"

So I tried to debug it a bit:

First of all, I don’t know where the SNAP env var should be coming from. This environment variable is not set on the host, nor in the container.

Without it, the wrapper script fails:

The first variable (SNAP_CURRENT) will not be set correctly:

me@host:~$ export SNAP_CURRENT="$(realpath "${SNAP}/..")/current"
me@host:~$ echo $SNAP_CURRENT
//current

Which cascades into:

me@host:~$ export ARCH="$(basename $(readlink -f ${SNAP_CURRENT}/lib/*-linux-gnu/))"
zsh: no matches found: //current/lib/*-linux-gnu/
basename: missing operand
Try 'basename --help' for more information.
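
(My guess is that snapd normally sets SNAP itself when a command is launched through the snap, so the wrapper isn’t really meant to be run from a plain shell. If that’s right, something like this should give a shell where the variable is set — the exact app name here is my assumption:)

# assumption: `snap run --shell` enters the snap's environment, where $SNAP is set
snap run --shell lxd.lxc
echo "$SNAP"    # should print something like /snap/lxd/<revision>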

Sooo, my working hypothesis was that the SNAP env var is meant to be set to something like this:

me@host:~$ export SNAP="/snap/lxd/6960"

Then:

me@host:~$ export SNAP_CURRENT="$(realpath "${SNAP}/..")/current"
me@host:~$ echo $SNAP_CURRENT                    
/snap/lxd/current
me@host:~$ export ARCH="$(basename $(readlink -f ${SNAP_CURRENT}/lib/*-linux-gnu/))"
me@host:~$ echo $ARCH
x86_64-linux-gnu
me@host:~$ export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}:/var/lib/snapd/hostfs/usr/lib/${ARCH}"
me@host:~$ echo $LD_LIBRARY_PATH 
/home/redacted/bin/gurobi751/linux64/lib:/var/lib/snapd/hostfs/usr/lib/x86_64-linux-gnu

So it looks like it should work now:

me@host:~$ /snap/lxd/6960/bin/nvidia-container-cli info                                       
/snap/lxd/6960/bin/nvidia-container-cli: 8: exec: /var/lib/snapd/hostfs/usr/bin/nvidia-container-cli: not found

Still not :frowning: (but the basename: missing operand error is gone)

me@host:~$ ls -l /var/lib/snapd/hostfs 
total 0

Rhaaa, I should be doing this from inside the snap’s mount namespace, where /var/lib/snapd/hostfs is actually populated. OK then:

sudo nsenter -t $(pgrep daemon.start) -m bash

And now let’s try it again:

me@lxd-snap:~$ ls -l /var/lib/snapd/hostfs
total 116
drwxr-xr-x     2 root root   4096 Apr 27 09:41 bin
drwxr-xr-x     3 root root   4096 May  9 22:23 boot
...

The SNAP env var is still not set by default, so I set it manually:

me@lxd-snap:~$ export SNAP="/snap/lxd/6960"

Moment of truth:

me@lxd-snap:~$ /snap/lxd/6960/bin/nvidia-container-cli info
/snap/lxd/6960/bin/nvidia-container-cli: 8: exec: /var/lib/snapd/hostfs/usr/bin/nvidia-container-cli: not found

Still not!?

me@lxd-snap:~$ ls /var/lib/snapd/hostfs/usr/bin/nvidia-*
/var/lib/snapd/hostfs/usr/bin/nvidia-bug-report.sh     /var/lib/snapd/hostfs/usr/bin/nvidia-persistenced
/var/lib/snapd/hostfs/usr/bin/nvidia-cuda-mps-control  /var/lib/snapd/hostfs/usr/bin/nvidia-settings
/var/lib/snapd/hostfs/usr/bin/nvidia-cuda-mps-server   /var/lib/snapd/hostfs/usr/bin/nvidia-smi
/var/lib/snapd/hostfs/usr/bin/nvidia-debugdump         /var/lib/snapd/hostfs/usr/bin/nvidia-xconfig
/var/lib/snapd/hostfs/usr/bin/nvidia-detector

So, now I reckon that I’m missing nvidia-container-cli on the host under /usr/bin.

Why is it not there?

Some more Googling, and:

me@host:~$ curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -   
[sudo] password for redacted: 
OK
me@host:~$ curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list 
deb https://nvidia.github.io/libnvidia-container/ubuntu16.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/$(ARCH) /
me@host:~$ sudo apt update
me@host:~$ sudo apt install nvidia-container-runtime
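
(A note for anyone copy-pasting the above: the $distribution variable comes from NVIDIA’s install instructions and needs to be set beforehand, if I remember correctly something like this:)

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)    # e.g. "ubuntu16.04" here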

And now:

me@host:~$ ls /usr/bin/nvidia-*
/usr/bin/nvidia-bug-report.sh           /usr/bin/nvidia-cuda-mps-control  /usr/bin/nvidia-persistenced
/usr/bin/nvidia-container-cli           /usr/bin/nvidia-cuda-mps-server   /usr/bin/nvidia-settings
/usr/bin/nvidia-container-runtime       /usr/bin/nvidia-debugdump         /usr/bin/nvidia-smi
/usr/bin/nvidia-container-runtime-hook  /usr/bin/nvidia-detector          /usr/bin/nvidia-xconfig

Yay, there’s hope again :tada:. Let’s try it now:

me@host:~$ nvidia-container-cli info                   
NVRM version:   384.111
CUDA version:   9.0

Device Index:   0
Device Minor:   0
Model:          Quadro K4200
GPU UUID:       GPU-6e508cda-5398-daa2-a0c4-773fb6c2d753
Bus Location:   00000000:03:00.0
Architecture:   3.0

Ok one step further:

me@host:~$ lxc config set ubuntu1 nvidia.runtime true
me@host:~$ lxc restart ubuntu1

Good, the container starts now. Let’s look inside:

me@host:~$ lxc exec ubuntu1 bash
me@container:~$ nvidia-smi
Thu May 10 04:41:15 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K4200        Off  | 00000000:03:00.0  On |                  N/A |
| 30%   39C    P8    14W / 110W |     94MiB /  4028MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Looks like it’s working now :sunny:

So, in effect, the missing piece of information was that I needed to apt install nvidia-container-runtime on the host. This was not part of the asciinema video and may be confusing for first-time users like me.

I hope this will help other people too.

I would suggest checking for the existence of the binary in the wrapper script and printing a better error message for users, something like:

Error: it seems that you are missing nvidia-container-runtime. Install it first.

I installed it by following the instructions from Migration Notice | nvidia-container-runtime, but I’m not sure whether this is the only way to do so. If it is indeed the only way, I suggest including that URL in the error message.
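
For illustration, here is a rough sketch of what such a check could look like near the top of the snap’s wrapper script (the wording and URL are only my suggestion):

# hypothetical addition to the wrapper script shown above
HOST_CLI="/var/lib/snapd/hostfs/usr/bin/nvidia-container-cli"
if [ ! -x "${HOST_CLI}" ]; then
    echo "Error: it seems that you are missing nvidia-container-runtime on the host." >&2
    echo "Install it first (see https://nvidia.github.io/nvidia-container-runtime/)." >&2
    exit 1
fi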

We do actually have a check for nvidia-container-cli in LXD but it looks like the snap wrapper is effectively making that check pass even when it shouldn’t. Will need to look into that.
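
Roughly speaking, inside the snap environment a plain lookup will find the wrapper itself even when the host binary is absent, something like:

# illustration only, from inside the snap's environment
command -v nvidia-container-cli
# -> /snap/lxd/current/bin/nvidia-container-cli   (the wrapper, always present)
ls /var/lib/snapd/hostfs/usr/bin/nvidia-container-cli
# -> No such file or directory                    (the real binary, missing here)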

Hey @stgraber, I’m having an issue with this where my network bridge is disabled when I set nvidia.runtime to true. After restarting my container with that option enabled, the interface is missing from ifconfig. The NVIDIA GPU appears to be detected when I run nvidia-smi, but since I can’t ssh -X into the container anymore, it’s kinda useless. Note that the container does start; I’m able to log into it with lxc exec dev bash.

Setting the flag back to false and restarting restores my bridge and ability to ssh.
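
(i.e. something like:)

lxc config set dev nvidia.runtime false    # or: lxc config unset dev nvidia.runtime
lxc restart dev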

Here’s what my preseed profile looks like:

networks:
- name: devbr0
  type: bridge
  config:
    ipv4.address: 10.0.2.1/24
    ipv4.nat: "true"
    ipv6.address: none
profiles:
- name: default
  devices:
    eth0:
        name: eth0
        ipv4.address: 10.0.2.42
        parent: devbr0
        type: nic 
        nictype: bridged
    homedir:
        path: /home/james/devel
        source: /home/james/devel
        type: disk
  config:
    raw.idmap: "both 1000 1001"
- name: gui 
  config:
    environment.DISPLAY: :0
    raw.idmap: "both 1000 1001"
    user.user-data: |
      #cloud-config
      runcmd:
        - 'sed -i "s/; enable-shm = yes/enable-shm = no/g" /etc/pulse/client.conf'
        - 'echo export PULSE_SERVER=unix:/tmp/.pulse-native | tee --append /home/ubuntu/.profile'
      packages:
        - x11-apps
        - mesa-utils
        - pulseaudio
  description: GUI LXD profile
  devices:
    PASocket:
      path: /tmp/.pulse-native
      source: /run/user/1000/pulse/native
      type: disk
    X0: 
      path: /tmp/.X11-unix/X0
      source: /tmp/.X11-unix/X0
      type: disk
    mygpu:
      type: gpu 
  name: gui 
  used_by: