Background
I’m attempting to containerize my development environment for ROS development.
For the past year I’ve been using Docker for this, but I don’t like Docker’s layered-filesystem approach. Since I want my container instances to be long-lived, I feel that OS-level containerization with LXC is the best solution for me.
I’m hoping to have one “ROS-TOOLS” container and a series of “Project” containers. Each project container will have a catkin workspace in it, while ROS-TOOLS will have mapviz, rviz, roswtf, and gazebo. This should minimize any rosdep conflicts that may arise between my projects.
Some ROS tools require GPU acceleration to function, like viewing lidar data in rviz.
While I’ve been successful in getting GPU pass-through to work and getting a CUDA program to execute correctly, I’ve been unable to get OpenGL to work.
System Information
OS: Linux Mint 18.3 Sylvia
LXD Version: Client 3.18, Server 3.18
Which lxc: /snap/bin/lxc
How I’m initializing my container instance
lxc launch --profile default --profile gui --profile nvidia ubuntu:16.04 gui
Profiles
- default

$ lxc profile show default
config: {}
description: Default LXD profile
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: default
used_by:
- /1.0/containers/gui
- gui

$ lxc profile show gui
config:
  environment.DISPLAY: :0
  raw.idmap: both 1000 1000
  user.user-data: |
    #cloud-config
    package_upgrade: true
    packages:
      - x11-apps
      - mesa-utils
description: Enables X forwarding to host
devices:
  X0:
    path: /tmp/.X11-unix
    source: /tmp/.X11-unix
    type: disk
name: gui
used_by:
- /1.0/containers/gui
- nvidia

$ lxc profile show nvidia
config:
  nvidia.driver.capabilities: graphics, compute, display, utility, video
  nvidia.runtime: "true"
description: Enables GPU pass-through for container
devices:
  Quadro-M100M:
    pci: "0000:01:00.0"
    type: gpu
name: nvidia
used_by:
- /1.0/containers/gui
Permissions
$ cat /etc/subuid
avondollen:1000000:1000
avondollen:100000:65536
avondollen:1000:1
lxd:231072:65536
root:1000:1
root:231072:65536
$ cat /etc/subgid
avondollen:1000000:1000
avondollen:100000:65536
avondollen:1000:1
lxd:231072:65536
root:1000:1
root:231072:65536
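For context on why the root:1000:1 entries are there: the raw.idmap: both 1000 1000 line in the gui profile asks LXD to map uid/gid 1000 in the container directly onto uid/gid 1000 on the host, so the container user can talk to the shared /tmp/.X11-unix socket. As I understand it (this is my reading of the subuid(5) format, not something from the LXD docs), each line is name:first_id:count, and a quick sketch of how one entry reads:

```shell
# Each /etc/subuid line is "name:first_id:count"; the entry below is what
# lets root (i.e. the LXD daemon) delegate host uid 1000 into a container.
entry="root:1000:1"
name=${entry%%:*}
rest=${entry#*:}
first=${rest%%:*}
count=${rest#*:}
echo "$name may map host uids $first..$((first + count - 1))"
# prints: root may map host uids 1000..1000
```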
What Works
$ lxc exec gui -- nvidia-smi
Fri Nov 22 21:30:39 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro M1000M Off | 00000000:01:00.0 On | N/A |
| N/A 36C P8 N/A / N/A | 742MiB / 2002MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
What Doesn’t Work
$ lxc exec gui -- sudo --login --user ubuntu glxinfo -B
name of display: :0
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast
Error: couldn't find RGB GLX visual or fbconfig
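The “failed to load driver: swrast” line suggests libGL is falling back to the Mesa software rasterizer. To narrow that down, this is the kind of check I’d run inside the container (a sketch, assuming ldconfig is available there; libGLX_nvidia.so.0 and libEGL_nvidia.so.0 are the GLVND vendor libraries I’d expect the NVIDIA runtime to provide):

```shell
# Report which GL-related libraries the dynamic linker can actually
# resolve. If only Mesa's libGL.so.1 shows up and libGLX_nvidia.so.0 is
# missing, a GLX app like glxinfo will fall back to swrast.
for lib in libGL.so.1 libGLX_nvidia.so.0 libEGL_nvidia.so.0; do
    if ldconfig -p 2>/dev/null | grep -q "$lib"; then
        echo "found:   $lib"
    else
        echo "missing: $lib"
    fi
done
```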
Investigation…
Okay, so we know the NVIDIA driver is present, since nvidia-smi works.
Let’s check whether the NVIDIA OpenGL library is being properly linked.
avondollen@Host ~ $ find /usr -iname "*libGL.so*" -exec ls -l -- {} +
lrwxrwxrwx 1 root root 10 Sep 25 18:53 /usr/lib32/nvidia-418/libGL.so -> libGL.so.1
lrwxrwxrwx 1 root root 18 Sep 25 18:53 /usr/lib32/nvidia-418/libGL.so.1 -> libGL.so.418.87.01
-rw-r--r-- 1 root root 1275664 Sep 25 01:36 /usr/lib32/nvidia-418/libGL.so.418.87.01
lrwxrwxrwx 1 root root 14 Jun 14 2018 /usr/lib/i386-linux-gnu/mesa/libGL.so.1 -> libGL.so.1.2.0
-rw-r--r-- 1 root root 457256 Jun 14 2018 /usr/lib/i386-linux-gnu/mesa/libGL.so.1.2.0
lrwxrwxrwx 1 root root 10 Sep 25 18:53 /usr/lib/nvidia-418/libGL.so -> libGL.so.1
lrwxrwxrwx 1 root root 18 Sep 25 18:53 /usr/lib/nvidia-418/libGL.so.1 -> libGL.so.418.87.01
-rw-r--r-- 1 root root 1275664 Sep 25 01:36 /usr/lib/nvidia-418/libGL.so.418.87.01
lrwxrwxrwx 1 root root 13 Jun 14 2018 /usr/lib/x86_64-linux-gnu/libGL.so -> mesa/libGL.so
lrwxrwxrwx 1 root root 14 Jun 14 2018 /usr/lib/x86_64-linux-gnu/mesa/libGL.so -> libGL.so.1.2.0
lrwxrwxrwx 1 root root 14 Jun 14 2018 /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1 -> libGL.so.1.2.0
-rw-r--r-- 1 root root 471680 Jun 14 2018 /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1.2.0
$ lxc exec gui -- find /usr -iname "*libGL.so*" -exec ls -l -- {} +
lrwxrwxrwx 1 root root 14 Jun 14 2018 /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1 -> libGL.so.1.2.0
-rw-r--r-- 1 root root 471680 Jun 14 2018 /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1.2.0
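Before concluding the libraries are simply absent, it may be worth checking whether the runtime hook bind-mounted anything at all, since mounts added at container start-up show up in /proc/mounts even if a find over /usr misses the paths I expected. A hedged sketch (run inside the container; on a machine without the NVIDIA runtime it just prints the fallback line):

```shell
# List any NVIDIA-related bind mounts the nvidia.runtime hook set up.
grep -i nvidia /proc/mounts || echo "no nvidia bind mounts visible"
```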
Hmm… I may be mistaken, but it appears that the container does not have access to the NVIDIA OpenGL libraries, which is odd considering that I have the “graphics” flag set in nvidia.driver.capabilities…
Looking at the nvidia-container-runtime documentation, I see that the “graphics” flag is indeed what enables running OpenGL and Vulkan applications.
Considering that the “compute” and “utility” flags seem to work (nvidia-smi and CUDA applications run inside my container), I feel there may be either a bug, or the graphics flag is just not supported in LXD as of now.
I’d greatly appreciate any help in clearing this up.