GUI/X11 containers not working after GPU swap

Hi, this is my first post. I discovered LXD and LXC containers two years ago and have been a happy user ever since, promoting them among my colleagues. Great job, and thank you for creating and maintaining them! I have been able to find solutions to all minor problems so far, but now I am really lost. Please help.

I am running Ubuntu 18.04 on the host machine. My GUI profile is built following this post and so far it has worked great.

My graphics card died today; it was a GTX 480 running with NVidia's 390.x legacy driver. As a quick replacement I found a GTX 280 at home, but that required me to downgrade the driver to 340.x. Since then, the GUI-enabled containers won't start. Those using only the default profile work fine. I also tried to launch a new container using the X11 profile, but it does not even start downloading the image; it stops immediately with the same error.

Creating gtx280
Starting gtx280
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart gtx280 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/gtx280/lxc.conf: 
Try `lxc info --show-log local:gtx280` for more info

Running lxc info --show-log local:gtx280 gives:

Name: gtx280
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/03/29 18:02 UTC
Status: Stopped
Type: container
Profiles: default, x11

Log:

lxc gtx280 20200329180234.355 ERROR    cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1136 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.monitor.gtx280"
lxc gtx280 20200329180234.356 ERROR    cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1136 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.payload.gtx280"
lxc gtx280 20200329180234.358 ERROR    utils - utils.c:lxc_can_use_pidfd:1834 - Kernel does not support pidfds
lxc gtx280 20200329180235.357 ERROR    conf - conf.c:run_buffer:323 - Script exited with status 1
lxc gtx280 20200329180235.357 ERROR    conf - conf.c:lxc_setup:3373 - Failed to run mount hooks
lxc gtx280 20200329180235.357 ERROR    start - start.c:do_start:1232 - Failed to setup container "gtx280"
lxc gtx280 20200329180235.357 ERROR    sync - sync.c:__sync_wait:41 - An error occurred in another process (expected sequence number 5)
lxc gtx280 20200329180235.359 WARN     network - network.c:lxc_delete_network_priv:3213 - Failed to rename interface with index 0 from "eth0" to its initial name "veth830970ab"
lxc gtx280 20200329180235.359 ERROR    start - start.c:__lxc_start:1947 - Failed to spawn container "gtx280"
lxc gtx280 20200329180235.359 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:852 - Received container state "ABORTING" instead of "RUNNING"
lxc gtx280 20200329180235.359 WARN     start - start.c:lxc_abort:1030 - No such process - Failed to send SIGKILL to 4193
lxc 20200329180235.476 WARN     commands - commands.c:lxc_cmd_rsp_recv:122 - Connection reset by peer - Failed to receive response for command "get_state"

I suppose I somehow need to tell LXD that the graphics driver changed. Can you please tell me how to accomplish that?

Just to be sure: the 340.x driver is supported, isn't it?

By the way, installing libpam-cgfs did not help.

Hi!

The error messages do not indicate a GPU-specific issue, which is really weird.
It looks as if something else is wrong.

Looking through the logs, it appears you are running LXD 3.0.x (DEB package), preinstalled with Ubuntu 18.04. Is that the case?

You mention that lxc launch ubuntu:18.04 mycontainer works, but lxc launch ubuntu:18.04 mycontainer --profile default --profile x11 gives the above errors.
Can you please

  1. Create a copy of the x11 profile (lxc profile copy x11 mytest).
  2. Start removing configuration from the mytest profile, item by item, and test by creating a container with lxc launch ubuntu:18.04 mycontainer --ephemeral --profile default --profile mytest. First remove the mygpu device, then remove the nvidia.runtime setting.
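Assuming your x11 profile matches the one from the post you followed (a mygpu GPU device plus nvidia.runtime set to true), the bisection could look roughly like this (container names t1/t2 are placeholders):

```shell
# Work on a copy so the original x11 profile stays intact
lxc profile copy x11 mytest

# Step 1: remove the GPU device from the copy and test
lxc profile device remove mytest mygpu
lxc launch ubuntu:18.04 t1 --ephemeral --profile default --profile mytest

# Step 2: also disable the NVidia runtime and test again
lxc profile set mytest nvidia.runtime false
lxc launch ubuntu:18.04 t2 --ephemeral --profile default --profile mytest
```

The --ephemeral flag makes the test containers delete themselves on stop, so they leave nothing behind.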

Thank you for the quick answer!

For the first part: I am running the LXD 3.23 (rev 14066) stable snap package, not the preinstalled DEB.

I will now try the removal as you suggested…

It launches a new container if I also remove nvidia.runtime. If I only remove the mygpu device, it fails just as before:

Name: gtx280
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/03/29 19:52 UTC
Status: Stopped
Type: container
Profiles: default, x11gtx280

Log:

lxc gtx280 20200329195248.828 ERROR    cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1136 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.monitor.gtx280"
lxc gtx280 20200329195248.829 ERROR    cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1136 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.payload.gtx280"
lxc gtx280 20200329195248.831 ERROR    utils - utils.c:lxc_can_use_pidfd:1834 - Kernel does not support pidfds
lxc gtx280 20200329195249.754 ERROR    conf - conf.c:run_buffer:323 - Script exited with status 1
lxc gtx280 20200329195249.754 ERROR    conf - conf.c:lxc_setup:3373 - Failed to run mount hooks
lxc gtx280 20200329195249.754 ERROR    start - start.c:do_start:1232 - Failed to setup container "gtx280"
lxc gtx280 20200329195249.754 ERROR    sync - sync.c:__sync_wait:41 - An error occurred in another process (expected sequence number 5)
lxc gtx280 20200329195249.758 WARN     network - network.c:lxc_delete_network_priv:3213 - Failed to rename interface with index 0 from "eth0" to its initial name "veth89cf4e96"
lxc gtx280 20200329195249.758 ERROR    start - start.c:__lxc_start:1947 - Failed to spawn container "gtx280"
lxc gtx280 20200329195249.758 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:852 - Received container state "ABORTING" instead of "RUNNING"
lxc gtx280 20200329195249.758 WARN     start - start.c:lxc_abort:1030 - No such process - Failed to send SIGKILL to 11725
lxc 20200329195249.886 WARN     commands - commands.c:lxc_cmd_rsp_recv:122 - Connection reset by peer - Failed to receive response for command "get_state"

Next, I tried the following:

I took the original x11 profile and only changed nvidia.runtime to false, keeping the mygpu device. It successfully launches a new container, and running xclock inside shows the clock. Perhaps it is an issue with the legacy 340 driver after all?

Update: Disabling nvidia.runtime in my old X11 profile allowed me to run all the containers as before. I didn't try GPGPU, but I bet it wouldn't work.
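For reference, this is all it took (assuming the profile is named x11, as in my setup):

```shell
# Turn the NVidia runtime off in the profile, then start the container again
lxc profile set x11 nvidia.runtime false
lxc start gtx280
```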

@simos Thank you very much for your help!

Without the NVidia runtime, you should be able to run programs like xclock (a pure X11 application) but not glxinfo/glxgears (which require the OpenGL libraries). Can you test this?

If that is the case, then I suppose the problem relates to nvidia.runtime not working. LXD shares a set of files (the runtime) from the host to the container. It might be worth a bug report, but it will take some effort to untangle.

If you require hardware acceleration in the container and nvidia.runtime does not work, you can try doing it the old way: identify the exact NVidia package on the host and install it in the container as well, for example nvidia-driver-390. This installs the whole lot (including the kernel driver, which is unusable in a container), but you get the runtime libraries that you need.
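A sketch of that approach (the container name mycontainer is a placeholder, and the package name should match whatever the host actually uses, nvidia-340 in your case):

```shell
# Find the driver package installed on the host
dpkg -l | grep -i nvidia

# Install the same driver inside the container to get the userspace GL libraries
lxc exec mycontainer -- apt-get update
lxc exec mycontainer -- apt-get install -y nvidia-340
```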

Yes, I can confirm that glxinfo and glxgears do not run:

name of display: :0
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast
X Error of failed request:  GLXBadContext
  Major opcode of failed request:  154 (GLX)
  Minor opcode of failed request:  6 (X_GLXIsDirect)
  Serial number of failed request:  56
  Current serial number in output stream:  55

It seems I haven't needed OpenGL in any of the containers so far, but I will run into that issue in a few days. I will try installing the driver inside the container and let you know. Should I then re-enable nvidia.runtime as well? I guess not.

I am trying to get OpenGL in the containers running again, but no success yet.

  • I was able to re-enable nvidia.runtime in the profile, but to do so I had to remove the display property rather than everything. All the other settings work; only display seems to be the problem. However, glxinfo needs the display.
  • I switched nvidia.runtime off again and installed the driver inside the container, but it had no effect on the glxinfo error. Still no OpenGL.

What would be the next steps to investigate the problem, please? (I am far from a Linux pro.)

I had a look at the reported issues on LXD, and I think the issue you are facing is similar to https://github.com/lxc/lxd/issues/6262

Specifically, how did you install the NVidia 340 driver? I think Ubuntu 18.04 does not provide such a driver.
If it is not provided officially by Ubuntu, then that is the issue.

Thank you for taking care of my struggle!

I installed the driver in the most straightforward way: sudo apt install nvidia-340. It is included in the bionic repository. I should note that due to the GPU swap I had to uninstall nvidia-390 first. To clarify the timeline:
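One way to check where a driver package comes from, in case it matters:

```shell
# Show the installed/candidate versions and the repositories offering them;
# the repository lines reveal whether the package comes from the official
# Ubuntu archive or from a third-party PPA
apt-cache policy nvidia-340
```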

  • a year ago I installed LXD, with the GTX 480 inside and the nvidia-390 driver running, and created all the containers I am using
  • the 4xx generation is the oldest supported by nvidia-390
  • after the GTX 480 died and I replaced it with an even older GTX 280, I also had to remove the nvidia-390 driver and install nvidia-340 to get it running

I looked at the issue you referenced. The output of lxc info --resources looks normal. What might have been wrong: dpkg -l | grep nvidia showed two relics of the 390 driver. I purged them and am going to reboot now to see if that helped.
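The cleanup looked roughly like this (the exact package names came from the dpkg listing, so the glob below is only an approximation):

```shell
# List NVidia packages still installed on the host
dpkg -l | grep -i nvidia

# Purge the leftover 390 packages reported above, then reboot
sudo apt-get purge 'nvidia-390*'
sudo reboot
```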

Update: It did not help :frowning:

When you install nvidia-340 in the container, you may need to rename one of the shared GL libraries so that the container uses the one from NVidia rather than the one from Mesa. I do not have the details handy; you would probably need to look through the forum's old posts.

Having said that, it would be more useful for the community to figure out what is really going on: whether nvidia-340 actually works with nvidia.runtime at all, or whether LXD is caching some files from the host's initial GPU driver, so that when you switch drivers some remnants stay behind.

If you can start fresh with LXD + nvidia-340 (for example, install Ubuntu + LXD + nvidia-340 on a new disk), then you can be sure whether this combination is supposed to work at all.

In any case, you want nvidia.runtime to work properly in LXD since you are using a supported Bionic driver.

You can file an issue at https://github.com/lxc/lxd/issues

I can definitely confirm that it is the driver. I have an older machine at home, with a GPU one generation older still, which also runs 18.04 Server with the nvidia-340 driver. It had the default LXD pre-installed, so I switched to the most recent snap and ran the tests. I got exactly the same error.

I will create an issue on GitHub tomorrow. Thank you very much for your help.