GPU containers do not start

Hi, my snaps refreshed today, and LXD is now at 5.12-3564c08 (rev 24615) from latest/candidate. There is a problem again with GPU-enabled containers: they fail to start, with errors like:

lxc mycontainer 20230317071925.749 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc mycontainer 20230317071925.749 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc mycontainer 20230317071925.750 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc mycontainer 20230317071925.750 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3627 - newgidmap binary is missing
lxc mycontainer 20230317071925.750 WARN     cgfsng - ../src/src/lxc/cgroups/cgfsng.c:fchowmodat:1619 - No such file or directory - Failed to fchownat(42, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc mycontainer 20230317071925.953 ERROR    conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 1
lxc mycontainer 20230317071925.953 ERROR    conf - ../src/src/lxc/conf.c:lxc_setup:4437 - Failed to run mount hooks
lxc mycontainer 20230317071925.953 ERROR    start - ../src/src/lxc/start.c:do_start:1272 - Failed to setup container "mycontainer"
lxc mycontainer 20230317071925.953 ERROR    sync - ../src/src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc mycontainer 20230317071925.956 WARN     network - ../src/src/lxc/network.c:lxc_delete_network_priv:3631 - Failed to rename interface with index 0 from "eth0" to its initial name "veth83025810"
lxc mycontainer 20230317071925.956 ERROR    lxccontainer - ../src/src/lxc/lxccontainer.c:wait_on_daemonized_start:878 - Received container state "ABORTING" instead of "RUNNING"
lxc mycontainer 20230317071925.956 ERROR    start - ../src/src/lxc/start.c:__lxc_start:2107 - Failed to spawn container "mycontainer"
lxc mycontainer 20230317071925.956 WARN     start - ../src/src/lxc/start.c:lxc_abort:1036 - No such process - Failed to send SIGKILL via pidfd 43 for process 9546
lxc 20230317071926.863 ERROR    af_unix - ../src/src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20230317071926.863 ERROR    commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_init_pid"
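
For anyone reading a log like this, the first ERROR line is often a good place to start; here it is the script failure that comes right before "Failed to run mount hooks". Below is a rough Python sketch for pulling that line out. The regex is only a guess at the line layout based on the lines above, not an official format, and `first_error` is a hypothetical helper name.

```python
import re

# Assumed layout, inferred from the log lines in this post (not an official spec):
#   lxc [container] <timestamp> <LEVEL> <subsystem> - <file:func:line> - <message>
LINE_RE = re.compile(
    r"^lxc(?: (?P<name>\S+))? (?P<ts>\d{14}\.\d{3}) (?P<level>[A-Z]+)\s+"
    r"(?P<subsystem>\S+) - (?P<location>\S+) - (?P<message>.*)$"
)


def first_error(log_text):
    """Return the parsed fields of the first ERROR line, or None if there is none."""
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERROR":
            return match.groupdict()
    return None


# Three lines copied from the log above.
sample = """\
lxc mycontainer 20230317071925.749 WARN     conf - ../src/src/lxc/conf.c:lxc_map_ids:3621 - newuidmap binary is missing
lxc mycontainer 20230317071925.953 ERROR    conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 1
lxc mycontainer 20230317071925.953 ERROR    conf - ../src/src/lxc/conf.c:lxc_setup:4437 - Failed to run mount hooks
"""

if __name__ == "__main__":
    err = first_error(sample)
    print(err["location"], "-", err["message"])
```

Run against the full log, this would report the `run_buffer` line, i.e. the hook script exiting with status 1, rather than the later follow-on errors.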

What OS are you running on the host?

LXD 5.12 (currently in candidate) is the release where we transition the snap's base from core20 (Ubuntu 20.04) to core22 (Ubuntu 22.04). In our tests we haven't seen any problems with the NVIDIA runtime on our GPUs, but those tests use an Ubuntu 22.04 host system.

As the NVIDIA runtime logic needs to load some libraries from your host system, it's possible that this now breaks on older distros.

I have Ubuntu 20.04.5 and the NVIDIA driver 470.161.03.

We’ll have to test with 20.04, but I suspect that may be the issue…