Nvidia.runtime enabled containers suddenly not able to start

I am running several GUI-enabled containers following the famous guide by simos. For almost a year I had no problems running GUI apps or headless CUDA programs, but today none of the nvidia.runtime-enabled containers will start. I have not run a manual apt upgrade recently, so perhaps the LXD snap was updated in the background. The error log from lxc info --show-log for my container, called cuda, shows the following:

lxc cuda 20220215163906.960 WARN     conf - conf.c:lxc_map_ids:3588 - newuidmap binary is missing
lxc cuda 20220215163906.960 WARN     conf - conf.c:lxc_map_ids:3594 - newgidmap binary is missing
lxc cuda 20220215163906.966 WARN     conf - conf.c:lxc_map_ids:3588 - newuidmap binary is missing
lxc cuda 20220215163906.966 WARN     conf - conf.c:lxc_map_ids:3594 - newgidmap binary is missing
lxc cuda 20220215163906.969 WARN     cgfsng - cgroups/cgfsng.c:fchowmodat:1251 - No such file or directory - Failed to fchownat(42, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc cuda 20220215163906.227 ERROR    conf - conf.c:run_buffer:321 - Script exited with status 1
lxc cuda 20220215163906.227 ERROR    conf - conf.c:lxc_setup:4395 - Failed to run mount hooks
lxc cuda 20220215163906.227 ERROR    start - start.c:do_start:1275 - Failed to setup container "cuda"
lxc cuda 20220215163906.227 ERROR    sync - sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc cuda 20220215163906.231 WARN     network - network.c:lxc_delete_network_priv:3617 - Failed to rename interface with index 0 from "eth0" to its initial name "veth5b1a67f5"
lxc cuda 20220215163906.231 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:877 - Received container state "ABORTING" instead of "RUNNING"
lxc cuda 20220215163906.231 ERROR    start - start.c:__lxc_start:2074 - Failed to spawn container "cuda"
lxc cuda 20220215163906.231 WARN     start - start.c:lxc_abort:1039 - No such process - Failed to send SIGKILL via pidfd 43 for process 20297
lxc cuda 20220215163911.317 WARN     conf - conf.c:lxc_map_ids:3588 - newuidmap binary is missing
lxc cuda 20220215163911.317 WARN     conf - conf.c:lxc_map_ids:3594 - newgidmap binary is missing
lxc 20220215163911.354 ERROR    af_unix - af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20220215163911.354 ERROR    commands - commands.c:lxc_cmd_rsp_recv_fds:127 - Failed to receive file descriptors for command "get_state"
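
When reading a log like the one above, it can help to look at the ERROR lines on their own, since the first one (the mount hook script exiting with status 1) is where the start fails. The following is a minimal sketch, assuming the log level appears as a separate word as it does above; errors_only is a hypothetical helper name, not part of lxc.

```shell
# Hypothetical helper: keep only ERROR-level lines from an LXC log read on stdin.
# Assumes the level is a standalone word, as in the log shown above.
errors_only() {
    grep -w 'ERROR'
}

# Usage (needs a working lxc client and a container named "cuda"):
#   lxc info --show-log cuda | errors_only
```

Note that grep exits with a non-zero status when no ERROR line is found, which matters if the helper is used in a script running with set -e.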

My LXD version (snap) is 4.23 and the NVIDIA driver is 470.103.01, all running on Ubuntu 20.04.3 with the Linux 5.13.0-28-generic x86_64 kernel.
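
For anyone filing a similar report, the same version details can be collected in one go. This is only a sketch: show_versions is a hypothetical helper name, and each tool is skipped if it is not installed on the host.

```shell
# Hypothetical helper: print the versions relevant to an nvidia.runtime report.
show_versions() {
    if command -v snap >/dev/null 2>&1; then
        # LXD snap version, revision and tracking channel
        snap list lxd 2>/dev/null || echo "lxd snap not found"
    fi
    if command -v nvidia-smi >/dev/null 2>&1; then
        # NVIDIA driver version, one line per GPU
        nvidia-smi --query-gpu=driver_version --format=csv,noheader ||
            echo "nvidia-smi could not query the driver"
    fi
    if command -v lsb_release >/dev/null 2>&1; then
        lsb_release -ds    # distribution description
    fi
    uname -srm             # kernel name, release and architecture
}
```

Run it on the host with show_versions and paste the output into the report.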

I would be glad for any advice on how to fix this.

nvidia-container was updated to a new version with 4.23, so I wonder if that’s causing the issue. I’ll try to replicate your issue here and if that’s the problem, either find a workaround or we’ll just plain revert to the older nvidia-container version.

Pushed a tentative fix, if it works, we’ll start the rollout to stable in a couple of hours.

Should I back up my containers before testing it?

Nope, I’ll do the testing on my side once the build is complete (takes 30-45min) and if it’s good, I’ll do the push to stable, nothing for you to do really.

The issue is that the newer nvidia-container uses an additional library which we were missing, so hopefully it will be a simple fix.

OK, thank you very much for the quick fix! I will let you know if there are still issues tomorrow, otherwise I will just mark it as solved.

Fix confirmed to work. Waiting for the arm64 build to be done, then I will kick off the progressive rollout to stable. It can take up to 48h to hit everyone though.

Rollout in progress, currently at 50% of users so there’s a good chance you’ll get it.

Can verify sudo snap refresh --channel=stable lxd has cured my nvidia.runtime = true woes
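
The refresh-and-retry step reported above can be sketched as a small function. The names refresh_and_retry and RUN are hypothetical: RUN is a dry-run switch added here for illustration, and the container name defaults to cuda from the original post. Since the rollout is progressive, the refresh may not pick up the fixed build straight away.

```shell
# Sketch: refresh the LXD snap from the stable channel, then retry the container.
# Pass a container name as the first argument; it defaults to "cuda".
refresh_and_retry() {
    container="${1:-cuda}"
    # RUN is a hypothetical dry-run switch: with RUN=echo the commands are
    # printed instead of executed.
    ${RUN:-} sudo snap refresh --channel=stable lxd &&
        ${RUN:-} lxc start "$container" &&
        ${RUN:-} lxc info "$container"
}
```

Setting RUN=echo before calling refresh_and_retry prints the three commands without touching the system, which is a cheap way to check them first.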

Confirming as well, thank you very much!