Catastrophic Failure (failed to run /snap/lxd ... forkstart) with nvidia.runtime: "true"

Good evening:
I was fooling around with a profile (to configure a routed container) and my LXD instance crashed. No worries, except that now none of my containers start after the restart. They all fail with the following:

Error: Failed to run: /snap/lxd/current/bin/lxd forkstart InternetSuite /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/InternetSuite/lxc.conf: 
Try `lxc info --show-log InternetSuite` for more info

lxc info --show-log InternetSuite shows:

Name: InternetSuite
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/01/10 08:42 EST
Last Used: 2022/02/15 18:37 EST

Log:

lxc InternetSuite 20220215233731.117 WARN     conf - conf.c:lxc_map_ids:3588 - newuidmap binary is missing
lxc InternetSuite 20220215233731.117 WARN     conf - conf.c:lxc_map_ids:3594 - newgidmap binary is missing
lxc InternetSuite 20220215233731.117 WARN     conf - conf.c:lxc_map_ids:3588 - newuidmap binary is missing
lxc InternetSuite 20220215233731.117 WARN     conf - conf.c:lxc_map_ids:3594 - newgidmap binary is missing
lxc InternetSuite 20220215233731.287 ERROR    conf - conf.c:run_buffer:321 - Script exited with status 1
lxc InternetSuite 20220215233731.287 ERROR    conf - conf.c:lxc_setup:4395 - Failed to run mount hooks
lxc InternetSuite 20220215233731.287 ERROR    start - start.c:do_start:1275 - Failed to setup container "InternetSuite"
lxc InternetSuite 20220215233731.287 ERROR    sync - sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc InternetSuite 20220215233731.292 WARN     network - network.c:lxc_delete_network_priv:3617 - Failed to rename interface with index 0 from "eth0" to its initial name "veth5e26a6a5"
lxc InternetSuite 20220215233731.292 ERROR    start - start.c:__lxc_start:2074 - Failed to spawn container "InternetSuite"
lxc InternetSuite 20220215233731.292 WARN     start - start.c:lxc_abort:1039 - No such process - Failed to send SIGKILL via pidfd 17 for process 13029
lxc InternetSuite 20220215233731.292 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:877 - Received container state "ABORTING" instead of "RUNNING"
lxc InternetSuite 20220215233736.368 WARN     conf - conf.c:lxc_map_ids:3588 - newuidmap binary is missing
lxc InternetSuite 20220215233736.368 WARN     conf - conf.c:lxc_map_ids:3594 - newgidmap binary is missing
lxc 20220215233736.404 ERROR    af_unix - af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20220215233736.404 ERROR    commands - commands.c:lxc_cmd_rsp_recv_fds:127 - Failed to receive file descriptors for command "get_state"

Here is another container:

lxc start steam                  
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart steam /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/steam/lxc.conf: 
Try `lxc info --show-log steam` for more info

lxc info --show-log steam shows:

Name: steam
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2021/12/11 20:28 EST
Last Used: 2022/02/15 19:03 EST

Log:

lxc steam 20220216000358.981 WARN     conf - conf.c:lxc_map_ids:3588 - newuidmap binary is missing
lxc steam 20220216000358.981 WARN     conf - conf.c:lxc_map_ids:3594 - newgidmap binary is missing
lxc steam 20220216000358.981 WARN     conf - conf.c:lxc_map_ids:3588 - newuidmap binary is missing
lxc steam 20220216000358.981 WARN     conf - conf.c:lxc_map_ids:3594 - newgidmap binary is missing
lxc steam 20220216000359.134 ERROR    conf - conf.c:run_buffer:321 - Script exited with status 1
lxc steam 20220216000359.134 ERROR    conf - conf.c:lxc_setup:4395 - Failed to run mount hooks
lxc steam 20220216000359.134 ERROR    start - start.c:do_start:1275 - Failed to setup container "steam"
lxc steam 20220216000359.134 ERROR    sync - sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc steam 20220216000359.139 WARN     network - network.c:lxc_delete_network_priv:3617 - Failed to rename interface with index 0 from "eth0" to its initial name "veth8f771e95"
lxc steam 20220216000359.140 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:877 - Received container state "ABORTING" instead of "RUNNING"
lxc steam 20220216000359.140 ERROR    start - start.c:__lxc_start:2074 - Failed to spawn container "steam"
lxc steam 20220216000359.140 WARN     start - start.c:lxc_abort:1039 - No such process - Failed to send SIGKILL via pidfd 17 for process 36316
lxc steam 20220216000404.240 WARN     conf - conf.c:lxc_map_ids:3588 - newuidmap binary is missing
lxc steam 20220216000404.240 WARN     conf - conf.c:lxc_map_ids:3594 - newgidmap binary is missing
lxc 20220216000404.272 ERROR    af_unix - af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20220216000404.272 ERROR    commands - commands.c:lxc_cmd_rsp_recv_fds:127 - Failed to receive file descriptors for command "get_state"
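
Both logs are mostly repeated WARN lines, so it can help to filter them down to the ERROR entries before reading. A minimal sketch, using a few lines copied from the InternetSuite log above as sample input; on a live system the same grep could presumably be fed from `lxc info --show-log <name>` instead of the here-document:

```shell
# Sample input: lines copied from the InternetSuite log in this thread.
log=$(cat <<'EOF'
lxc InternetSuite 20220215233731.117 WARN     conf - conf.c:lxc_map_ids:3588 - newuidmap binary is missing
lxc InternetSuite 20220215233731.287 ERROR    conf - conf.c:run_buffer:321 - Script exited with status 1
lxc InternetSuite 20220215233731.287 ERROR    conf - conf.c:lxc_setup:4395 - Failed to run mount hooks
lxc InternetSuite 20220215233731.287 ERROR    start - start.c:do_start:1275 - Failed to setup container "InternetSuite"
EOF
)

# Keep only the ERROR entries and drop the WARN noise.
errors=$(printf '%s\n' "$log" | grep ' ERROR ')
printf '%s\n' "$errors"

# The earliest ERROR entry is often the closest to the root cause.
first_error=$(printf '%s\n' "$errors" | head -n 1)
```

In both containers the earliest errors point at a failing mount hook script rather than at the network or ID-mapping warnings.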

Any help would be greatly appreciated :confused:

Anything useful in /var/snap/lxd/common/lxd/logs/lxd.log?

Nothing of particular note. However, both containers for which I provided the error information above have nvidia.runtime = true set.
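
To check whether the failures line up with that setting across all containers, something like the following could be used. This is only a sketch: the function name `nvidia_runtime_instances` is made up for illustration, it assumes the `lxc` client is on PATH, and it only looks at the key set directly on each instance, so a value inherited from a profile may not show up this way.

```shell
# Print the name of every instance whose own config has nvidia.runtime=true.
# nvidia_runtime_instances is a hypothetical helper name, not an LXD command.
nvidia_runtime_instances() {
    # "lxc list --format csv -c n" prints one instance name per line.
    lxc list --format csv -c n | while IFS= read -r name; do
        if [ "$(lxc config get "$name" nvidia.runtime)" = "true" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Usage:
#   nvidia_runtime_instances
```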

Ah, it could be the same issue as in the "Nvidia.runtime enabled containers suddenly not able to start" thread then.

That is what I am thinking. I am able to start non-x11 containers (though, admittedly, they are new from scratch, as I took this as an opportunity to optimize my routed profiles to include host_name, which seems nigh impossible to add after creation of the container).

It's off topic, but using profiles to apply a per-NIC routed host_name doesn’t seem appropriate, given that the purpose of profiles is to apply the same config to multiple instances, and by definition the host_name setting needs to be unique per instance per NIC to avoid conflicts on the host.

Thomas, oh I am quite aware haha … I was walking my dogs last night, contemplating my LXD predicament, and thought to myself “you know, I am not really using profiles correctly” (for instance, my plex server profile could not be used for any other container). However, it's what finally worked for me vs. mucking around with netplan after the fact and getting double NICs and routing failures (the cause of which I have yet to figure out).

So I hear you, but I am at an “if it ain't broke, don’t fix it” stage for this particular issue.

Fair enough :slight_smile:
Well, for future reference, you can just do lxc config device override <instance> <device> setting=value... and it will copy the device from the profile into the instance and modify it with the settings you want to override. After that you can use lxc config device set <instance> <device> setting=value to change them on the instance later.
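
As a rough illustration of those two commands applied to the routed host_name case, here is a sketch. The instance name is taken from this thread, but the device name `eth0`, the interface name `veth-inet`, and the two helper function names are assumptions made up for the example, so substitute whatever your profile actually defines.

```shell
# Hypothetical values for illustration only.
instance="InternetSuite"
device="eth0"   # assumed name of the routed NIC device defined in the profile

# First change: copy the profile's device into the instance and override host_name.
set_routed_host_name() {
    lxc config device override "$instance" "$device" host_name="$1"
}

# Later changes: the device now lives on the instance, so "device set" applies.
change_routed_host_name() {
    lxc config device set "$instance" "$device" host_name="$1"
}

# Usage:
#   set_routed_host_name veth-inet
#   change_routed_host_name veth-inet2
```

This keeps the shared profile generic while each instance carries its own unique host_name.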

Awesome; thank you!

Can verify that sudo snap refresh --channel=stable lxd has cured my nvidia.runtime = true woes and, subsequently, the issues in this thread. I am marking this as the “solution”.
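
For anyone landing here with the same symptoms, the refresh-and-retry sequence could be wrapped up roughly like this. It is only a sketch: `refresh_and_start` is a made-up helper name, and the container names passed to it are the ones from this thread.

```shell
# Refresh the LXD snap from the stable channel, then try each named container.
# refresh_and_start is a hypothetical helper, not part of LXD or snapd.
refresh_and_start() {
    sudo snap refresh --channel=stable lxd || return 1
    for name in "$@"; do
        # Report containers that still fail so their logs can be checked next.
        lxc start "$name" || printf 'still failing: %s\n' "$name" >&2
    done
}

# Usage:
#   refresh_and_start InternetSuite steam
```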

Thank you @tomp and @stgraber !
