Got "Script exited with status 1" when starting a container with "nvidia.runtime: true"

runapp · November 17, 2021, 7:29am

I’ve installed nvidia-container-tools and libnvidia-container, and nvidia-container-cli info shows the GPU correctly. I also installed lxc 4.0.5 to get the nvidia hook (/usr/share/lxc/hooks/nvidia).

However, when I start a container with nvidia.runtime: true, it fails to start. The output of lxc info --show-log nvtest shows these errors:

ERROR    conf - conf.c:run_buffer:324 - Script exited with status 1
ERROR    conf - conf.c:lxc_setup:3374 - Failed to run mount hooks
ERROR    start - start.c:do_start:1218 - Failed to setup container "nvtest"
ERROR    sync - sync.c:__sync_wait:36 - An error occurred in another process (expected sequence number 5)
WARN     network - network.c:lxc_delete_network_priv:3185 - Failed to rename interface with index 3 from "eth0" to its initial name "vethe1e468a6"
ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:860 - Received container state "ABORTING" instead of "RUNNING"
ERROR    start - start.c:__lxc_start:1999 - Failed to spawn container "nvtest"
WARN     start - start.c:lxc_abort:1018 - No such process - Failed to send SIGKILL to 45644

I’ve confirmed that the container starts when nvidia.runtime is set to false. I also ran /usr/share/lxc/hooks/nvidia directly under bash, and it exited with code 0.

Is there any way to get further details on why the hook failed?
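
One way to see what the hook prints might be to run it through a small wrapper that saves its output and exit status. The sketch below is a generic POSIX shell helper, not part of LXC or LXD; the name run_logged and any log path used with it are made up for illustration. Keep in mind that LXC documents environment variables it sets for hooks, such as LXC_HOOK_TYPE and LXC_ROOTFS_MOUNT, so a bare run under bash may not exercise the same code path as a real container start.

```shell
#!/bin/sh
# run_logged LOGFILE CMD [ARGS...]
# Hypothetical helper: runs CMD, appends its stdout and stderr to LOGFILE,
# records the exit status there as well, and returns that status.
run_logged() {
    logfile=$1
    shift
    "$@" >>"$logfile" 2>&1
    status=$?
    printf 'exit status: %d (command: %s)\n' "$status" "$*" >>"$logfile"
    return "$status"
}
```

For example, run_logged /tmp/nvidia-hook.log /usr/share/lxc/hooks/nvidia could be called after exporting whichever of those variables you want to simulate. The values to export are an assumption that would need checking against the actual setup.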

The container has only one NIC device. All other config is at the defaults (except nvidia.runtime, of course).

PS: lxc info --resources and nvidia-smi both show the GPU correctly, and a small PyTorch example runs on the host.