I’ve installed nvidia-container-tools and libnvidia-container. Running nvidia-container-cli info shows the GPU correctly. I also installed lxc 4.0.5 to get the nvidia hook (/usr/share/lxc/hooks/nvidia).
However, when I start a container with nvidia.runtime: true, it fails with the following error (output from lxc info --show-log nvtest):
ERROR conf - conf.c:run_buffer:324 - Script exited with status 1
ERROR conf - conf.c:lxc_setup:3374 - Failed to run mount hooks
ERROR start - start.c:do_start:1218 - Failed to setup container "nvtest"
ERROR sync - sync.c:__sync_wait:36 - An error occurred in another process (expected sequence number 5)
WARN network - network.c:lxc_delete_network_priv:3185 - Failed to rename interface with index 3 from "eth0" to its initial name "vethe1e468a6"
ERROR lxccontainer - lxccontainer.c:wait_on_daemonized_start:860 - Received container state "ABORTING" instead of "RUNNING"
ERROR start - start.c:__lxc_start:1999 - Failed to spawn container "nvtest"
WARN start - start.c:lxc_abort:1018 - No such process - Failed to send SIGKILL to 45644
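To make the failing steps easier to spot in a longer log, the ERROR lines can be filtered out on their own. This is only a minimal sketch: hook_errors is a hypothetical helper name (not part of lxc), and it assumes each log entry carries its level as a separate word, as in the output above.

```shell
# hook_errors: hypothetical helper, not an lxc command.
# Reads an LXC log on stdin and prints only the entries whose level is ERROR.
hook_errors() {
    # Match ERROR as a standalone word, with or without a timestamp prefix.
    # "|| true" keeps the exit status 0 when no ERROR line is present.
    grep -E '(^|[[:space:]])ERROR[[:space:]]' || true
}

# Usage (container name nvtest taken from the log above):
#   lxc info --show-log nvtest | hook_errors
```

On the log above, the first line this keeps is the run_buffer entry ("Script exited with status 1"), which is the mount hook failure the later errors follow from.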
I’ve confirmed that with nvidia.runtime set to false, the container starts. I also tried running /usr/share/lxc/hooks/nvidia directly under bash, and the exit code is 0.
Is there any way to get further details on what happened in the hook?
The container has only one NIC device. All other settings are defaults (except nvidia.runtime, of course).
PS: lxc info --resources and nvidia-smi both show the GPU correctly, and a small PyTorch example also runs on the host.