Container fails to start

My container fails to start; here is the log:

$ lxc info --show-log wangwei-container
Name: wangwei-container
Location: none
Remote: unix://
Architecture: x86_64
Created: 2019/08/30 09:37 UTC
Status: Stopped
Type: persistent
Profiles: default

Log:

lxc wangwei-container 20191004015455.431 WARN     cgfsng - cgroups/cgfsng.c:chowmod:1525 - No such file or directory - Failed to chown(/sys/fs/cgroup/unified//lxc.payload/wangwei-container/memory.oom.group, 655360, 0)
lxc wangwei-container 20191004015455.736 ERROR    conf - conf.c:run_buffer:352 - Script exited with status 1
lxc wangwei-container 20191004015455.736 ERROR    conf - conf.c:lxc_setup:3653 - Failed to run mount hooks
lxc wangwei-container 20191004015455.736 ERROR    start - start.c:do_start:1321 - Failed to setup container "wangwei-container"
lxc wangwei-container 20191004015455.736 ERROR    sync - sync.c:__sync_wait:61 - An error occurred in another process (expected sequence number 5)
lxc wangwei-container 20191004015455.736 WARN     network - network.c:lxc_delete_network_priv:3372 - Failed to rename interface with index 5 from "eth0" to its initial name "mac8e61d8a5"
lxc wangwei-container 20191004015455.736 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:872 - Received container state "ABORTING" instead of "RUNNING"
lxc wangwei-container 20191004015455.737 ERROR    start - start.c:__lxc_start:2036 - Failed to spawn container "wangwei-container"
lxc 20191004015456.332 WARN     commands - commands.c:lxc_cmd_rsp_recv:134 - Connection reset by peer - Failed to receive response for command "get_state"

And here is my profile:

config:
  nvidia.runtime: "true"
description: Default LXD profile
devices:
  eth0:
    name: eth0
    nictype: macvlan
    parent: enp5s0
    type: nic
  gpu:
    type: gpu
  root:
    path: /
    pool: default
    type: disk
name: default

And here is the output of lxc config show --expanded wangwei-container:

architecture: x86_64
config:
  nvidia.runtime: "true"
  raw.idmap: |-
    uid 1014 1000
    gid 1017 1000
  volatile.base_image: aaceca3b757faa89d9d633d5e4a24fba0fd258a9bfea0271bbab6313323efb32
  volatile.eth0.hwaddr: 00:16:3e:c6:4f:aa
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":1000},{"Isuid":true,"Isgid":false,"Hostid":1014,"Nsid":1000,"Maprange":1},{"Isuid":true,"Isgid":false,"Hostid":101001,"Nsid":1001,"Maprange":654359},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":1000},{"Isuid":false,"Isgid":true,"Hostid":1017,"Nsid":1000,"Maprange":1},{"Isuid":false,"Isgid":true,"Hostid":101001,"Nsid":1001,"Maprange":654359}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":1000},{"Isuid":true,"Isgid":false,"Hostid":1014,"Nsid":1000,"Maprange":1},{"Isuid":true,"Isgid":false,"Hostid":101001,"Nsid":1001,"Maprange":654359},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":1000},{"Isuid":false,"Isgid":true,"Hostid":1017,"Nsid":1000,"Maprange":1},{"Isuid":false,"Isgid":true,"Hostid":101001,"Nsid":1001,"Maprange":654359}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":1000},{"Isuid":true,"Isgid":false,"Hostid":1014,"Nsid":1000,"Maprange":1},{"Isuid":true,"Isgid":false,"Hostid":101001,"Nsid":1001,"Maprange":654359},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":1000},{"Isuid":false,"Isgid":true,"Hostid":1017,"Nsid":1000,"Maprange":1},{"Isuid":false,"Isgid":true,"Hostid":101001,"Nsid":1001,"Maprange":654359}]'
  volatile.last_state.power: STOPPED
devices:
  eth0:
    name: eth0
    nictype: macvlan
    parent: enp5s0
    type: nic
  gpu:
    type: gpu
  home:
    path: /home/ubuntu/wangwei
    source: /home/wangwei
    type: disk
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

I’m using LXD 3.17 on ArchLinux, kernel 5.3.1, btrfs storage backend.

Try unsetting nvidia.runtime, this kind of odd startup behavior can often be tracked down to something being a bit broken with the nvidia driver/libraries.
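For reference, since the key is set in the default profile shown above, unsetting it could look something like this (a sketch; the per-container override variant is an assumption about how you prefer to scope the change):

```shell
# Remove the NVIDIA runtime passthrough from the default profile
lxc profile unset default nvidia.runtime

# Or disable it for just this one container, leaving the profile alone
lxc config set wangwei-container nvidia.runtime false

# Then try starting the container again
lxc start wangwei-container
```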

Yeah, after unsetting nvidia.runtime and removing the gpu device, I can start the container now. But I need the GPU for deep learning training.

Can you check if it’s just nvidia.runtime that’s the problem?
The GPU device itself should be fine.

If that’s the case, then we can try to figure out what’s going on with nvidia-container.
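One way to sanity-check the host-side nvidia-container tooling directly (assuming libnvidia-container and its CLI are installed, as they would be for nvidia.runtime to work at all):

```shell
# Ask the NVIDIA container tooling to enumerate the driver and GPUs;
# failures here usually point at broken driver/library installs
nvidia-container-cli info

# Same, with debug logging sent to stderr to surface mismatches
nvidia-container-cli --debug=/dev/stderr info

# Confirm the driver itself is healthy on the host
nvidia-smi
```

If nvidia-container-cli fails outside of LXD, the problem is in the driver/library install rather than in LXD itself.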

Yes, with only nvidia.runtime unset, the container starts.

I downgraded linux and linux-headers to 5.1.1 and installed nvidia-dkms instead of nvidia for now, but I still see the same issue.
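On Arch, a mismatch between the loaded kernel module and the userspace NVIDIA libraries after a kernel or driver update is a common cause of this kind of failure; a quick check (commands assume a standard NVIDIA driver install) is to compare the two versions:

```shell
# Version of the NVIDIA kernel module currently loaded
cat /proc/driver/nvidia/version

# Version of the userspace driver/libraries; should match the module above
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

If the versions differ, a reboot (or reinstalling the driver package) to get matching module and libraries is worth trying before anything else.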

I reinstalled the host system with Ubuntu 18.04; this puts all my Linux servers on the same OS and solved the problem for me.
