Container Failed to retrieve network information

Operating system: Ubuntu 20.04 LTS
LXC version: 4.0.4

Lately I have encountered many issues where containers get stuck, and the only reliable fix seems to be rebooting the instance. Of course this is not ideal, so I need to find a more solid fix that doesn't require rebooting.

In our latest incident, syslog shows:

Nov 26 03:50:32 lxd-bla-bla-bla lxd.daemon[5911]: t=2021-11-26T03:50:32+0000 lvl=eror msg="Failed to retrieve network information via netlink" container=xlN-blablacontainer pid=111846
Nov 26 03:50:32 lxd-bla-bla-bla lxd.daemon[5911]: t=2021-11-26T03:50:32+0000 lvl=eror msg="Error calling 'lxd forknet" container=xlN-blablacontainer err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 111846 3: Failed setns to container network namespace: No such file or directory" pid=111846
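
The "No such file or directory" from setns suggests that the network namespace entry for PID 111846 was already gone when the call ran. A rough way to check this for a given log line is sketched below. It assumes the namespace is reached through /proc/<pid>/ns/net (the usual Linux path, but not confirmed here against the LXD source), and the helper names are made up for illustration.

```python
import os
import re

# Matches a pid=<number> field such as the one at the end of the LXD log lines.
PID_FIELD = re.compile(r"\bpid=(\d+)\b")


def extract_pid(log_line):
    """Return the last pid=<n> value found in the log line, or None."""
    matches = PID_FIELD.findall(log_line)
    return int(matches[-1]) if matches else None


def netns_path(pid):
    """Path of the network namespace entry for a PID (assumed /proc layout)."""
    return f"/proc/{pid}/ns/net"


def netns_exists(pid):
    """True if the namespace entry is present, i.e. the process is still alive."""
    return os.path.lexists(netns_path(pid))


if __name__ == "__main__":
    line = "lvl=eror msg=\"Failed to retrieve network information via netlink\" container=xlN-blablacontainer pid=111846"
    pid = extract_pid(line)
    print(pid, netns_path(pid), netns_exists(pid))
```

If the path is missing for the PID in the log, the process that LXD was tracking has most likely exited, which would be worth comparing against the Pid reported for the container.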

Trying to restart the container just adds another stuck operation. From lxc operation show ID:

id: 5aa7064d-0069-428c-9be4-251b7e4c6203
class: websocket
description: Executing command
created_at: 2021-11-26T04:07:41.579498862Z
updated_at: 2021-11-26T04:07:41.579498862Z
status: Running
status_code: 103

  • /1.0/containers/xlN-blablacontainer
  • /1.0/instances/xlN-blablacontainer
  • /bin/sh
  • -c
  • /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1637899659.1603563-95885-117991861407417/
    && sleep 0'
    HOME: /root
    LANG: C.UTF-8
    PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    USER: root
    "0": 607757cad07b578e2029838a394baa9a74fba6d82e74c519a9195a52450db1ed
    "1": b43ac4ad70c5fc0827ee708b6c358c3fb682d1e72072e4bf0073141a7389644e
    "2": 736d28d6f25ca25971995527e33db40caa299a6aa831f0d1b8efd7c19c9dbaab
    control: 36dae859adf69b51f9df614243bcb68b6cb42812638208c753a5276f0bdeb24b
    interactive: false
    may_cancel: false
    err: ""
    location: none

The container is still shown as running by the daemon:

Name: xlN-blablacontainer
Location: none
Remote: unix://
Architecture: x86_64
Created: 2021/09/30 20:30 UTC
Status: Running
Type: container
Profiles: default
Pid: 16835
eth0: inet vethecb9cc0a
eth0: inet6 fe80::216:3eff:fec7:5350 vethecb9cc0a
lo: inet
lo: inet6 ::1

Any idea what might be causing this, or how this can get fixed?

Dealing with the restart hanging first, are you using the -f flag on restart to perform a forceful restart?

@tomp yep, tested that as well. The restart also gets hung in the operation list and doesn't proceed, even with the -f flag.

It could be a disk I/O issue. Do you see the problem with other containers?
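
One rough way to look for an I/O stall is to list processes in uninterruptible sleep (state D in /proc/<pid>/stat), since tasks blocked on disk usually sit in that state. This is a minimal sketch, assuming a Linux host with /proc mounted. The function names are hypothetical, and a D-state process is only a hint, not proof of a disk problem.

```python
import os


def parse_stat_state(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left].strip())
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or its stat file is unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Running it a few times while a container is stuck shows whether the same processes stay in state D.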

No. It happens randomly on different servers, and none of them have any I/O issues.

The most important question for now is: can this issue be fixed without having to restart the machine?