Unknown error -17 - Failed to setup ipv4 address route for network device

I have a Debian 10 container with a routed network interface. Today I restarted the container and it never came up again. When I tried to delete the container and recreate it, I got the same error until I removed the IP from the host's routes:

route del -net 138.*.16.151 netmask 255.255.255.255
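A minimal sketch of how the leftover state can be spotted on the host (the address is redacted the same way as above, so substitute the real one in the grep pattern):

# look for a stale /32 route left behind for the container's address
ip -4 route show | grep '138'
# list veth interfaces still present on the host
ip link show type veth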

I am currently testing LXD for production use, but how is it possible that this could happen? Unfortunately I no longer have the original container that broke after the restart, but I would really like to know how to prevent this issue from happening in the future, because it is preventing me from using LXD in production.

This is how I create the container:
lxc init images:debian/buster c20
lxc config set c20 limits.cpu 1
lxc config set c20 limits.cpu.allowance 12%
lxc config set c20 limits.memory 1024MB
lxc config set c20 limits.memory.swap false
lxc config device add c20 root disk pool=default path=/
lxc config device set c20 root size 20GB
lxc config device set c20 root limits.read 10000MB
lxc config device set c20 root limits.write 10000MB
lxc start c20

After that I set up the /etc/network/interfaces file inside the container:

auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
address 138.*.16.151/32
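(For completeness, a /32 address has no on-link gateway of its own; a fuller static config usually also adds the default route via the link-local gateway address that routed NICs use on the host side by default, 169.254.0.1. This is only a sketch, assuming the default ipv4.gateway=auto setting:)

auto eth0
iface eth0 inet static
    address 138.*.16.151/32
    # make the link-local gateway reachable on-link, then route everything through it
    post-up ip route add 169.254.0.1 dev eth0
    post-up ip route add default via 169.254.0.1 dev eth0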

Finally I add the routed NIC device to the container:

lxc stop c20
lxc config device add c20 eth0 nic nictype=routed parent=eth0 ipv4.address=138.*.16.151
lxc start c20

This is where the error pops up:

lxc info --show-log c20
Name: c20
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/12/15 17:10 UTC
Status: Stopped
Type: container
Profiles: default

Log:

lxc c20 20201215172105.724 WARN     cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.monitor.c20"
lxc c20 20201215172105.724 WARN     cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset//lxc.payload.c20"
lxc c20 20201215172105.725 ERROR    utils - utils.c:lxc_can_use_pidfd:1846 - Kernel does not support pidfds
lxc c20 20201215172105.726 WARN     cgfsng - cgroups/cgfsng.c:fchowmodat:1573 - No such file or directory - Failed to fchownat(17, memory.oom.group, 1000000000, 0, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW )
lxc c20 20201215172105.730 ERROR    network - network.c:setup_ipv4_addr_routes:163 - Unknown error -17 - Failed to setup ipv4 address route for network device with eifindex 102
lxc c20 20201215172105.730 ERROR    network - network.c:instantiate_veth:422 - Unknown error -17 - Failed to setup ip address routes for network device "veth931b90e0"
lxc c20 20201215172105.756 ERROR    network - network.c:lxc_create_network_priv:3068 - Unknown error -17 - Failed to create network device
lxc c20 20201215172105.756 ERROR    start - start.c:lxc_spawn:1786 - Failed to create the network
lxc c20 20201215172105.756 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:860 - Received container state "ABORTING" instead of "RUNNING"
lxc c20 20201215172105.757 ERROR    start - start.c:__lxc_start:1999 - Failed to spawn container "c20"
lxc c20 20201215172105.757 WARN     start - start.c:lxc_abort:1018 - No such process - Failed to send SIGKILL to 23986
lxc 20201215172105.798 WARN     commands - commands.c:lxc_cmd_rsp_recv:126 - Connection reset by peer - Failed to receive response for command "get_state"

Can you recreate the error?

I will try to, but it happened to a container that had been running for about 20 days without any interaction; today I tried to restart it and it could not start because of this. I create and delete all my containers via scripts, so I always do it the same way.

Thanks,

Do you have the error logs from the container that failed to clean up the routes, rather than the errors caused when trying to start it again? That would hopefully help shed some light on the issue.

Also what is the host OS & version and LXD version?

If you run the snap, there may be something in journalctl or /var/snap/lxd/common/lxd/logs/lxd.log.
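For example (assuming the snap package; the journalctl unit name may differ on other installs):

journalctl -u snap.lxd.daemon --since "2020-12-15"
cat /var/snap/lxd/common/lxd/logs/lxd.log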

I have just this, but I guess it is useless; this is from when I tried to restart the broken container that had had the IP assigned for about 20 days and failed to start today:

t=2020-12-15T17:28:31+0100 lvl=info msg="Starting container" action=start created=2020-11-15T13:23:37+0100 ephemeral=false name=c19 project=default stateful=false used=2020-11-15T13:23:42+0100
t=2020-12-15T17:28:31+0100 lvl=eror msg="Failed starting container" action=start created=2020-11-15T13:23:37+0100 ephemeral=false name=c19 project=default stateful=false used=2020-11-15T13:23:42+0100
t=2020-12-15T17:28:32+0100 lvl=info msg="Shut down container" action=stop created=2020-11-15T13:23:37+0100 ephemeral=false name=c19 project=default stateful=false used=2020-12-15T17:28:31+0100
t=2020-12-15T17:29:24+0100 lvl=info msg="Starting container" action=start created=2020-11-15T13:23:37+0100 ephemeral=false name=c19 project=default stateful=false used=2020-12-15T17:28:31+0100
t=2020-12-15T17:29:24+0100 lvl=eror msg="Failed starting container" action=start created=2020-11-15T13:23:37+0100 ephemeral=false name=c19 project=default stateful=false used=2020-12-15T17:28:31+0100
t=2020-12-15T17:29:25+0100 lvl=info msg="Shut down container" action=stop created=2020-11-15T13:23:37+0100 ephemeral=false name=c19 project=default stateful=false used=2020-12-15T17:29:24+0100
t=2020-12-15T17:30:07+0100 lvl=info msg="Starting container" action=start created=2020-11-15T13:23:37+0100 ephemeral=false name=c19 project=default stateful=false used=2020-12-15T17:29:24+0100
t=2020-12-15T17:30:07+0100 lvl=eror msg="Failed starting container" action=start created=2020-11-15T13:23:37+0100 ephemeral=false name=c19 project=default stateful=false used=2020-12-15T17:29:24+0100
t=2020-12-15T17:30:08+0100 lvl=info msg="Shut down container" action=stop created=2020-11-15T13:23:37+0100 ephemeral=false name=c19 project=default stateful=false used=2020-12-15T17:30:07+0100
t=2020-12-15T17:32:21+0100 lvl=info msg="Deleting container" created=2020-11-15T13:23:37+0100 ephemeral=false name=c19 project=default used=2020-12-15T17:30:07+0100
t=2020-12-15T17:32:22+0100 lvl=info msg="Deleted container" created=2020-11-15T13:23:37+0100 ephemeral=false name=c19 project=default used=2020-12-15T17:30:07+0100

Thanks. It occurred to me that if the static route still exists then this means that the host side of the veth pair also still exists (because if it didn’t exist then the static route would also be removed).

While liblxc should remove the container-side veth interface on shutdown (which in turn should remove the host-side veth interface), and indeed in my tests this is what happens, I think we could do a more thorough job in LXD of detecting whether the host-side interface still exists when the NIC device is stopped, and trying to remove it.
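Conceptually, the extra cleanup amounts to something like this on the host (a manual sketch only, not the actual LXD code; the veth name is just an example taken from the log above):

# if the host-side veth from the stopped container is still around, delete it;
# removing the interface also removes the stale /32 route that points at it
ip link show veth931b90e0 >/dev/null 2>&1 && ip link delete veth931b90e0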

This is what the PR does: