LXD: Network interfaces get renamed, container restart fails

Crowley007 · April 8, 2022, 7:49am

Hello,

I’ve had this problem as long as I can remember. Now using Ubuntu 22.04 Beta and LXD 5.

I pass through all my physical NICs to an OpenWrt container. This works fine the first time I boot the host. But when I try to restart the OpenWrt container with “lxc restart”, some of the parent NICs get renamed to something seemingly random like phys****** and the container fails to start because the parent NIC does not exist.

The phys****** adapter does have the correct MAC and has a property “altname” which does have the real interface name.
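For anyone wanting to confirm which renamed adapter is which, here is a minimal sketch that looks a NIC up by MAC in the output of `ip -json link show`. It assumes an iproute2 recent enough to support JSON output and to report an "altnames" list; the function names are made up for illustration and are not part of LXD or liblxc.

```python
import json
import subprocess


def load_links():
    """Return the host's interfaces as parsed `ip -json link show` output."""
    result = subprocess.run(
        ["ip", "-json", "link", "show"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def find_by_mac(links, mac):
    """Return (current name, altnames) for the first link with this MAC.

    Returns None when no link carries the address. The comparison is
    case-insensitive because tools differ in how they print MACs.
    """
    wanted = mac.lower()
    for link in links:
        if link.get("address", "").lower() == wanted:
            return link["ifname"], link.get("altnames", [])
    return None
```

Calling `find_by_mac(load_links(), "<mac of the lost NIC>")` on the host should then give the phys****** name together with the altname that holds the real interface name. Note that it matches the current MAC, not the permanent address.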

Would it be possible to pass through the NIC by MAC instead of by interface name? Or is there anything else I could do to stop this behavior? I tried disabling Predictable Network Interface Names, but with that the NICs get renamed to phys****** during host boot and the container won’t start even once.

tomp · April 8, 2022, 8:30am

Is it always the same parent NICs that get renamed?
Are there any conflicting interfaces on the parent when the container gets restarted?

Please can you show the output of lxc config show <instance> --expanded along with the output of ip a before and after the container has been started and then restarted.
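If comparing the two outputs by eye is tedious, a rough sketch like the following can diff two snapshots keyed by MAC. It assumes each snapshot is the parsed output of `ip -json link show` taken on the host (for example before the first start and after the stop); the helper names are hypothetical.

```python
def names_by_mac(links):
    """Map lower-cased MAC address to interface name, skipping links without a MAC."""
    return {link["address"].lower(): link["ifname"]
            for link in links if "address" in link}


def diff_names(before, after):
    """Return (mac, old_name, new_name) for every MAC whose name changed.

    new_name is None when the MAC is no longer visible in the second
    snapshot, e.g. because the NIC is still inside the container.
    """
    changes = []
    for mac, old_name in sorted(before.items()):
        new_name = after.get(mac)
        if new_name != old_name:
            changes.append((mac, old_name, new_name))
    return changes
```

One limitation of this sketch: if two interfaces share a MAC (a bridge and one of its members, say), only one of them ends up in the map.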

tomp · April 8, 2022, 8:30am

Passing a NIC through by MAC isn’t possible at this time; we should focus on fixing the bug that causes them to be renamed.

Crowley007 · April 8, 2022, 9:00am

I think it’s mostly the same one, but it’s part of a 4-port Ethernet card, so I’m not sure it matters. I don’t think there should be any conflicts: when everything is working as it should, the host has only one interface visible, br0; everything physical goes to the container.

I will get back to you with logs when I get a chance; it will take some doing because of my config.

I should add that this does not happen every time, but I’ve started to just reboot the whole host when needed.

tomp · April 8, 2022, 9:07am

So I had a look at the liblxc source code and found this:

github.com: lxc/lxc/blob/master/src/lxc/network.c#L3433-L3487

    /*
     * LXC moves network devices into the target namespace based on their created
     * name. The created name can either be randomly generated for e.g. veth
     * devices or it can be the name of the existing device in the server's
     * namespaces. This is e.g. the case when moving physical devices. However this
     * can lead to weird clashes. Consider we have a network namespace that has the
     * following devices:
     *
     * 4: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
     *    link/ether 00:16:3e:91:d3:ae brd ff:ff:ff:ff:ff:ff permaddr 00:16:3e:e7:5d:10
     *    altname enp7s0
     * 5: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
     *    link/ether 00:16:3e:e7:5d:10 brd ff:ff:ff:ff:ff:ff permaddr 00:16:3e:91:d3:ae
     *    altname enp8s0
     *
     * and the user generates the following network config for their container:
     *
     *  lxc.net.0.type = phys
     *  lxc.net.0.name = eth1
     *  lxc.net.0.link = eth2

This file has been truncated.

I wonder if the NIC is clashing with another NIC inside the container and then not being renamed, so that when it’s moved back, LXD doesn’t recognise it and so doesn’t rename it back to its original name.

Crowley007 · April 8, 2022, 9:44am

Got it to fail on the very first restart. Here is the info you asked for:

Lost 2 interfaces this time.

This may well have something to do with my OpenWrt container; I use the same PC as my server and router.

As a side note, how does the new feature “Startup with degraded networking” work? Shouldn’t it start the container even if the NICs are missing?

tomp · April 8, 2022, 9:52am

No, it’s for allowing LXD to start without starting all its managed networks, not for allowing an instance to start without all its devices.

tomp · April 8, 2022, 10:04am

Can you reproduce this by running lxc stop <instance> and then show the output of lxc info <instance> --show-log?

Crowley007 · April 8, 2022, 10:15am

It just shows:

Name: router
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2021/11/03 23:19 EET
Last Used: 2022/04/08 12:35 EEST

Log:

lxc router 20220408101124.376 WARN     network - network.c:lxc_delete_network_priv:3617 - Failed to rename interface with index 2 from "eth3" to its initial name "enp4s0f0"
lxc router 20220408101124.379 WARN     network - network.c:lxc_delete_network_priv:3617 - Failed to rename interface with index 3 from "eth4" to its initial name "enp4s0f1"
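
To pull the affected interfaces out of a longer log, a small sketch along these lines could help. The regular expression is written against the two warning lines above only; other liblxc versions may word the message differently, so treat the pattern as an assumption.

```python
import re

# Matches the liblxc warning shown above and captures the interface index,
# the name it currently has, and the initial host name it should get back.
RENAME_FAILURE = re.compile(
    r'Failed to rename interface with index (\d+) '
    r'from "([^"]+)" to its initial name "([^"]+)"'
)


def failed_renames(log_text):
    """Return a list of (ifindex, current_name, initial_name) tuples."""
    return [(int(index), current, initial)
            for index, current, initial in RENAME_FAILURE.findall(log_text)]
```

The text to feed in would be the Log section printed by lxc info <instance> --show-log.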

tomp · April 8, 2022, 10:20am

OK, well that’s something: we can see it’s liblxc having trouble renaming the interface.

tomp · April 8, 2022, 10:22am

@brauner do you have any idea why lxc_netdev_rename_by_index would fail renaming an interface back to the host side name when the container stops?

brauner · April 8, 2022, 10:38am

It can happen if there’s a network device on the host with the same name. Other than that it’s not obvious what would cause it.
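
One way to check for that is to ask which host devices currently hold a given name. The sketch below takes a list shaped like the parsed output of `ip -json link show` and also looks at altnames; whether an altname alone is enough to block the rename depends on the kernel, so that part is an assumption, and the function name is hypothetical.

```python
def holders_of(links, name):
    """Return the ifnames of all links using `name` as primary name or altname."""
    return [link["ifname"] for link in links
            if link["ifname"] == name or name in link.get("altnames", [])]
```

An empty result for the initial name (e.g. enp4s0f0) would suggest nothing on the host is holding it at the moment of the check, although a short-lived collision during the stop could still be missed.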

brauner · April 8, 2022, 10:44am

When the container is stopped LXC will move the network device back to the host. In order to do that it will use a “transient” name which it has used during interface creation. It’s basically a low-effort way to avoid name collisions on the host when moving a network device back that usually has a high-collision probability name such as “eth0” in the container.

In the final step it is renamed from the transient name to its original name on the host. Since the rename step fails after the device has been moved back, it makes it somewhat likely that it’s a naming collision, i.e. its original host name has been taken by another device.
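
The two steps described above can be sketched as a toy model. This is not liblxc code: it only tracks a set of names per namespace, and the class, function, and device names are invented for illustration.

```python
class NameTaken(Exception):
    """Raised when a name is already used in the namespace."""


class Namespace:
    """Toy network namespace: just the set of device names in use."""

    def __init__(self, names=()):
        self.names = set(names)

    def add(self, name):
        if name in self.names:
            raise NameTaken(name)
        self.names.add(name)

    def rename(self, old, new):
        if new in self.names:
            raise NameTaken(new)
        self.names.remove(old)
        self.names.add(new)


def move_back(host, transient, original):
    """Bring a device back under its transient name, then restore the original.

    Returns the name the device ends up with on the host. If the original
    name is taken, the device keeps the transient name.
    """
    host.add(transient)
    try:
        host.rename(transient, original)
        return original
    except NameTaken:
        return transient
```

In this model, a device whose original name is free gets it back, while one whose original name is already present on the host is left with its transient name, which is the symptom reported in this thread.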

tomp · April 8, 2022, 10:47am

Perhaps something on the host is renaming an earlier NIC to the same name as a later NIC that is still to be removed from the container, and that is causing the conflict.

tomp · April 8, 2022, 10:48am

Does it only happen if you have multiple NICs in your container?

Crowley007 · April 8, 2022, 12:54pm

I guess I can test later, but I will always have multiple NICs in the container; it is a router/switch after all.

That collision theory seems probable. If I disable Predictable Network Interface Names, the host NICs stay as eth0 etc. instead of enp* names, and then the container won’t start at all.

Perhaps I could try renaming the container NICs to eth01 etc. to avoid the collision.

tomp · April 8, 2022, 2:23pm

I’m not saying that is a problem, but it may indicate what is happening: perhaps something on the host is restoring them to the same name.

Crowley007 · April 8, 2022, 8:32pm

Ok, I did test this by removing all but one physical NIC from the container and adding them back one by one. The problems start when adding the third one of four.

tomp · April 13, 2022, 8:46am

I wonder if something on your host machine is renaming the NICs as they are added back to the host, causing the conflict.

Crowley007 · April 13, 2022, 12:58pm

No idea. I have Ubuntu Server with minimal extra packages. Predictable Network Interface Names does this on boot of course, but like I said before, if I disable that, the LXC container won’t start even once and I have those phys*** NICs listed after it has tried.

I’m using only systemd-networkd with only br0 configured, if that makes any difference. And I compile LXD from source; I don’t have snap installed.