LXD: Network interfaces get renamed, container restart fails

I think it’s mostly the same one, but it’s part of a 4-port Ethernet card, so I’m not sure it matters. There shouldn’t be any conflicts, I think: when everything is working as it should, the host has only one interface visible, br0, and everything physical goes to the container.

I will get back to you with logs when I get a chance, it will take some doing because of my config.

I should add that this does not happen every time, but I’ve started to just reboot the whole host when needed.

So I had a look at the liblxc source code and found this:

I wonder if the NIC is clashing with another NIC inside the container and not being renamed, so that when it’s moved back LXD doesn’t recognise it and can’t rename it back to its original name.

Got it to fail on the very first restart. Here is the info you asked for:

Lost 2 interfaces this time.

This very well might have something to do with my OpenWrt container; I use the same PC as my server and router.

As a side note, how does the new feature “Startup with degraded networking” work? Shouldn’t it start the container even if the NICs are missing?

No, it’s for allowing LXD to start without bringing up all of its managed networks, not for allowing an instance to start without all of its devices.

Can you reproduce this by running lxc stop <instance> and then show the output of lxc info <instance> --show-log?

It just shows:

Name: router
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2021/11/03 23:19 EET
Last Used: 2022/04/08 12:35 EEST

Log:

lxc router 20220408101124.376 WARN     network - network.c:lxc_delete_network_priv:3617 - Failed to rename interface with index 2 from "eth3" to its initial name "enp4s0f0"
lxc router 20220408101124.379 WARN     network - network.c:lxc_delete_network_priv:3617 - Failed to rename interface with index 3 from "eth4" to its initial name "enp4s0f1"


OK, well, that’s something: we can see it’s liblxc having trouble renaming the interface.

@brauner do you have any idea why lxc_netdev_rename_by_index would fail renaming an interface back to the host side name when the container stops?

It can happen if there’s a network device on the host with the same name. Other than that it’s not obvious what would cause it.

When the container is stopped, LXC moves the network device back to the host. In order to do that it uses a “transient” name, which it also used during interface creation. It’s basically a low-effort way to avoid name collisions on the host when moving back a network device that usually has a high-collision-probability name inside the container, such as “eth0”.

In the final step it is renamed from the transient name to its original name on the host. Since the rename step fails only after the device has been moved back, it is somewhat likely that this is a naming collision, i.e. its original host-side name has been taken by another device.
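The failure mode described above can be reproduced by hand with dummy interfaces (a sketch; it requires root, and the names tmpA and enpTEST are made up, standing in for the transient name and the already-taken host-side name):

```shell
# Simulate the rename-back step with dummy links (requires root).
# tmpA plays the transient "physXXXXX" name; enpTEST plays the original
# host-side name that another device has meanwhile taken.
ip link add tmpA type dummy
ip link add enpTEST type dummy
# Renaming tmpA to a name that already exists fails with
# "RTNETLINK answers: File exists" - the same class of error liblxc hits.
ip link set tmpA name enpTEST 2>/dev/null || echo "rename failed: name already in use"
# Clean up
ip link del tmpA 2>/dev/null || true
ip link del enpTEST 2>/dev/null || true
```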


Perhaps something on the host is renaming an earlier NIC to the same name as a later NIC that is still to be moved back, and that is causing the conflict.

Does it only happen if you have multiple NICs in your container?

I guess I can test later, but I will always have multiple NICs in the container; it is a router/switch, after all.

That collision theory seems probable. If I disable Predictable Network Interface Names, the host NICs stay as eth0 etc. instead of enp*, and then the container won’t start at all.

Perhaps I could try renaming the container NICs to eth01 etc. to avoid collisions.
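For reference, a container NIC’s in-container name can be set via the device’s name property, so it never uses a collision-prone default like eth0. A sketch of what such a device entry might look like in lxc config show output (device and parent names here are illustrative, not from the actual config):

```yaml
devices:
  eth0:
    type: nic
    nictype: physical
    parent: enp4s0f0   # host-side NIC passed into the container
    name: eth10        # name the NIC gets inside the container
```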

I’m not saying that is the problem, but it may indicate what is happening: perhaps something on the host is restoring them to the same name.

OK, I did test this by removing all but one physical NIC from the container and adding them back one by one. The problems start when adding the third of four.

I wonder if something on your host machine is renaming the NICs as they are added back to the host, causing the conflict.

No idea; I have Ubuntu Server with minimal extra packages. Predictable Network Interface Names does this on boot, of course, but like I said before, if I disable that, the LXD container won’t start even once, and I have those phys*** NICs listed after it has tried.

I’m using only systemd-networkd, with only br0 configured, if that makes any difference. And I compile LXD from source; I don’t have snap installed.
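For context, a minimal systemd-networkd bridge setup like the one described might look something like this (a sketch only; the file names are conventional and the DHCP addressing is a guess, not taken from the actual host):

```ini
# /etc/systemd/network/br0.netdev - create the bridge
[NetDev]
Name=br0
Kind=bridge

# /etc/systemd/network/br0.network - bring it up
[Match]
Name=br0

[Network]
DHCP=yes
```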

I did find evidence of interface name changes on my failing LXD 5.0 box.
I have just one NIC that I use (there is also a Wi-Fi interface, but it was never configured in Netplan).

I got this problem (and then some) after a host reboot for maintenance. Now I’m getting a bunch of different errors on Ubuntu 20.04 with LXD 5.0, and containers can’t start.

I am using snapd with latest/stable; this was the first orderly restart after the (unexpected) upgrade of LXD to 5.0 by snapd (I shouldn’t have used latest/stable; 4.0 worked fine before).

Anyway, as far as one of the issues - the network interface name change - is concerned, I see this in syslog:

Apr 15 06:42:45 server kernel: [   43.465960] device vethe34f7be2 entered promiscuous mode
Apr 15 06:42:45 server zed: eid=15 class=history_event pool_guid=0x7FA2A2DA6C8B7235  
Apr 15 06:42:45 server zed: eid=16 class=history_event pool_guid=0x7FA2A2DA6C8B7235  
Apr 15 06:42:45 server zed: eid=17 class=history_event pool_guid=0x7FA2A2DA6C8B7235  
Apr 15 06:42:45 server kernel: [   43.582430] audit: type=1400 audit(1650004965.651:61): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-mycont </var/snap/lxd/common/lxd>" pid=14896 comm="apparmor_parser"
Apr 15 06:42:45 server kernel: [   43.660602] physi2Ry4Y: renamed from vetha32d6312
Apr 15 06:42:45 server kernel: [   43.684939] eth0: renamed from physi2Ry4Y
Apr 15 06:42:45 server systemd-networkd[1579]: vethe34f7be2: Gained carrier
Apr 15 06:42:45 server kernel: [   43.709518] lxdbr0: port 2(vethe34f7be2) entered blocking state
Apr 15 06:42:45 server kernel: [   43.709520] lxdbr0: port 2(vethe34f7be2) entered forwarding state
Apr 15 06:42:48 server systemd[1]: systemd-hostnamed.service: Succeeded.

I looked around and it seems this could be related to systemd (current version 250; the version in Ubuntu 20.04 is 245, and at that time they still didn’t publish change logs, so it’s hard to figure out what changed in 246 or 247, for example). I also looked at udev rules but didn’t find anything unusual; the same with networkd-dispatcher (running it in verbose mode revealed nothing new).

I also tried configuring systemd-networkd to start after udev settles. No difference.

[Unit]
After=systemd-udev-settle.service
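Worth noting: on its own, After= only orders the two units when both are already scheduled to start; to actually pull in the settle service, the drop-in also needs a Wants= line. A sketch of such an override (the drop-in path is the conventional location, not from the poster’s setup):

```ini
# e.g. /etc/systemd/system/systemd-networkd.service.d/override.conf
[Unit]
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
```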

My last try was to change the default bridge to br0 and bring br0 up via Netplan rather than leaving it to LXD, but that doesn’t help.

Does this fix it?

In /usr/lib/systemd/network/99-default.link

Remove ‘keep’ from NamePolicy.

[Match]
OriginalName=*

[Link]
NamePolicy=kernel database onboard slot path
AlternativeNamesPolicy=database onboard slot path
MACAddressPolicy=persistent
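One way to apply this without editing the packaged file is to place an override in /etc/systemd/network/, which takes precedence over the copy in /usr/lib/systemd/network/. Since Ubuntu’s initramfs also carries udev and its .link files, refreshing it afterwards is a good idea. A sketch of the steps (paths assume a standard Ubuntu layout):

```shell
# Write the override locally first (content mirrors the fix above,
# i.e. NamePolicy without 'keep').
cat > 99-default.link <<'EOF'
[Match]
OriginalName=*

[Link]
NamePolicy=kernel database onboard slot path
AlternativeNamesPolicy=database onboard slot path
MACAddressPolicy=persistent
EOF
echo "wrote 99-default.link"
# Then install it and refresh the initramfs (requires root):
#   cp 99-default.link /etc/systemd/network/99-default.link
#   update-initramfs -u
```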


Sorry for replying to an old topic. I had the same trouble as the original poster, and today I was very lucky: I found this thread.

@Pieter I followed your advice and my container rebooted successfully.

I am a beginner in both Linux and LXD (and English), so I didn’t even know how to look for tips to solve the problem. Unfortunately, I don’t really understand what the settings I changed this time actually mean.
Regardless, I am so full of gratitude to you and all the people on this thread that I immediately created an account on this forum.

Thank you very much and have a nice weekend.
