Macvlan networking suddenly stopped working; cloud-init not setting up interfaces and DHCP failing

I don’t know exactly when this happened, but I suspect a recent automatic LXD snap refresh.

Anyway, I noticed that one of my containers was missing an IP address on one of its network interfaces. Investigating further, I found that the netplan file that cloud-init generates was missing that interface.

I added it back in manually and then noticed these errors in syslog:
dhcpcd[315]: eth1: adding default route
dhcpcd[315]: if_route (ADD): Invalid argument

The broken interface is declared as macvlan in the container’s config. Only the bridged interface is coming up and working. I have other containers with only macvlan interfaces and none of them come up.

I’m running LXD 5.3. I’ve tried downgrading LXD to 5.2/stable in snap, but it won’t run the older version after I switch channels.

Can anyone help please with:
a) reverting to an older LXD to see if I can get my production system working again
b) debugging this issue

Many thanks.

OK, I managed to downgrade to 5.2 and it’s still broken. Any further help still appreciated.

cheers

OK, update: I traced this down to the host bridge randomly going down. I restarted the bridge and the containers all immediately got their missing IPs. There was no indication it had gone down; it just wasn’t working :frowning:
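
For reference, “restarting the bridge” here just meant cycling the mac0 interface with ifupdown:

sudo ifdown mac0
sudo ifup mac0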

So I think this is just a case where the error reporting needs to improve.

Hi,

So do your macvlan NIC devices connect to a bridge?

Is it an LXD-managed bridge or something set up manually?

Please can you show the output of lxc config show <instance> --expanded for an affected instance, as well as ip a and ip r from the host, so I can get a better understanding of your setup.

Hi Thomas

The bridge is manual. My /etc/network/interfaces file has these entries:

auto mac0
iface mac0 inet static
        address 192.168.4.5
        netmask 255.255.255.0
        broadcast 192.168.4.255
        gateway 192.168.4.1

auto enx00e020110743
iface enx00e020110743 inet manual
        pre-up ip link add mac0 link enx00e020110743 type macvlan mode bridge
        post-down ip link del mac0 link enx00e020110743 type macvlan mode bridge

The enx device is a pluggable NIC; when I bring it up, the mac0 macvlan interface is created on top of it with its own IP.

Also, apologies, I thought I’d already posted the configs. The eth sections look like this:

  volatile.eth0.host_name: mac7cc32bde
  volatile.eth0.hwaddr: 00:16:3e:45:a9:b7
  volatile.eth0.last_state.created: "false"
  volatile.eth0.name: eth0
  volatile.eth1.host_name: macf3cd495a
  volatile.eth1.hwaddr: 00:16:3e:ed:79:5e
  volatile.eth1.last_state.created: "false"
  volatile.eth1.name: eth1

And they are applied from a profile that has:

devices:
  eth0:
    nictype: macvlan
    parent: mac0
    type: nic
  eth1:
    nictype: macvlan
    parent: eth0
    type: nic

This has been working great for more than a year and it just suddenly failed recently. I don’t know what happened, only that when I did ifdown mac0 then ifup mac0 it all burst back to life. There was no indication that mac0 was down in any way prior to that.

$ ip a|grep mac0
180: mac0@enx00e020110743: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.4.5/24 brd 192.168.4.255 scope global mac0

$ ip r|fgrep 4.0
192.168.4.0/24 dev mac0 proto kernel scope link src 192.168.4.5

Let me know if you need more info.

Interesting. I’ve never seen that sort of setup before, where you are layering macvlan interfaces for an instance on top of an existing macvlan interface (parent=mac0). Normally LXD macvlan NICs would use the physical parent directly, so in this case it would be parent=enx00e020110743.

I’m not saying it’s in any way relevant to the issue at hand, but it caught my eye as perhaps an extra layer of complexity that isn’t necessarily needed. That’s one of the benefits of macvlan compared to a bridge: you as the admin don’t need to create a manual macvlan interface and move the host’s IP onto it.
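
For comparison, a minimal profile entry using the physical NIC directly would look something like this (interface name taken from your interfaces file; just a sketch, not something you need to change):

devices:
  eth0:
    nictype: macvlan
    parent: enx00e020110743
    type: nic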

Is this setup used because the physical parent NIC is removable? Or are you trying to make the host reachable from the instances?

You mentioned on Twitter that you suspect a snap refresh triggered this; can you provide the output of sudo snap changes lxd so we can see the refresh times? Although, given that the parent interface is manually defined, I can’t think of anything in LXD that would cause it to take down an unmanaged parent interface on reload.

Also, is there any indication in your syslogs of the parent mac0 or the physical link enx00e020110743 going down?
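
Something like this should surface any kernel or networking messages mentioning those interfaces (log path assumed for a standard Ubuntu rsyslog setup):

grep -E 'mac0|enx00e020110743' /var/log/syslog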

I just tried this sort of setup myself:

Physical interface (with IPs configured on it): enp2s0

Create macvlan interface in bridge mode linked to enp2s0 interface:

sudo ip link add mac0 link enp2s0 type macvlan mode bridge
ip l show mac0
12: mac0@enp2s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 56:c7:e8:62:27:05 brd ff:ff:ff:ff:ff:ff

So the mac0 interface is down.

lxc init images:ubuntu/jammy c1
lxc config device add c1 eth0 nic nictype=macvlan parent=mac0
lxc start c1
lxc ls c1
+------+---------+----------------------+--------------------------------------------+-----------+-----------+
| NAME |  STATE  |         IPV4         |                    IPV6                    |   TYPE    | SNAPSHOTS |
+------+---------+----------------------+--------------------------------------------+-----------+-----------+
| c1   | RUNNING | 192.168.1.185 (eth0) | 2a02:n:n:1:f864:36ff:fe28:f96a (eth0) | CONTAINER | 0         |
+------+---------+----------------------+--------------------------------------------+-----------+-----------+

Great, that works, even with the macvlan parent interface being down:

ip l show mac0
12: mac0@enp2s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 56:c7:e8:62:27:05 brd ff:ff:ff:ff:ff:ff

Bringing the mac0 interface up and down doesn’t seem to affect the instance’s connectivity.
I also tried bringing the physical parent interface down and back up; that did break connectivity, as expected, and once the link came back up the instance regained its dynamic IP.
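
For reference, the up/down toggling in that test was done with plain ip link commands, along these lines (interface names from my test setup above):

sudo ip link set mac0 down
sudo ip link set mac0 up
sudo ip link set enp2s0 down
sudo ip link set enp2s0 up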

Hi

The layering is done so that hotplugging the removable NIC works, and also so that the containers can reach a host IP while using macvlan.

The reference to the snap update was a separate problem unrelated to this; I think I received a change that activated the consistency check between snapshots and snapshot volumes, and some Googling gave me a fix involving running some raw SQL.

I looked back through syslog and I can’t see anything particularly interesting. avahi-daemon occasionally spews a load of entries about withdrawing/adding IPv6 records on the enx/mac0 interface, but nothing says it went down anywhere. Interestingly, some days ago there’s an entry from NetworkManager reporting that the link came up.

I also tried bringing the physical parent interface down and back up; that did break connectivity, as expected, and once the link came back up the instance regained its dynamic IP.

That’s pretty much what I saw when I restarted the link: instant dynamic IP leases obtained.

I am on 20.04 on this server though; I’m not sure if anything changed between there and 22.04.

That makes sense. It’s interesting that it works though, as I’ve observed when using macvtap interfaces with VMs that the resulting interface can only use the MAC address it is configured with, which I’d always assumed meant that layering interfaces on top of a macvlan interface would result in only the original MAC address being usable. Clearly this isn’t the case, and TIL that one can layer macvlan interfaces, at least in such a way that they propagate through to the underlying parent interface.

The reference to the snap update was a separate problem unrelated to this; I think I received a change that activated the consistency check between snapshots and snapshot volumes, and some Googling gave me a fix involving running some raw SQL.

Ah OK, yes that is from a consistency check that was added recently to ensure that instance snapshot DB records all match up, as at some point in the past instance snapshots were created without their accompanying storage volume records, which affected some older instances. We did add a patch to automatically fix a common scenario of inconsistency, but sadly not all cases were automatically fixable. See Lxc snapshot and lxc start Error: Instance snapshot record count doesn't match instance snapshot volume record count - #53 by tomp for more info.

Also, if you’re finding the LXD snap is refreshing at inconvenient times, there are ways to manage that which may work for you (including pinning to a specific feature release version); please see Managing the LXD snap.
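
For example, to track the 5.2 feature release channel you mentioned rather than latest (exact channel choice is up to you):

sudo snap refresh lxd --channel=5.2/stable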

Interesting. And you noticed the problem started after that point?

It’s hard to know what the issue was now that it has been fixed, as we can’t see the system in the state it was in at the time. Potentially we could have caught that situation and raised an error if it had occurred at instance start time, but as the instances were already running, and we do not periodically poll the parent interface state, it wasn’t reported as an error as such.

Yep, I’ve knocked it back to a fixed point release rather than latest, cheers.

I don’t know exactly when it started; I just noticed that a load of my home automation had failed, and when I looked at the containers the next day, DHCP leases had not renewed. So it’s possible the link went down in a way that didn’t look down, if you see what I mean.

I think a good starting change you could make is to flash up some sort of error about failed lease renewal, or general failures on interfaces where they were previously working. Even better, some sort of notification mechanism :slight_smile:

Cheers.
