My default lxdbr0 suddenly lost connectivity to containers

I was fiddling with nmcli and resolvectl, trying to configure DNS forwarding to a container within the LXD network, when I noticed that my containers got no IP addresses.
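For context, what I was trying was roughly along these lines (the '~lxd' routing domain is just an example of what I was after; 10.189.141.1 is the bridge's address):

> resolvectl dns lxdbr0 10.189.141.1
> resolvectl domain lxdbr0 '~lxd'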

lxdbr0 here is a managed bridge, and containers previously acquired DHCP leases normally.

When I assigned IP addresses inside the containers they started pinging each other, but they could not ping the bridge address. In the packets I captured on lxdbr0 with tshark I could see ARP requests for the bridge's address but no replies. The same happened when pinging containers from the host: ARP requests for the container's address, with no replies.
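For reference, the capture was something along these lines (standard libpcap capture filter syntax):

> sudo tshark -i lxdbr0 -f "arp or icmp"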

I created another managed network and attached a couple of containers to it. They acquired addresses (after some reconfiguration), so I could have deleted the default network and recreated it, which would presumably have resolved the issue. But I wanted to find the actual cause and fix it for real. That is when I decided to post on this forum asking for help.
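For reference, the test setup was roughly this, with c1 and c2 standing in for my actual container names:

> lxc network create lxdbr1
> lxc network attach lxdbr1 c1 eth0
> lxc network attach lxdbr1 c2 eth0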

But first I needed some more data to attach, and tshark's interface isn't very comfortable for drilling into individual packets, so I fired up Wireshark and started a capture on lxdbr0. I immediately noticed something tshark hadn't shown me: packets from the containers were VLAN-tagged with VID 1, while packets from the bridge were untagged. I checked with another bridge: there, packets went untagged in both directions. What could have caused this pointless tagging?

A cursory search on Linux bridge VLAN tags led me to this Unix StackExchange question: https://unix.stackexchange.com/questions/546136/bridged-interfaces-and-vlan-tags. So I typed in the command to view the current VLAN configuration:

> bridge -d vlan
port	vlan ids
lxdbr0	None
lxdbr1	 1 PVID Egress Untagged
vethe722bae8	 1 PVID Egress Untagged
vethe044557c	 1 PVID Egress Untagged
veth3e3bc5e4	 1 PVID Egress Untagged
veth42e3c8f4	 1 PVID Egress Untagged
— it left me somewhat confused. It turns out that every bridge port is by default a member of VLAN 1 (PVID, egress untagged), and my bridge's own "self" port had somehow lost that membership. Well, it could be fixed with this somewhat weird incantation:

> bridge vlan add dev lxdbr0 vid 1 pvid untagged self

Containers got their addresses straight away and could access the outside network.
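If the defaults work the way I now understand them, bridge vlan show dev lxdbr0 should report the same membership that lxdbr1 had above, i.e. something like:

> bridge vlan show dev lxdbr0
port	vlan ids
lxdbr0	 1 PVID Egress Untagged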

The only questions left are how and why this happened, and whether it can happen again.

From the way you describe it, it sounds like you've removed the address from the lxdbr0 interface.

Given that nmcli is used for configuring interface addresses (amongst other things), it's likely that whatever you did with it has somehow removed the lxdbr0 config.

Please show the output of ip a and ip r on the host, as well as the output of sudo ps aux | grep dnsmasq and sudo ss -ulpn.

Whelp, the problem repeated itself after a reboot.

@tomp no, the IP addresses are OK (otherwise re-adding the bridge's "cpu" port to VLAN 1 wouldn't have fixed the issue):

> ip addr show lxdbr0
15: lxdbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:7c:32:8f brd ff:ff:ff:ff:ff:ff
    inet 10.189.141.1/24 scope global lxdbr0
       valid_lft forever preferred_lft forever
    inet6 fd42:e401:af11:5962::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::b1e2:7d27:f42b:602f/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

I don't know golang, but this line: https://github.com/lxc/lxd/blob/fc2694b6d4fc032a1fe07f3e94c364ccb9ce70fa/lxd/network/driver_bridge.go#L564 suggests that the bridge is created with default settings, and VLAN filtering is then enabled here: https://github.com/lxc/lxd/blob/fc2694b6d4fc032a1fe07f3e94c364ccb9ce70fa/lxd/network/driver_bridge.go#L690 via sysfs manipulation.
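If I read that right, the shell equivalent would be something like this (my reconstruction with a throwaway bridge name, not what LXD literally runs):

> sudo ip link add tmpbr2 type bridge
> echo 1 | sudo tee /sys/class/net/tmpbr2/bridge/vlan_filtering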
On my system, right now, they are set as they are supposed to be:

> cat /sys/class/net/lxdbr0/bridge/vlan_filtering
1
> cat /sys/class/net/lxdbr0/bridge/default_pvid
1

What’s more interesting is that new bridges are by default created with their “self” port added to VLAN 1, i.e.

> sudo ip link add tmpbr0 type bridge
> ip link show dev tmpbr0
26: tmpbr0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:48:cf:93:3a:0f brd ff:ff:ff:ff:ff:ff
> bridge vlan show dev tmpbr0
port	vlan ids
tmpbr0	 1 PVID Egress Untagged
> bridge -d link show dev tmpbr0
26: tmpbr0: <BROADCAST,MULTICAST> mtu 1500 master tmpbr0 tmpbr0

Compare the output of the last two commands with the same commands for lxdbr0:

> bridge vlan show dev lxdbr0
port	vlan ids
lxdbr0	None
> bridge -d link show dev lxdbr0
15: lxdbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master lxdbr0

I've disabled the nmcli settings connection.autoconnect and bridge.stp that were set by, I suspect, Ansible's nmcli module (roughly as shown below).
Let's see what happens tomorrow after I bring the machine up again.
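For the record, the changes were roughly these (assuming NetworkManager named the connection after the device):

> sudo nmcli connection modify lxdbr0 connection.autoconnect no
> sudo nmcli connection modify lxdbr0 bridge.stp no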

You don’t want nmcli to be touching lxdbr0. Rebooting the system will allow LXD to recreate and reconfigure it as normal.

Reboot didn’t work last time…

Yes, I would prefer NetworkManager not to touch it. But both of my current LXD bridges show up green in nmcli connection and nmcli device. I guess my next step would be finding a way to make them "unmanaged" (sketched below), if removing "autoconnect" proves insufficient.
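If it comes to that, two ways I know of: per device,

> sudo nmcli device set lxdbr0 managed no

or, to make it survive reboots, a NetworkManager keyfile drop-in (the filename here is arbitrary), e.g. /etc/NetworkManager/conf.d/99-unmanage-lxd.conf:

[keyfile]
unmanaged-devices=interface-name:lxdbr*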

OK, this time it did the trick.