Need help to explain network incident

Hi all,

I am trying to explain an incident we had on a production cluster of 10 machines.
The containers and VMs are mainly connected to an unmanaged Linux bridge (with VLANs, no STP), with the exception of one container connected to an OVN network.

The symptom seems to be that an L2 loop was created through br-int and the associated Geneve tunnels during an uplink reconfiguration (which appears to have failed while adding an ipv6.routes prefix). The result was a total loss of connectivity on the platform.
The physical switch logs show the container MAC addresses flapping between the different physical ports of the servers.

I don’t know the details of how OVN/OVS works, but is this a plausible situation? Or a design error (the parent of the uplink is connected to the Linux bridge mentioned above)?

I can share details of the network topology and versions.

Thanks for your help,

Guillaume

Hmm, that sounds pretty weird.

Our usage of OVN is basically designed for this scenario where the uplink is a shared L2 accessible on all hosts. We then have the external leg of the OVN router for each network sit on that uplink network (with its own MAC), while the internal leg sits in an OVN logical switch.

The OVN Geneve tunnels are only used to carry traffic within the internal logical switch; they should never interact with the external side of things.

Now one thing that could make a mess is the ARP proxying that we’re doing with OVN to handle ingress of ipv4.routes/ipv6.routes when there is no better way (like BGP) to get that traffic routed to the correct OVN router.

That basically causes OVN to respond to ARP/NDP for any address in those configured routes so that traffic ingresses correctly. Maybe your reconfiguration caused something weird to happen around that?

See ovn.ingress_mode to control that behavior.
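For example, you can check and switch that behavior on the uplink network like this (the network name here is illustrative):

```shell
# Show the current ingress mode; an empty value means the default ("l2proxy")
incus network get ovn_front_uplink ovn.ingress_mode

# Switch to routed mode so OVN stops answering ARP/NDP for
# ipv4.routes/ipv6.routes prefixes (appropriate when those prefixes
# are routed toward the uplink some other way, e.g. via BGP)
incus network set ovn_front_uplink ovn.ingress_mode=routed
```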

Thanks Stéphane, that’s exactly how I imagined an OVN network would work.
In this setup, internal IPs are advertised to the firewall in BGP. NAT (v4 and v6) is disabled (pure routing). But ovn.ingress_mode is not configured, so it’s probably defaulting to “l2proxy.” Should it be configured to “routed” in this case?

The documentation on this subject refers to external IPs: “Sets the method how OVN NIC external IPs will be advertised on uplink network.”

What confuses me is that the MAC address seen as flapping between hypervisors was not connected to the OVN network.

Last logs from ovn-controller before the outage:

2025-10-14T14:37:23.987Z|00356|lflow|WARN|error parsing match "(outport == "incus-net2-ls-int-lsp-router" && ip6.dst == <nil> && (udp.dst == 53 || tcp.dst == 53))": Syntax error at `<' expecting constant.

2025-10-14T14:37:30.802Z|00357|binding|INFO|Changing chassis for lport cr-incus-net2-lr-lrp-ext from 8c033b01-7d92-435c-aad6-3ece8381b4ed to a89ca6bf-3d74-4feb-ad73-0c6f0658b7e0.

2025-10-14T14:37:30.802Z|00358|binding|INFO|cr-incus-net2-lr-lrp-ext: Claiming 10:66:6a:64:d2:e6 10.100.101.220/24

2025-10-14T14:37:30.806Z|00359|binding|INFO|Setting lport cr-incus-net2-lr-lrp-ext up in Southbound

And incusd.log:

time="2025-10-14T16:37:23+02:00" level=error msg="Failed notifying dependent network" dependentNetwork=ovn_front driver=physical err="Failed allocating uplink port IPs on network \"ovn_front_uplink\": Uplink network doesn't have IPv4 or IPv6 configured" network=ovn_front_uplink project=default

Another point worth noting, in case it helps: the uplink was previously configured for IPv4 only. Then ipv6.routes was added but ipv6.ovn.ranges was omitted. Could this inconsistency have caused a configuration problem?
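For completeness, this is roughly the change that was made, versus what a consistent IPv6 uplink configuration would look like (the prefixes below are made-up documentation values, not our real ones):

```shell
# What was done: add routed IPv6 prefixes to an IPv4-only uplink
incus network set ovn_front_uplink ipv6.routes=2001:db8:100::/56

# What a consistent config would presumably also include: IPv6 on the
# uplink itself, plus an allocation range for the OVN router external legs
incus network set ovn_front_uplink ipv6.gateway=2001:db8:1::1/64
incus network set ovn_front_uplink ipv6.ovn.ranges=2001:db8:1::100-2001:db8:1::1ff
```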

I looked into the cause again, and ultimately I think this warrants a bug report: https://github.com/lxc/incus/issues/2630