Incus OVN NAT Failure when over 100 Networks

Incus 6.8
Debian Bookworm
ovn-host/stable,now 23.03.1-1~deb12u2
openvswitch-switch/stable,stable-security,now 3.1.0-2+deb12u1
Linux gcihost01 6.11.10+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.11.10-1~bpo12+1 (2024-12-19) x86_64 GNU/Linux

I’m running into an issue where, if I have more than about 100 logical routers in OVN, NAT starts randomly failing for existing and newly created networks.

This seems to /somewhat/ be a known issue on the openvswitch mailing list, as it has come up in OpenStack land, but it went unresolved since it’s uncommon to have 100+ routers on a single node.

The only log lines I can find that make sense when it’s dropping packets are below.

My current network setup is HW ETH0 → BRIDGE → VETH → OVN.

I use a bridge on the main interface so I can still use it while Open vSwitch has OVN running.
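
For reference, a rough sketch of how that chain gets wired up with iproute2 (eth0 is a placeholder for the physical NIC; MAIN-NAT, veth1 and UPLINK match the interfaces shown further down, but these aren’t my exact commands):

# host bridge on the physical NIC so the host keeps using it
ip link add name MAIN-NAT type bridge
ip link set eth0 master MAIN-NAT
ip link set MAIN-NAT up

# veth pair: one end stays in the host bridge, the other is handed to OVS/OVN as the uplink parent
ip link add veth1 type veth peer name UPLINK
ip link set veth1 master MAIN-NAT up
ip link set UPLINK up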

2025-01-31T22:18:41.428Z|00005|ofproto_dpif_xlate(handler5)|WARN|Dropped 666 log messages in last 63 seconds (most recently, 24 seconds ago) due to excessive rate
2025-01-31T22:18:41.428Z|00006|ofproto_dpif_xlate(handler5)|WARN|over 4096 resubmit actions on bridge br-int while processing icmp6,in_port=1,vlan_tci=0x0000,dl_src=00:16:3e:22:f3:1a,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::216:3eff:fe22:f31a,ipv6_dst=ff02::2,ipv6_label=0x72933,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=133,icmp_code=0
2025-01-31T22:19:40.906Z|00001|ofproto_dpif_xlate(handler162)|WARN|Dropped 175 log messages in last 59 seconds (most recently, 36 seconds ago) due to excessive rate
2025-01-31T22:19:40.906Z|00002|ofproto_dpif_xlate(handler162)|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=1,vlan_tci=0x0000,dl_src=00:16:3e:75:bf:85,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=100.65.0.254,arp_tpa=100.65.100.53,arp_op=1,arp_sha=00:16:3e:75:bf:85,arp_tha=00:00:00:00:00:00
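
For anyone trying to reproduce the drop, the offending flow can be traced through br-int with ofproto/trace, which should show where the resubmit chain blows past the limit. The fields below are copied from the ARP log line above, so treat this as an illustrative invocation rather than my exact one:

ovs-appctl ofproto/trace br-int 'in_port=1,arp,dl_src=00:16:3e:75:bf:85,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=100.65.0.254,arp_tpa=100.65.100.53,arp_op=1'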

Here is my Incus uplink network as it stands, with some redactions:

config:
  dns.nameservers: 100.65.0.254
  ipv4.gateway: 100.65.0.254/16
  ipv4.ovn.ranges: 100.65.50.10-100.65.59.254
  parent: UPLINK
  volatile.last_state.created: "false"
description: ""
name: UPLINK
type: physical
used_by:
- /1.0/networks/<REDACTED>
- /1.0/networks/<REDACTED>
- /1.0/networks/<REDACTED>
- /1.0/networks/<REDACTED>
- /1.0/networks/<REDACTED>
- /1.0/networks/<REDACTED>
- /1.0/networks/<REDACTED>
- /1.0/networks/<REDACTED>
- /1.0/networks/<REDACTED>
- /1.0/networks/<REDACTED>
		.... etc x151 ...
managed: true
status: Created
locations:
- none
project: default
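
For context, each OVN network in that used_by list was created against this uplink with something like the following (net001 is a placeholder name):

incus network create net001 --type=ovn network=UPLINK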

And here is my systemd-managed veth pair:

11: veth1@UPLINK: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master MAIN-NAT state UP group default qlen 1000
    link/ether 12:9e:a9:e6:30:b4 brd ff:ff:ff:ff:ff:ff
12: UPLINK@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether 5a:a8:8b:e2:8d:3b brd ff:ff:ff:ff:ff:ff
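
For completeness, the pair above is the kind of thing systemd-networkd creates from .netdev/.network units along these lines (a sketch, not my exact unit files; the UPLINK end is picked up by OVS/Incus, so only the bridge side gets a .network file here):

# /etc/systemd/network/25-uplink.netdev
[NetDev]
Name=veth1
Kind=veth

[Peer]
Name=UPLINK

# /etc/systemd/network/25-uplink.network
[Match]
Name=veth1

[Network]
Bridge=MAIN-NAT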

I’ve seen systems running a rather astronomical number of OVN networks that weren’t hitting this, but they were running significantly newer OVS and OVN too.

I’m planning to set up a new package repository (similar to what I have for Linux, ZFS, Incus, …) with the latest stable OVS and OVN for Ubuntu and Debian. That should happen in the next few weeks and should make debugging such issues easier.

I’m happy to test that! I’m a bit of a Debian nerd and can even provide hosting, as I work at a university (Rochester Institute of Technology). I can also provide some compute resources for automated builds.

Edit: For now, I’m going to work on compiling the latest OVN/OVS with the kernel module. Is there a good way to handle the database rebuild in Incus, or do I need to purge the networks and recreate them?
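
(Generic OVN sanity checks, not an Incus-specific rebuild procedure, but after swapping in the newer packages I’d at least confirm the northbound/southbound databases still hold everything before touching the Incus side:)

ovn-nbctl lr-list | wc -l    # should roughly match the number of OVN networks
ovn-sbctl show               # chassis and port bindings should still be present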