Working with the MTU on OVN Networks

So I’m finding myself setting up cloud-init for every instance I have, in most cases “just” to set the instance MTU to be 2 less than the network MTU in order to get external connectivity. This is an unsustainable amount of unnecessary work.

Is there any way to set the instance MTU independently of the network MTU somewhere in the profile?
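
For reference, the cloud-init I keep attaching looks roughly like this (just a sketch of my setup; the profile name, interface name and MTU value are examples):

incus profile set default cloud-init.network-config - << 'EOF'
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
      mtu: 1280
EOF

That at least keeps the setting in one place per profile, but it still depends on cloud-init actually running inside the guest.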

Background

My scenario is that I tune the network MTU so it works for OVN internal traffic. Everything is happy, including traffic between the local cluster and the remote cluster over an OVN-IC link (via WG). However, traffic via the OVN gateway out to the Internet hits fragmentation (this is just going out directly via a generic NAT router). To combat this, I change the instance MTU to network MTU - 2, and everything is happy.

So, I reduce the network MTU by 2 and restart. Broken again, hitting fragmentation on external traffic. I set the instance MTU to the new network MTU - 2. Everything is happy again.

[Repeat until we hit 1280 and it won’t let me go any lower]

I set up cloud-init to configure the MTU on each instance to 1280, with the network MTU set at 1300.
Everything is very happy … except me, and the OCI instances, which I just can’t get to take any sort of cloud-init or custom boot script.

I don’t understand “why” the instance MTU needs to be 2 less than the network MTU, and if there’s a way to remove this limitation that would be great, but I’ve no leads on how to do this. Otherwise, something in the settings like “instance MTU offset”, or the ability to set the instance MTU independently of the network MTU would work …

stgraber@shell01:~$ ping -4 linuxcontainers.org -M do -s 1472
PING linuxcontainers.org (45.45.148.7) 1472(1500) bytes of data.
1480 bytes from rproxy.dcmtl.stgraber.org (45.45.148.7): icmp_seq=1 ttl=61 time=6.06 ms
^C
--- linuxcontainers.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 6.057/6.057/6.057/0.000 ms
stgraber@shell01:~$ ping -4 linuxcontainers.org -M do -s 1473
PING linuxcontainers.org (45.45.148.7) 1473(1501) bytes of data.
ping: local error: message too long, mtu=1500
^C
--- linuxcontainers.org ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

stgraber@shell01:~$ ping -6 linuxcontainers.org -M do -s 1452
PING linuxcontainers.org(rproxy.dcmtl.stgraber.org (2602:fc62:a:1::7)) 1452 data bytes
1460 bytes from rproxy.dcmtl.stgraber.org (2602:fc62:a:1::7): icmp_seq=1 ttl=62 time=5.72 ms
^C
--- linuxcontainers.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.719/5.719/5.719/0.000 ms
stgraber@shell01:~$ ping -6 linuxcontainers.org -M do -s 1453
PING linuxcontainers.org(rproxy.dcmtl.stgraber.org (2602:fc62:a:1::7)) 1453 data bytes
ping: local error: message too long, mtu: 1500
^C
--- linuxcontainers.org ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

stgraber@shell01:~$ ip link show dev eth0
79: eth0@if80: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 00:16:3e:3c:23:55 brd ff:ff:ff:ff:ff:ff link-netnsid 0
stgraber@shell01:~$ 

Yeah, I get it works for you, but …

# ifconfig eth0 mtu 1300 # network mtu
# ping -c1 -4 linuxcontainers.org -M do -s 1272
PING linuxcontainers.org (45.45.148.7) 1272(1300) bytes of data.
1280 bytes from rproxy.dcmtl.stgraber.org (45.45.148.7): icmp_seq=1 ttl=49 time=90.2 ms
# ping -c1 -4 linuxcontainers.org -M do -s 1273
PING linuxcontainers.org (45.45.148.7) 1273(1301) bytes of data.
ping: local error: message too long, mtu=1300

# speedtest
Testing download speed...........................Download: 4.30 Mbit/s
Testing upload speed.............................Upload: 105.80 Mbit/s

# ifconfig eth0 mtu 1280
# ping -c1 -4 linuxcontainers.org -M do -s 1252
PING linuxcontainers.org (45.45.148.7) 1252(1280) bytes of data.
1260 bytes from rproxy.dcmtl.stgraber.org (45.45.148.7): icmp_seq=1 ttl=49 time=92.4 ms
# ping -c1 -4 linuxcontainers.org -M do -s 1253
PING linuxcontainers.org (45.45.148.7) 1253(1281) bytes of data.
ping: local error: message too long, mtu=1280

# speedtest
Testing download speed...........................Download: 701.95 Mbit/s
Testing upload speed.............................Upload: 106.39 Mbit/s

Lowering the network MTU doesn’t help; the instance MTU always needs to be at least 2 less than the network MTU to get the speed. Whatever I do, if I make the instance MTU equal to the network MTU (i.e. the default, without messing with cloud-init), my download is crippled. This speedtest runs against an Internet URL; if I try iperf against an internal OVN address or an OVN-IC address, throughput is fine.

MTU is the maximum transmission unit. The fact that changing your MTU improves your download speed means that you have some kind of problem with PMTU in the other direction, and that lowering the MTU is just a way to lower your MSS, which then papers over the PMTU issue.

Basically what happens is:

  • Your instance sees an interface MTU of 1500
  • It sets the TCP MSS to match that (14XX)
  • The request goes out to the target server; it may or may not hit the MTU limit along the way
  • If the MTU is hit, the PMTU notification tells the client to go with smaller packets

That part seems all fine. What happens next is the issue:

  • Server gets the request with the TCP MSS matching an MTU of 1500
  • Server therefore assumes it can respond up to 1500 MTU
  • It sends packets back to you and hits the MTU issue
  • It doesn’t get a proper PMTU notification back so it doesn’t know the packet was lost
  • You then mostly rely on TCP retry and luck for things to actually make it through

The proper fix would be to validate that PMTUd works reliably in both directions.
The common workaround and something I’d still always do in environments with WAN MTU < 1500 is to have the router which deals with the < 1500 WAN do MSS clamping.
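
The outgoing direction is easy enough to spot-check with tracepath, which reports the path MTU hop by hop (the return direction needs the equivalent test run from the remote end):

tracepath -n linuxcontainers.org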

If it’s Linux, that’s done with iptables; look for “clamp MSS to PMTU”. It’s basically a bit of packet mangling that you do at the edge of your network so that the remote server receives an adequate MSS for TCP connections and doesn’t need to go through the whole PMTU dance for TCP traffic.
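
On an iptables-based router that’s typically a single mangle rule on the forward path; the nftables equivalent is similar (rule sketches only, adjust the table/chain names to whatever your firewall already uses):

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

nft add rule inet mangle forward tcp flags syn tcp option maxseg size set rt mtu  # assumes an existing inet table named mangle with a forward chain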

Ok, many thanks, that helps a lot in understanding the problem. I’m struggling however to implement clamping inside OVN. I’ve tried at the Linux kernel level and it would appear all the Geneve / Open vSwitch traffic bypasses the standard nftables hooks, so I’m guessing there must be an OVN mechanism for this … not found it yet tho’ …

I’ve never done it with OVN since my OVN deployments usually run the Geneve traffic on networks with jumbo frames, so OVN has a functional 1500 MTU. The clamping instead tends to be needed somewhere outside of OVN, due to going over some kind of tunnel (WireGuard typically), so it’s then the WireGuard router which has the clamping, as that’s where the MTU gets lowered.

Instead of clamping to the path MTU, you can also just use set-mss to set a specific MSS for any traffic going through your router, so that could also be an option as something you can do on a router outside of OVN.
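
For example, pinning the MSS on the same kind of router (1240 here assumes a 1280 path MTU for IPv4: 1280 - 20 bytes IP - 20 bytes TCP):

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1240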

Mmm, I’m still on quite a steep learning curve with OVN. Today I realised that although I have a WG link on each node connected to the far end of the IC, and each node can use that route to peer directly over the IC (i.e. the OVN-IC is effectively a mesh), on my cluster side it’s sufficiently dumb that it routes all outgoing traffic (i.e. everything headed for the OVN gateway address) via one node. Although I lose nothing in bandwidth because I only have a 1G uplink, having to route through an additional internal node before being forwarded to the outgoing router seems less than efficient. I’m guessing there’s a way to insert a route to get around this, but as a default behaviour it’s a little disappointing.

I seem to be left with a handful of little issues like this that kind of spoil the soup, but once I can get a handle on them, overall it’s looking pretty good.

I’d love to make more use of OCI, but I’ve started to move back onto system containers. Trying to get multiple interfaces and cloud init into OCI images is proving to be too time consuming for me atm … :frowning:

Ok, I’ve made a bit of a discovery after reading the docs a little more. As the MTU issue seems to be connected to traffic going through Geneve to the outside world, I wondered if I could bypass Geneve for non-OVN traffic, which would solve the problem.

It seemed previously that all ovn-encap-ip addresses needed to be on the same segment in order for OVS to correctly set up and peer both locally and over the IC, so to do that I was using the WireGuard IP address, which means local peering traffic was hopping through the WG interface rather than going directly over Geneve. Not sure what the performance implications of this would be, but it’s certainly not as efficient as it would be if OVN traffic was going directly into Geneve and skipping the WG hop.

Apparently this can be done, albeit in a slightly convoluted way. Maybe this is old news, but it’s not something that was obvious to me previously. You can set ovn-encap-ip to a list of addresses, in this case the address of the local bridge AND the address of the WG interface, and it will set up TWO peering sessions and use the most appropriate one when pushing traffic.

On each node I’ve done something like this (different addresses on each node, obviously):

ovs-vsctl set open_vswitch . external_ids:ovn-encap-ip=192.168.4.9,192.168.2.5
ovs-vsctl set open_vswitch . external_ids:ovn-encap-ip-default=192.168.2.5 # bridge
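
A quick sanity check that the settings took is just to read the external_ids back:

ovs-vsctl get open_vswitch . external_ids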

As a result, local cluster traffic is routed through geneve over the local bridge, and IC traffic is routed directly through the WG interface to the remote end of the IC. Well … tcpdump seems to say this is the case anyway.

Looking at one of the nodes in question:

# ifconfig br1|grep netmask
        inet 192.168.2.5  netmask 255.255.255.0  broadcast 192.168.2.255
# ifconfig ovn|grep netmask
        inet 192.168.4.9  netmask 255.255.255.0  destination 192.168.4.9
# ovs-vsctl show # on another node
Bridge br-in

Port ovn-a693a6-0
  Interface ovn-a693a6-0
  type: geneve
  options: {csum="true", key=flow, remote_ip="192.168.2.5"}
    bfd_status: {diagnostic="Control Detection Time Expired", flap_count="3",
    forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}
Port ovn-a693a6-1
  Interface ovn-a693a6-1
  type: geneve
  options: {csum="true", key=flow, remote_ip="192.168.4.9"}
    bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true",
    remote_diagnostic="Control Detection Time Expired", remote_state=up, state=up}

For an Incus/OVN setup it feels like this should be the default … it doesn’t fix my MTU problem tho’. All I’ve been able to establish today (I think) is that Open vSwitch doesn’t facilitate MSS clamping, or indeed messing with the MSS at all. You can set mtu_request on the virtual interface, which seems to work, but it doesn’t seem to help with the issue … still looking …
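
For reference, the mtu_request tweak I mean is just an OVS interface column, along the lines of (interface name is a placeholder):

ovs-vsctl set interface IFACE_NAME mtu_request=1280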

I’ve added a local network / bridge interface to the same instance with a default MTU of 1500 and it works fine … so it looks like it’s going to be something to do with Geneve passing traffic “out” of its network.

@stgraber, so I’m having a hard time “proving” this, but from what I can see and read, some (or all) traffic leaving an instance is subject to the instance MTU, which by default is the same as the network MTU (in an OVN network); that works fine for traffic within the OVN network.

When traffic is routed “out” of the OVN network, at some point after leaving the instance the packet size is increased, but it then hits the MTU of another layer and fails, i.e. it requests fragmentation and is dropped. I don’t know precisely how the different layers interact, but it seems the Geneve header has a variable-length options section which can change depending on how a packet is being routed, so there may be an expectation of different encapsulation lengths based on the traffic target … either way there seems to be something screwy with MTUs for some traffic being forwarded through Geneve.

  • Maintaining the network MTU while dropping the instance MTU by 2 is a 100% fix
  • Routing external traffic out through a second interface (local bridge) is a 100% fix
  • Routing external traffic through geneve “seems” to be the problem

While I can see the cause and effect, I can’t actually produce hard evidence, or at least not yet; maybe more will click over time.

Either way, a solution (which would seem to be desirable irrespective of this problem) would be, from what I read, for Incus to create a localnet port on each switch and then map non-OVN traffic to this port (using ovn-bridge-mappings?). I’m guessing there will be cases where you want to route all traffic through one point, but having at least the option to route directly opens up the possibility of an (n)x network performance increase where n = cluster nodes, which would be good.
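
For what it’s worth, the OVN-level plumbing behind that idea seems to be a bridge mapping on each chassis plus a localnet port on the logical switch, something like the following (names are made up for illustration, and doing this by hand on an Incus-managed network may well conflict with what Incus itself configures):

ovs-vsctl set open_vswitch . external_ids:ovn-bridge-mappings=physnet1:br-ex   # on each chassis
ovn-nbctl lsp-add SWITCH_NAME ln-physnet1                                      # on the northbound DB
ovn-nbctl lsp-set-type ln-physnet1 localnet
ovn-nbctl lsp-set-options ln-physnet1 network_name=physnet1
ovn-nbctl lsp-set-addresses ln-physnet1 unknown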

Unless I’m missing something in Incus … I can see Kubernetes seems to have an explicit way to control OVN/egress routing to achieve this … is there already something similar in Incus I’ve not seen?

I’m not sure if things have changed within OVN to be able to accommodate the real-time syncing of state across multiple OVN chassis that would be needed to correctly support having traffic leaving a logical router go out through the local chassis.

Basically the issue is around NAT and/or stateful ACL rules as those generate state entries and then generally require the return traffic to go through the same path.

Mm, not sure I’m understanding the finer points of this. I understand this is an issue for things like load balancers, where you’re potentially dealing with multiple routes for the same traffic … but for “normal” traffic I only see three scenarios …

  • traffic is routed within the OVN network
  • traffic is coming in from an external network and being sent back to that external network
  • traffic is originating inside the OVN network and being sent back to inside that network

The first is kind of a done deal.

The second, I’m assuming that in this scenario traffic is natted as it comes into the OVN network, so once inside the network the return address is that of the gateway it came through … so it will return via the same path and traffic will be symmetrical.

For traffic leaving the network via a local bridge, it will be natted outgoing at the bridge and hence return through the same bridge, again symmetrical.

If you route traffic in without natting then sure you have a problem, but to me this kind of negates the whole point of a private network so not necessarily default or expected behaviour.

I have an instance at the moment that straddles my local network and two OVN networks, with the default route pointing out via the local bridge. The two OVN networks work fine (MTU 1280) and the Internet access via the bridge works fine (MTU 1500) … but this is using three interfaces within the instance. What I’m suggesting is that the local bridge in this scenario could be implemented in the OVN switch, with local route(s) to the OVN network ranges pointing at the OVN gateway and the default route pointing at the bridge gateway.

Note on Docker

I’ve been having trouble updating my Discourse forums since moving them onto the OVN network, even after updating the MTU to 1280. It turns out the Docker network defaults to an MTU of 1500, so the instance of Docker inside the container “also” needs to be pinned to a lower MTU.

systemctl edit docker

[Service]
ExecStart=
ExecStart=/usr/sbin/dockerd --mtu=1280 -H fd:// --containerd=/run/containerd/containerd.sock

Or whatever new MTU the network is using.
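
An alternative to editing the unit is the daemon config file, and as far as I can tell the daemon-level MTU only covers the default bridge, so user-defined networks may need the MTU set at creation time (the network name here is just an example):

# /etc/docker/daemon.json
{ "mtu": 1280 }

docker network create -o com.docker.network.driver.mtu=1280 discourse_net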

If traffic for external networks is routed via a local bridge, it would just continue to work and no changes would be needed.

The current routing default introduces a significant challenge when it comes to moving pre-existing services onto an OVN / Clustered setup. It would be great if the bar to entry weren’t quite so high :slight_smile:

I’ve only found one solution that works for me, and that’s to avoid OVN completely.