OVN + BGP Setup in cluster - only one active chassis

Hi,

We would like to utilize the LXD OVN support to manage our internal networking.

The plan is to use lxdbr0 as the UPLINK network and create multiple OVN subnets on top of it.

In a three-node test cluster, I’ve managed to successfully advertise both the OVN subnet and the UPLINK network to the BGP peer.
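For reference, the setup was created roughly like this (a sketch rather than my exact commands; the OVN network name my-ovn and the server-level BGP values are examples chosen to match the addresses below):

# Server-level BGP listener, set on each cluster member
# (core.bgp_routerid must differ per member):
lxc config set core.bgp_address :179
lxc config set core.bgp_asn 65101
lxc config set core.bgp_routerid 10.56.123.150

# UPLINK bridge with the OVN range and the BGP peer
# (in a cluster, run "lxc network create lxdbr0 --target=<member>"
# on every member before this final create):
lxc network create lxdbr0 \
  ipv4.address=10.34.56.1/24 ipv4.nat=false \
  ipv4.ovn.ranges=10.34.56.11-10.34.56.20 \
  bgp.peers.my-bgp.address=10.56.123.11 bgp.peers.my-bgp.asn=7675

# OVN network using lxdbr0 as its uplink:
lxc network create my-ovn --type=ovn network=lxdbr0 \
  ipv4.address=10.77.21.1/24 ipv4.nat=false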

My problem:

  • Only one member of the cluster is used as the active chassis for the subnet (I think this is expected).
  • But BGP does not know which one is active; the router just picks one cluster member at random.

In my case,

# lxc cluster list
10.56.123.150
10.56.123.151 (Chassis)
10.56.123.152

And lxc network info for my OVN network shows:

config:
  bridge.mtu: "1442"
  ipv4.address: 10.77.21.1/24
  ipv4.nat: "false"
  ipv6.address: none
  network: lxdbr0
  volatile.network.ipv4.address: 10.34.56.11

In my router (I’m using FRR on a VM as a virtual router), show ip bgp gives

   Network          Next Hop            Metric LocPrf Weight Path
*= 10.34.56.0/24    10.56.123.152                          0 65101 i
*=                  10.56.123.150                          0 65101 i
*>                  10.56.123.151                          0 65101 i
*  10.77.21.0/24    10.34.56.11                            0 65101 i
*>                  10.34.56.11                            0 65101 i
*                   10.34.56.11                            0 65101 i

When I want to access the OVN network (10.77.21.0/24), I guess the route resolves like this:

external machine
-> route to 10.77.21.0/24, next-hop 10.34.56.11
-> 10.34.56.11 resolved via 10.34.56.0/24: randomly pick one of 10.56.123.[150-152]  # Only 151 should be picked?

My experiments show a similar result: all the pings go to 10.56.123.152, but 10.56.123.151 is the correct chassis.
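For anyone debugging the same thing, the active chassis can be confirmed with OVN’s own tooling (a sketch; run it wherever the OVN NB/SB databases are reachable, and the port name placeholder has to be looked up first, e.g. via ovn-nbctl show):

# The chassis that has the OVN router’s chassis-redirect (cr-*) port
# bound to it is the active one for that network:
ovn-sbctl show

# Alternatively, list the gateway-chassis priorities for the router’s
# uplink port; the highest-priority live chassis is the active one:
ovn-nbctl lrp-get-gateway-chassis <router-uplink-port>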

I created several more OVN networks, and routing sometimes works: only when the member the router happens to pick is also the active chassis.

What am I missing here, or is this the expected behavior? Thanks for any advice!

Additional info:

lxc network info lxdbr0:

config:
  bgp.peers.my-bgp.address: 10.56.123.11
  bgp.peers.my-bgp.asn: "7675"
  ipv4.address: 10.34.56.1/24
  ipv4.dhcp.ranges: 10.34.56.5-10.34.56.10
  ipv4.nat: "false"
  ipv4.ovn.ranges: 10.34.56.11-10.34.56.20
  ipv4.routes: 10.77.21.0/24
  ipv6.address: none
  tunnel.lan.interface: enp5s0
  tunnel.lan.protocol: vxlan
  tunnel.lan.ttl: "1"

This is expected. If that chassis goes down then one of the other ones will take over.
The active chassis chosen for a particular OVN network is random, so the networks won’t all use the same chassis at the same time (this is to try and spread the uplink traffic out across the cluster members).

The issue is that LXD OVN networks expect a shared layer 2 uplink:

A high availability OVN cluster requires a shared layer 2 network, so that the active OVN chassis can move between cluster members (which effectively allows the OVN router’s external IP to be reachable from a different host).
Therefore, you must specify either an unmanaged bridge interface or an unused physical interface as the parent for the physical network that is used for OVN uplink. The instructions assume that you are using a manually created unmanaged bridge. See Configuring network bridges for instructions on how to set up this bridge.
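In other words, the documented setup looks something like this (a sketch, assuming an unmanaged bridge br0 that sits on a shared L2 segment; the network name UPLINK and the addresses are examples):

# Per-member step, pointing at the local unmanaged bridge:
lxc network create UPLINK --type=physical parent=br0 --target=<member>

# Final step applying the shared settings:
lxc network create UPLINK --type=physical \
  ipv4.gateway=10.34.56.1/24 \
  ipv4.ovn.ranges=10.34.56.11-10.34.56.20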

You are correct that every LXD cluster member currently advertises the OVN network’s external IP to all of the upstream routers, as it expects the next-hop address (volatile.network.ipv4.address 10.34.56.11) to be reachable from all the routers.

But in this case the 10.34.56.11 address isn’t directly reachable from the routers, and instead must go through the active chassis as the next-hop address.

We might be able to do something more intelligent here in scenarios where the uplink network isn’t a directly shared L2 link, such as only advertising the next-hop on the active chassis.
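Until then, a crude stopgap on the router side would be to pin the next-hop resolution yourself (a sketch in FRR config syntax, assuming 10.56.123.151 is the currently active chassis; it breaks again as soon as the chassis moves):

! Resolve the OVN router's external IP via the active chassis only
ip route 10.34.56.11/32 10.56.123.151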

Please can you open an issue here: Issues · lxc/incus · GitHub

In the meantime though, using a private bridge (such as lxdbr0) for the uplink is really only suitable for testing, for single-member deployments, or where you don’t need ingress.

I suspect we should get LXD to advertise a more specific route for the OVN network’s volatile.network.ipv4.address on only the active chassis.

Thanks for the reply, Thomas!

I think the reason we’re using lxdbr0 as the uplink is that our cluster configuration is complex: not every member of the cluster shares a layer 2 network.

I’ll open an issue in the repo later. In the meantime, I’ve thought of some workarounds for this issue (a sketch of the first one follows the list below):

  • I’ve found that it’s possible to manually edit the volatile.network.ipv4.address field so that it holds the address of the LXD server rather than the OVN router. Combined with some SNAT, this makes it possible to advertise a reachable route to the BGP peer.
    • However, for the OVN HA to keep working, this field needs to be updated dynamically on server failure, perhaps by watching events with lxc monitor.
  • (Maybe) another way is to make the OVN router reachable from all the LXD servers. I think it’s theoretically possible because lxdbr0 is virtually an L2 network (it spans the members over VXLAN); I guess this would involve some manipulation of ARP and the bridge fdb, etc. I haven’t been able to make it work yet.
    • Again, this method also requires some dynamic monitoring of which chassis the OVN router is currently bound to.
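Roughly what I have in mind for the first workaround (a rough sketch, not yet tested end to end; the addresses are the ones from this thread, and the commands run on the member that volatile.network.ipv4.address is pointed at):

# 1) Repoint the advertised next-hop at a cluster member instead of the
#    OVN router, by editing the OVN network (lxc network edit my-ovn):
#      volatile.network.ipv4.address: 10.56.123.151
# 2) On that member, route the OVN subnet towards the real OVN router IP
#    on lxdbr0, and SNAT so the return path stays symmetric:
ip route replace 10.77.21.0/24 via 10.34.56.11 dev lxdbr0
iptables -t nat -A POSTROUTING -d 10.77.21.0/24 -o lxdbr0 -j MASQUERADE
# 3) Watch for failover and repeat on the new chassis, e.g. with:
lxc monitor --type=lifecycle --pretty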