[LXD] BGP address/route advertisement

Project LXD
Status Approved
Author(s) @stgraber
Approver(s) @stgraber @tomp
Release 4.18
Internal ID LX004

Abstract

Add the ability for LXD to advertise the subnets in use for its networks and instances to external routers over BGP.

Rationale

When running production environments on LXD, especially on LXD clusters, it’s becoming more and more common to want to directly route external addresses to specific instances, or even to have an entire LXD network backed by external (non-NATed) addresses. The most common setup is a mixed one where IPv4 uses an RFC1918 subnet with external addresses routed directly on an as-needed basis, while IPv6 connectivity directly uses an external subnet with additional (non-EUI64) addresses routed as needed.

Then with the recent addition of OVN support, it’s now possible for non-admin cluster users to self-create such networks, including getting to use external addresses through LXD’s support for delegating subnets through projects.

All of that works great today but relies on the network or system administrator having put routing in place so that all of those external addresses and subnets get routed to the right LXD server or to the right OVN gateway. This manual step gets in the way of allowing on-demand creation of OVN networks by regular users and having them route any of the addresses or subnets that are allowed in their projects.

With the addition of native BGP support to LXD, the administrator will be able to set up BGP peers on the relevant external LXD networks and LXD will then take care of advertising all relevant routes and next hops directly to the routers.

The concept was proven and is already in use in production environments through an external tool: GitHub - stgraber/lxd-bgp: A tiny BGP server in Go exposing LXD external routes

Specification

Design

With this change, LXD will become an optional BGP router, though in practice it will only ever advertise routes and won’t do anything with any received prefixes.

At the global level, there will be a configuration option to configure a listen address and port for LXD’s built-in BGP server. Then at the network level, there will be configuration for the relevant peers to notify as new addresses and subnets get used in LXD.

To keep things simple initially, LXD will announce both IPv4 and IPv6 prefixes over either protocol, requiring only a single session per peer (preferably over IPv6 if dual-stack).

Scenarios

Bridged network with IPv4 and/or IPv6 subnet using external addresses

This is probably the simplest environment where LXD operates a traditional managed LXD bridge and where the IPv4 and/or IPv6 subnets are not NATed. LXD will advertise the relevant subnets to the router which will then route them to the correct host.

OVN network with IPv4 and/or IPv6 subnet using external addresses

In this environment, we’ll need the uplink network for that OVN network to have BGP peers configured. When that’s the case, the IPv4 and/or IPv6 subnets will be advertised with a next-hop set to the OVN gateway.

In a cluster, all cluster members will be advertising the same route as OVN is distributed, so high availability of the route is expected.

External addresses/subnets routed to a specific instance on a bridged network

With this case, the instances will likely be running on private addressing with LXD’s dnsmasq in charge of assigning addresses. LXD will look for ipv4.routes.external and/or ipv6.routes.external to know what needs to be routed to the instance, will then set up a suitable route on the host (same as ipv4.routes or ipv6.routes) and finally will advertise a route to the host over BGP.
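As a sketch, attaching an external route to a bridged NIC might look like this (the instance name `c1`, device name `eth0`, and addresses are placeholder values; the ipv4.routes.external/ipv6.routes.external key names are the ones introduced by this spec):

```shell
# Route a public IPv6 /64 to the instance; LXD sets up the host route
# (as with ipv6.routes) and advertises it to the configured BGP peers.
lxc config device set c1 eth0 ipv6.routes.external 2001:db8:1234::/64

# IPv4 equivalent for a single external address.
lxc config device set c1 eth0 ipv4.routes.external 198.51.100.10/32
```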

External addresses/subnets routed to a specific instance on an OVN network

In this setup, the instance’s OVN network may or may not be using external addressing itself, but the instance’s NIC device has an ipv4.routes.external or ipv6.routes.external config key set.

In such a setup, LXD will be advertising a route for the relevant external routes on the instance NIC. In a cluster, only the host of the instance will be advertising the route, since if the host becomes incapacitated, so is the instance.

Integration with instance lifecycle

LXD will begin advertising instance-specific routes shortly after completing the instance startup sequence and will withdraw the advertisement shortly prior to shutting down the instance.

For networks, advertisements will be kept active so long as the network exists.

Behavior on LXD restart/update

As LXD will be the BGP router, under normal circumstances LXD exiting would cause all routes to be dropped by the upstream routers.

To prevent this, LXD will make use of BGP’s graceful restart feature, allowing a few minutes of downtime before the routes expire when shut down for a refresh/update.
When a full shutdown is requested (lxd shutdown or SIGPWR), LXD will instead withdraw all advertisements prior to shutting down.

API changes

No REST API changes are expected for this, however LXD will grow an additional listening port (typically tcp/179) when BGP is enabled and there will be a few additional configuration keys and tweaks to existing configuration.

New global configuration keys:

  • core.bgp_address (local, disabled by default, takes <ip>:<port>)
  • core.bgp_asn (global, empty by default, takes the local ASN)

Network-specific configuration keys:

  • bgp.peers.<name>.address (global, peer <ip>:<port>)
  • bgp.peers.<name>.asn (global, peer ASN)
  • bgp.peers.<name>.password (global, peer password, optional)
  • bgp.ipv4.nexthop (local, for bridged networks, override next hop)
  • bgp.ipv6.nexthop (local, for bridged networks, override next hop)

NIC-specific configuration keys:

  • ipv4.routes.external (now supported on bridged interfaces)
  • ipv6.routes.external (now supported on bridged interfaces)
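To illustrate, enabling the BGP listener and declaring a peer might look as follows (ASNs, addresses, and the peer name `tor1` are placeholder values; the key names are those listed above):

```shell
# Enable LXD's built-in BGP server on all addresses, standard BGP port.
lxc config set core.bgp_address "[::]:179"
lxc config set core.bgp_asn 65100

# Declare a peer on the uplink/bridge network that should advertise routes.
lxc network set lxdbr0 bgp.peers.tor1.address "[2001:db8::1]:179"
lxc network set lxdbr0 bgp.peers.tor1.asn 65000
lxc network set lxdbr0 bgp.peers.tor1.password s3cret  # optional
```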

CLI changes

No CLI changes; this only affects configuration keys.

Database changes

No schema changes; this only affects configuration keys.

Upgrade handling

No special handling needed as the feature did not previously exist.

Some constraints will be relaxed to allow the use of ipv4.routes.external and ipv6.routes.external on regular bridges, but this won’t change behavior on upgrade.

Further information

Prototype: GitHub - stgraber/lxd-bgp: A tiny BGP server in Go exposing LXD external routes

This will also enable proper anycast setups through the use of ECMP routes.
When ipv4.routes.anycast and/or ipv6.routes.anycast are configured, multiple instances on OVN networks will be able to advertise the exact same address or subnet.

This then results in multiple routes on the upstream router with equal weight. The L3 information then gets hashed and traffic gets balanced between the instances.
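As a hedged sketch (it is an assumption here that the anycast keys act as per-NIC flags alongside the existing external route keys, and the instance and device names are placeholders), two OVN-attached instances could share an address like this:

```shell
# Both instances advertise the same external address; the anycast flag
# permits the otherwise-conflicting overlap, yielding ECMP upstream.
lxc config device set web1 eth0 ipv4.routes.external 203.0.113.50/32
lxc config device set web1 eth0 ipv4.routes.anycast true
lxc config device set web2 eth0 ipv4.routes.external 203.0.113.50/32
lxc config device set web2 eth0 ipv4.routes.anycast true
```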


Looks good to me!

One part I wasn’t clear on was these two config options of each network:

Is this for overriding the host’s external IP used to advertise the next hop address to the upstream routers? And does “local” mean it can be specified on a per-cluster-member basis? Can we automatically guess that based on the route used to reach the BGP peer address?

It’s indeed a per-member option to override the next-hop advertised to the BGP router. By default, we’d go with 0.0.0.0/0 or ::/0 which would just be ourselves (whatever is used for the session), but this may not always work well when using a single BGP session for two network families or in some environments where hosts have several IP addresses for traffic segregation.
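For example, overriding the advertised next-hop on a single cluster member might look like this (the member and network names are placeholders, and using `--target` for this per-member key is an assumption based on the key being scoped as local):

```shell
# Advertise 192.0.2.10 as the IPv4 next-hop from member "server1" only.
lxc network set lxdbr0 bgp.ipv4.nexthop 192.0.2.10 --target server1
```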


Ah excellent, thanks.

This is a terrific addition for LXD networking capabilities, thanks so much!

Are you happy with this spec? If so, I’ll mark it as approved and start on implementation.

@stgraber I do have one more question regarding the external routes on a bridged NIC.

With the equivalent settings on an OVN NIC, the external routes are routed in OVN to the specific NIC’s IP using the OVN allocated (or statically defined) IP address on the NIC. These are always available and are statically allocated by OVN, even if not explicitly specified by the user.

In comparison, bridged NICs don’t always have an IP allocated, and it can change during the NIC’s lifetime via DHCP lease changes.

Should we force/require a static allocation for NICs using this feature, like we do with IP filtering and proxy NAT mode? So that we have a specific target IP for the host-side routes?

Also, are there any considerations regarding fan networks? I think it would be OK as each LXD server will have its own local subnet on the fan that it can advertise independently of the overall fan subnet.

I don’t think we do. The next-hop will be set to the host’s address and the host will then send the traffic to the instance. This is different from OVN as with OVN it wasn’t possible to effectively put a /64 or similar to be on-link without needing to add NAT entries for every single address. That isn’t a problem with bridged networking.

So my current tests are to simply have ipv4.routes.external and ipv6.routes.external in the bridged case be treated identically to the current ipv4.routes and ipv6.routes but with the difference that the host will advertise the route over BGP.

My current thought there is to just ignore this case. The goal of the Fan is to make a bunch of hosts magically routed between each other so they can reach their instances.

The BGP support makes that happen without a fan bridge. You can just have a lxdbr0 on each of the cluster servers and they’ll get routed networking through their subnets being advertised to the router over BGP.


I see. That is somewhat different from the OVN behaviour, as it won’t really be routing the traffic to the NIC like with OVN, but rather it will be routing it to the bridge, at which point we rely on the instance’s NIC to be responding to ARP/NDP requests for the external IPs (like we do with ipv{n}.routes now). Whereas with OVN NICs the traffic is actually routed to the NIC’s private IP, and you don’t actually need to advertise the external IPs onto the bridge network via ARP/NDP.

Probably worth mentioning that in the reference docs to highlight the difference.

I approve the design 🙂

Yeah, that’s a general difference in behavior of ipv4.routes/ipv4.routes.external and ipv6.routes/ipv6.routes.external between bridged and OVN.

I think we could change the logic to set a via whenever an ipv4.address or ipv6.address is set on the NIC. That wouldn’t make any difference for the BGP feature but would allow getting the same behavior as OVN in some cases.


Yes, that would be cool. As you say, it wouldn’t affect the routes announced by BGP, just those actually set up on each host.