[LXD] OVN network to network routing

tomp · September 14, 2021, 1:30pm


Project	LXD
Status	Implemented
Author(s)	@tomp
Approver(s)	@stgraber
Release	4.20
Internal ID	LX008

Abstract

Provide the ability to specify peering relationships between OVN networks (including across projects) so that network traffic between the OVN networks stays within the OVN subsystem and doesn’t leave OVN and then re-enter.

Rationale

Currently traffic between OVN networks exits the OVN subsystem via the source network’s virtual router and goes into the uplink network where it may then re-enter the OVN subsystem via the target network’s virtual router. This is inefficient and means that the network bandwidth is limited by the uplink network’s capabilities. If the OVN setup is using faster networking for internal traffic, then it would also be possible to use the same faster networking capabilities for OVN<->OVN traffic by allowing peering relationships to be configured between OVN networks.

Specification

Design

OVN supports creating peering links between virtual routers by adding router ports to each router and setting the peer property of the router ports to the respective router port name.

For example, to create a peering link between two existing virtual routers:

lxd-net1-lr (LAN subnets 10.110.120.0/24, fd42:7832:3b4e:cffb::/64)
lxd-net2-lr (LAN subnets 10.105.164.0/24, fd42:5389:62b9:be7c::/64)

We can create two router ports, one on each router and reference the other as the peer.

To avoid having to set up a separate peering subnet, the peering ports reuse the MAC and IP addresses of the virtual router’s port on the internal LAN, albeit with a single-host subnet (/32 for IPv4 and /128 for IPv6). This effectively adds the same address to multiple ports on each router. This matters when setting up the static routes that use the peering connection: the router port must be specified explicitly, because OVN cannot deduce the correct port for the peering link automatically.

ovn-nbctl lrp-add lxd-net1-lr lxd-net1-lr-lrp-net-2 00:16:3e:d6:73:26 10.110.120.1/32 fd42:7832:3b4e:cffb::1/128 peer=lxd-net2-lr-lrp-net-1
ovn-nbctl lrp-add lxd-net2-lr lxd-net2-lr-lrp-net-1 00:16:3e:3f:e4:9f 10.105.164.1/32 fd42:5389:62b9:be7c::1/128 peer=lxd-net1-lr-lrp-net-2

LXD will then need to set up static routes on the respective virtual routers for the peered local subnets. The static routes will need to use the target router’s IP that was added on the peering ports as the nexthop address, and explicitly specify the local peering router port to use for egress traffic, e.g.:

ovn-nbctl lr-route-add lxd-net1-lr 10.105.164.0/24 10.105.164.1 lxd-net1-lr-lrp-net-2
ovn-nbctl lr-route-add lxd-net1-lr fd42:5389:62b9:be7c::/64 fd42:5389:62b9:be7c::1 lxd-net1-lr-lrp-net-2
ovn-nbctl lr-route-add lxd-net2-lr 10.110.120.0/24 10.110.120.1 lxd-net2-lr-lrp-net-1
ovn-nbctl lr-route-add lxd-net2-lr fd42:7832:3b4e:cffb::/64 fd42:7832:3b4e:cffb::1 lxd-net2-lr-lrp-net-1

This will then allow traffic to flow between networks without leaving the OVN subsystem.
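The port and route commands above follow a regular pattern, so they can be generated from a description of each side. The following is a minimal Go sketch of that pattern, not LXD's actual implementation: the `routerSide` type and the `hostCIDR` and `peeringCommands` helpers are hypothetical names introduced here for illustration, and the code only builds command strings without running them.

```go
package main

import (
	"fmt"
	"strings"
)

// routerSide is a hypothetical description of one side of a peering.
type routerSide struct {
	Router   string   // logical router name, e.g. lxd-net1-lr
	PeerPort string   // name of the peering port to create on this router
	MAC      string   // MAC of the router's existing internal LAN port
	Addrs    []string // router LAN addresses, without prefix length
	Subnets  []string // LAN subnets, in the same order as Addrs
}

// hostCIDR turns a bare address into a single-host subnet (/32 or /128).
func hostCIDR(addr string) string {
	if strings.Contains(addr, ":") {
		return addr + "/128"
	}
	return addr + "/32"
}

// peeringCommands returns the ovn-nbctl commands that link routers a and b.
func peeringCommands(a, b routerSide) []string {
	var cmds []string
	pairs := [][2]routerSide{{a, b}, {b, a}}

	// One peering port per router, each naming the other port as its peer.
	for _, p := range pairs {
		local, remote := p[0], p[1]
		addrs := make([]string, 0, len(local.Addrs))
		for _, addr := range local.Addrs {
			addrs = append(addrs, hostCIDR(addr))
		}
		cmds = append(cmds, fmt.Sprintf("ovn-nbctl lrp-add %s %s %s %s peer=%s",
			local.Router, local.PeerPort, local.MAC, strings.Join(addrs, " "), remote.PeerPort))
	}

	// Static routes to the remote subnets, with the egress port given explicitly.
	for _, p := range pairs {
		local, remote := p[0], p[1]
		for i, subnet := range remote.Subnets {
			cmds = append(cmds, fmt.Sprintf("ovn-nbctl lr-route-add %s %s %s %s",
				local.Router, subnet, remote.Addrs[i], local.PeerPort))
		}
	}
	return cmds
}

func main() {
	net1 := routerSide{
		Router: "lxd-net1-lr", PeerPort: "lxd-net1-lr-lrp-net-2", MAC: "00:16:3e:d6:73:26",
		Addrs:   []string{"10.110.120.1", "fd42:7832:3b4e:cffb::1"},
		Subnets: []string{"10.110.120.0/24", "fd42:7832:3b4e:cffb::/64"},
	}
	net2 := routerSide{
		Router: "lxd-net2-lr", PeerPort: "lxd-net2-lr-lrp-net-1", MAC: "00:16:3e:3f:e4:9f",
		Addrs:   []string{"10.105.164.1", "fd42:5389:62b9:be7c::1"},
		Subnets: []string{"10.105.164.0/24", "fd42:5389:62b9:be7c::/64"},
	}
	for _, c := range peeringCommands(net1, net2) {
		fmt.Println(c)
	}
}
```

Passing the egress port as the last argument of each route is the important part, since both routers carry the same address on more than one port.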

Route tables (avoiding asymmetric routing for NIC routes)

Because LXD’s OVN implementation supports routing additional prefixes to OVN NICs by specifying ipv{n}.routes and/or ipv{n}.routes.external, instance NICs could end up sending packets destined for the peer network with a source address outside of the source network’s primary subnet. This will lead to asymmetric routing (where the return packet leaves the OVN subsystem) and cause unexpected behaviour when using stateful ACLs or external firewalls.

To avoid this LXD will identify all of the possible prefixes being used by the peered network and add static routes for those prefixes to the local virtual router pointing towards the peer connection (and vice versa on the peered network’s virtual router).

Any changes to the peered network’s prefixes will be automatically applied to the local router’s routing table. This allows the peered network to indirectly influence the routing table of the local router.

Future work: Route prefix filtering
A possible extension in the future would be to add a prefix filter setting to the local peer connection entry to only allow specific prefixes to be added to the routing table. This would ensure that if the target network later adds a NIC level route that conflicts with addressing inside the source network that these routes are not automatically exported to the source network which would cause network disruption. It would also allow for the ability to create peer connections to multiple networks that may contain some prefixes that conflict with each other but not the source network. In this way the source network can select which prefix is reachable over which peer connection, rather than potentially importing a set of conflicting prefixes from the multiple peer networks.

Mutual peering

It will be possible for peering relationships to be established between OVN networks in different LXD projects.

Because OVN subnets are not guaranteed to be unique (even within a single LXD deployment), it is possible for overlapping subnets to be used in multiple OVN networks. As such, a peering relationship between OVN networks needs to be mutually agreed by both sides, and the peering validation process will check that the route prefixes being exchanged will not cause conflicts.
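The kind of conflict check described here can be illustrated with Go's standard net/netip package. This is only a sketch under assumptions: `parsePrefixes` and `conflictingPrefixes` are hypothetical helpers, the prefixes in `main` are made-up examples, and LXD's real validation may differ.

```go
package main

import (
	"fmt"
	"net/netip"
)

// parsePrefixes parses CIDR strings and normalises them to their network address.
func parsePrefixes(in []string) ([]netip.Prefix, error) {
	out := make([]netip.Prefix, 0, len(in))
	for _, s := range in {
		p, err := netip.ParsePrefix(s)
		if err != nil {
			return nil, fmt.Errorf("invalid prefix %q: %w", s, err)
		}
		out = append(out, p.Masked())
	}
	return out, nil
}

// conflictingPrefixes returns every (local, remote) pair of prefixes that overlap.
func conflictingPrefixes(local, remote []string) ([][2]netip.Prefix, error) {
	lps, err := parsePrefixes(local)
	if err != nil {
		return nil, err
	}
	rps, err := parsePrefixes(remote)
	if err != nil {
		return nil, err
	}

	var conflicts [][2]netip.Prefix
	for _, lp := range lps {
		for _, rp := range rps {
			// Overlaps is false for prefixes of different address families.
			if lp.Overlaps(rp) {
				conflicts = append(conflicts, [2]netip.Prefix{lp, rp})
			}
		}
	}
	return conflicts, nil
}

func main() {
	// Hypothetical prefixes: each side's primary subnet plus one extra NIC route.
	local := []string{"192.168.1.0/24", "10.0.0.0/8"}
	remote := []string{"192.168.2.0/24", "10.1.0.0/16"}

	conflicts, err := conflictingPrefixes(local, remote)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range conflicts {
		fmt.Printf("conflict: %s overlaps %s\n", c[0], c[1])
	}
}
```

A peering would only be accepted when the returned list is empty for the prefixes of both networks.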

Example workflow:

Create two OVN networks, each in a different project.

lxc network create ovn1 --type=ovn --project project1 \
    network=myuplink \
    ipv4.address=192.168.1.1/24

lxc network create ovn2 --type=ovn --project project2 \
    network=myuplink \
    ipv4.address=192.168.2.1/24

Initiate peer connection from ovn1 towards ovn2.
The initiator will have to know the correct project and network name to succeed. If either is incorrect, no error message will be returned. This is to avoid users in one project being able to enumerate existing projects or existing networks in another project.

lxc network peer create ovn1 mypeer-ovn2 project2/ovn2 --project project1
lxc network peer ls ovn1 --project project1
+-------------+-------------+---------------+---------+
| NAME        | DESCRIPTION | PEER          | STATE   |
+-------------+-------------+---------------+---------+
| mypeer-ovn2 |             | project2/ovn2 | PENDING |
+-------------+-------------+---------------+---------+

Confirm peer connection from ovn2 towards ovn1.
The user will have to know the correct project and network name to succeed. If either is incorrect, no error message will be returned.

lxc network peer create ovn2 mypeer-ovn1 project1/ovn1 --project project2
lxc network peer ls ovn2 --project project2
+-------------+-------------+---------------+---------+
| NAME        | DESCRIPTION | PEER          | STATE   |
+-------------+-------------+---------------+---------+
| mypeer-ovn1 |             | project1/ovn1 | CREATED |
+-------------+-------------+---------------+---------+

lxc network peer ls ovn1 --project project1
+-------------+-------------+---------------+---------+
| NAME        | DESCRIPTION | PEER          | STATE   |
+-------------+-------------+---------------+---------+
| mypeer-ovn2 |             | project2/ovn2 | CREATED |
+-------------+-------------+---------------+---------+

ACL considerations

It is hoped that OVN will eventually allow us to identify traffic going to/from a peer router port and reference that in ACL rules using the peer name. As such we will ensure that peer names are usable in ACL rules when prefixed with the special @ character (that already cannot be used in ACL names) to indicate a specific network port subject.

Peer names will follow the same naming restrictions as ACLs:

Be between 1 and 63 characters long
Be made up exclusively of letters, numbers and dashes from the ASCII table
Not start with a digit or a dash
Not end with a dash

As well as:

Must not be “internal” or “external” - this is so they won’t conflict with the reserved @internal and @external subjects.
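The naming rules listed above can be expressed as a small validator. This is a sketch only: `validatePeerName` and `peerNameRe` are hypothetical names, not the function LXD uses, and the reserved-name check is assumed to be an exact, case-sensitive match.

```go
package main

import (
	"fmt"
	"regexp"
)

// peerNameRe matches names that start with an ASCII letter, continue with
// ASCII letters, digits or dashes, and do not end with a dash.
var peerNameRe = regexp.MustCompile(`^[a-zA-Z]([a-zA-Z0-9-]*[a-zA-Z0-9])?$`)

// validatePeerName applies the peer naming rules and returns nil if name is acceptable.
func validatePeerName(name string) error {
	if len(name) < 1 || len(name) > 63 {
		return fmt.Errorf("name must be between 1 and 63 characters long")
	}

	// Reserved so that peer names cannot clash with the @internal and @external subjects.
	if name == "internal" || name == "external" {
		return fmt.Errorf("name %q is reserved", name)
	}

	if !peerNameRe.MatchString(name) {
		return fmt.Errorf("name %q must use only ASCII letters, digits and dashes, start with a letter and not end with a dash", name)
	}
	return nil
}

func main() {
	for _, name := range []string{"mypeer-ovn2", "2fast", "-peer", "peer-", "internal"} {
		if err := validatePeerName(name); err != nil {
			fmt.Printf("%-12s invalid: %v\n", name, err)
			continue
		}
		fmt.Printf("%-12s ok\n", name)
	}
}
```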

Currently the ACL will classify traffic on the peer connection as @external as it does with traffic going to/from the uplink network.

Although OVN itself doesn’t support identifying traffic from the peer connection as a different specific port on the internal LAN (it all appears to come from the router’s port connected to the LAN), as the peer connection has a specific set of target prefixes associated with it, we could potentially create an ACL address set containing those prefixes. We would then need to ensure that traffic from those prefixes not coming from the peer connection was dropped and any traffic coming from an address outside of that address set through the peer connection was also dropped. At that point we could be confident that any packets matching a source address in the address set for the peer connection could only have come from the peer connection itself.

This has been tested to work using router policies in OVN.

E.g. This allows packets from lxd-net2-lr’s subnet 10.105.164.0/24 arriving at lxd-net1-lr’s peer router port, and drops all other traffic arriving at the port.

ovn-nbctl lr-policy-add lxd-net1-lr 100 "ip4.src == 10.105.164.0/24 && inport == \"lxd-net1-lr-lrp-net-2\"" allow
ovn-nbctl lr-policy-add lxd-net1-lr 99 "inport == \"lxd-net1-lr-lrp-net-2\"" drop

This provides the foundations for ensuring that packets arriving at a virtual peer router port match the prefixes expected for the peer, and equally for ensuring that packets arriving at the external virtual router port (connected to the uplink network) do not come from prefixes that are expected only over the peer connection. In this way a named ACL address set that references the peer connection name would be able to reliably enforce policies between networks.
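When a peer has several prefixes, the allow/drop pair shown above extends to one allow policy per prefix followed by a single lower-priority drop. The Go sketch below builds those command strings; `peerPortPolicies` is a hypothetical helper, the priorities 100 and 99 are simply copied from the example, and the use of `ip6.src` for IPv6 prefixes is an assumption that goes beyond what was tested above.

```go
package main

import (
	"fmt"
	"net/netip"
)

// peerPortPolicies returns ovn-nbctl commands that allow the given source
// prefixes on a peer router port and drop everything else arriving there.
func peerPortPolicies(router, port string, prefixes []string) ([]string, error) {
	cmds := make([]string, 0, len(prefixes)+1)
	for _, s := range prefixes {
		p, err := netip.ParsePrefix(s)
		if err != nil {
			return nil, fmt.Errorf("invalid prefix %q: %w", s, err)
		}

		// Pick the match field for the prefix's address family.
		field := "ip4.src"
		if p.Addr().Is6() {
			field = "ip6.src"
		}

		match := fmt.Sprintf(`%s == %s && inport == \"%s\"`, field, p.Masked(), port)
		cmds = append(cmds, fmt.Sprintf(`ovn-nbctl lr-policy-add %s 100 "%s" allow`, router, match))
	}

	// Lower priority catch-all for anything else arriving on the peer port.
	cmds = append(cmds, fmt.Sprintf(`ovn-nbctl lr-policy-add %s 99 "inport == \"%s\"" drop`, router, port))
	return cmds, nil
}

func main() {
	cmds, err := peerPortPolicies("lxd-net1-lr", "lxd-net1-lr-lrp-net-2",
		[]string{"10.105.164.0/24", "fd42:5389:62b9:be7c::/64"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range cmds {
		fmt.Println(c)
	}
}
```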

API changes

For the network peers feature a new API extension will be added called network_peer with the following API endpoints and structures added:

Create and edit a network peer

POST /1.0/networks/<network>/peers
PUT /1.0/networks/<network>/peers/<name>

Using the following new API structures respectively:

type NetworkPeersPost struct {
	NetworkPeerPut `yaml:",inline"`

	// Name of the peer
	// Example: project1-network1
	Name string `json:"name" yaml:"name"`

	// Name of the target project
	// Example: project1
	TargetProject string `json:"target_project" yaml:"target_project"`

	// Name of the target network
	// Example: network1
	TargetNetwork string `json:"target_network" yaml:"target_network"`
}

type NetworkPeerPut struct {
	// Description of the peer
	// Example: Peering with network1 in project1
	Description string `json:"description" yaml:"description"`

	// Peer configuration map (refer to doc/network-peers.md)
	// Example: {"user.mykey": "foo"}
	Config map[string]string `json:"config" yaml:"config"`
}

Delete a network peer

DELETE /1.0/networks/<network>/peers/<name>

List network peers

GET /1.0/networks/<network>/peers
GET /1.0/networks/<network>/peers/<name>

Returns a list or single record (respectively) of this new NetworkPeer structure:

type NetworkPeer struct {
	NetworkPeerPut `yaml:",inline"`

	// Name of the peer
	// Read only: true
	// Example: project1-network1
	Name string `json:"name" yaml:"name"`

	// Name of the target project
	// Read only: true
	// Example: project1
	TargetProject string `json:"target_project" yaml:"target_project"`

	// Name of the target network
	// Read only: true
	// Example: network1
	TargetNetwork string `json:"target_network" yaml:"target_network"`

	// The state of the peering
	// Read only: true
	// Example: Pending
	Status string `json:"status" yaml:"status"`
}

CLI changes

There will be a new sub-command added to the lxc network command called peer.

E.g.

For managing peer relationships:

lxc network peer ls <network>
lxc network peer create <network> <peer name> <[target project/]target_network>
lxc network peer show <network> <peer name>
lxc network peer edit <network> <peer name>
lxc network peer set <network> <peer name> <key>=<value>...
lxc network peer unset <network> <peer name> <key>
lxc network peer get <network> <peer name> <key>
lxc network peer delete <network> <peer name>

Database changes

There will be two new tables added called networks_peers and networks_peers_config.

CREATE TABLE "networks_peers" (
	id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
	network_id INTEGER NOT NULL,
	name TEXT NOT NULL,
	description TEXT NOT NULL,
	target_network_project TEXT NULL,
	target_network_name TEXT NULL,
	target_network_id INTEGER NULL,
	UNIQUE (network_id, name),
	UNIQUE (network_id, target_network_project, target_network_name),
	UNIQUE (network_id, target_network_id),
	FOREIGN KEY (network_id) REFERENCES "networks" (id) ON DELETE CASCADE
);

CREATE TABLE "networks_peers_config" (
	id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
	network_peer_id INTEGER NOT NULL,
	key VARCHAR(255) NOT NULL,
	value TEXT,
	UNIQUE (network_peer_id, key),
	FOREIGN KEY (network_peer_id) REFERENCES "networks_peers" (id) ON DELETE CASCADE
);

Upgrade handling

As these are new features, no upgrade handling is required.

Further information

The target_network_project and target_network_name fields in the networks_peers table are used only during the initial peering process. Once both sides have mutually agreed the connection, the target_network_id field will be populated on both sides with the ID of the respective peered network, and the target_network_project and target_network_name fields will be cleared. This is so that, if we later add the ability to rename OVN networks, the peerings will reflect the updated name (by looking it up via ID).

The state of the network peering will be derived from the value of target_network_id. If it is <=0 then the state is “Pending”, if it is >0 then it is “Created”.

A network peering will be considered as a “user” of the peered network, and so will prevent network deletion until the peering is deleted.

tomp · September 15, 2021, 4:14pm

Looking at other cloud providers today it seems they all have the concept of a network routing table, and require that after setting up the peer relationship that the routing tables on both sides of the relationship are updated to specify exactly which prefixes are routed between them.

This makes me wonder if perhaps we need to add a concept of network routes (e.g. lxc network route <network>) and have the admin of the OVN network configure them after setting up the peer relationship.

This would allow them to specify any far side NIC route prefixes (with the option of only using some/none of them as needed). It would also mean if peering with multiple networks, and one of them happened to have a NIC route prefix that conflicted with another peered network’s primary subnet, the admin could then select exactly which routes they want to go over which peering link.

This feels safer and more flexible than automatically exporting all NIC level prefixes to all peered networks, as that would allow the far side to arbitrarily add routes to the peered networks, which seems like asking for trouble.

tomp · September 15, 2021, 4:14pm

@stgraber further to my research regarding mutual peering with other cloud providers today, do you think it would be safe enough to have the lxc network peer create <network> <peer name> <[target project/]target_network> command return an error if the combination of project and network name was not found (but not state to the user which aspect was not found)?

This way you’d have to correctly guess both the project name and network name in order to enumerate the projects or networks.

Otherwise we probably need something like a peering UUID that can be generated, exchanged and then used to setup/validate the peering relationship.

tomp · September 16, 2021, 5:05pm

@stgraber ready for review - main question points are:

Is having to specify project name and network name for the peer connection on both sides sufficient protection from cross-project enumeration?
Is the proposed route prefixes approach OK? It is similar to what the other cloud providers do and enables more fine-grained control over which prefixes are acceptable from the peer, which prevents the peer from injecting other prefixes in the future.

stgraber · September 16, 2021, 9:45pm

Feels to me like restricting the exact subnets being imported/exported is something we could do through config keys or as an extra attribute of a peering later on with the default being to expose both sides as they are.

We could in theory also extend this to allow for subnet mapping (NAT) as it’s also something cloud providers often support, as much as I may hate it…

stgraber · September 16, 2021, 9:49pm

For the lxc network peer add case, I thought we discussed effectively making:

lxc network peer add lxdbr0 other-net blah/other-net

To always succeed but result in an entry in lxc network peer list lxdbr0 with a state of PENDING until such time as the other end has similarly added the peer.

This shouldn’t allow for any information leakage (neither side can tell what exists on the other) and should be pretty straightforward to implement.

Was there a problem with this approach?

stgraber · September 16, 2021, 9:53pm

tomp:

Route tables (avoiding asymmetric routing for NIC routes)

Because LXD’s OVN implementation supports routing additional prefixes to ovn NICs by specifying ipv{n}.routes and/or ipv{n}.routes.external this could then result in Instance NICs being configured to create packets destined for the peer network but using a source address outside of the source network’s primary subnet. This will lead to asymmetric routing (where the return packet leaves the OVN subsystem) and cause unexpected behaviour when using stateful ACLs or external firewalls.

To avoid this we will need to make available the ability to specify which route prefixes should be added to the source network’s routing table pointing toward the target peer network. This will allow the administrator of the source network to specify exactly which prefixes they want to be reachable over the peer connection.

This will also ensure that if the target network later adds a NIC level route that conflicts with addressing inside the source network that these routes are not automatically exported to the source network which would cause network disruption.

It also allows for the ability to create peer connections to multiple networks that may contain some prefixes that conflict with each other but not the source network. In this way the source network can select which prefix is reachable over which peer connection, rather than potentially importing a set of conflicting prefixes from the multiple peer networks.

I think it’d be better to treat that as out of scope for now, with the initial implementation routing everything available on the target.

We have enough flexibility with the API and config here to be able to add such restrictions on top of it, effectively then restricting the peering to a specific set of subnets with each side having control on what they’d want to export and import.

tomp · September 16, 2021, 9:55pm

The only thing I was concerned about was that a typo in either the project or network name could make it difficult to identify why the peering wasn’t working. But apart from that it’s fine, we can just store the requested details in a field as we discussed previously.

stgraber · September 16, 2021, 9:55pm

Should mention that internal and external will not be allowed, to avoid conflicting with built-in names.

tomp · September 16, 2021, 10:00pm

Yeah so the ‘@’ prefix is currently used for the two special reserved names internal and external. The port groups belonging to members of an acl are just referenced directly without an ‘@’ so we can use the ‘@’ to indicate a particular ‘peer’ connection, with those two words being reserved.

This will then mean ACLs themselves can potentially use the same names as a peer connection.

stgraber · September 16, 2021, 10:00pm

Ok. I’m not very keen on allowing information leakage by allowing this one thing to go look at a project which they don’t have access to. Using some kind of UUID or the like as a token also would still have the same problem. To prevent potential brute-force, we’d need them completely random and would need to not disclose validity.

So it feels like this is a case for good documentation, basically having the network peering doc prominently mention that if after adding on both sides, the peering is still listed as PENDING to go take a very close look at the name of both project and network for any typo that may have been made.

tomp · September 16, 2021, 10:01pm

OK, makes sense.

stgraber · September 16, 2021, 10:03pm

Right, that’s what I was thinking. We’d use @NAME for traffic source/destination which isn’t coming from a network ACL on the current network.

Assuming enough OVN plumbing, I can see us adding @my-peer/its-acl though, so we can identify traffic coming from or heading towards a specific set of instances based on ACL. But we’re currently a long way from having what we need in OVN for that.

tomp · September 17, 2021, 10:36am

@stgraber I’ve made the changes we discussed now.

tomp · September 17, 2021, 12:11pm

@stgraber I’ve removed node_id column from networks_peers as peerings are not node specific.

stgraber · September 17, 2021, 9:09pm

Need lxc network peer show

stgraber · September 17, 2021, 9:14pm

Marked as approved

tomp · September 24, 2021, 10:59am

tomp:

CREATE TABLE "networks_peers" (
	id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
	network_id INTEGER NOT NULL,
	name TEXT NOT NULL,
	description TEXT NOT NULL,
	target_network_project TEXT NULL,
	target_network_name TEXT NULL,
	target_network_id INTEGER NULL,
	UNIQUE (network_id, name),
	UNIQUE (network_id, target_network_project, target_network_name),
	UNIQUE (network_id, target_network_id),
	FOREIGN KEY (network_id) REFERENCES "networks" (id) ON DELETE CASCADE
);

CREATE TABLE "networks_peers_config" (
	id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
	network_peer_id INTEGER NOT NULL,
	key VARCHAR(255) NOT NULL,
	value TEXT,
	UNIQUE (network_peer_id, key),
	FOREIGN KEY (network_peer_id) REFERENCES "networks_peers" (id) ON DELETE CASCADE
);

I’ve made some tweaks to the schema to prevent duplicate peerings (either pending or created) from one network to the same target network.