I am using LXD 5.11 with a 3-node cluster created via
microcloud. The network is bridged, connected via multicast VXLAN, and the route is announced to the router via BGP. In my example I use VXLAN to simplify configuration, but the nodes could also have bridges connected to the same physical switch and the effect would be the same.
lxc network show vxlan
Thanks to the fact that bridged networks on a cluster have the same MAC and IP addresses, we get a redundant anycast gateway solution. The only problem is that ARP and Neighbour Discovery are broken, precisely because the IP and MAC are the same everywhere and no existing solution (for example vrrpd or keepalived) is designed to work that way.
The problem is as below:
- router (10.0.1.1) tries to ping test2 (10.8.8.159)
- router checks its route table; the route to 10.8.8.0/24 via cl1 (10.0.1.29) was learned from BGP
- cl1 receives the packet; the destination IP is on a directly connected bridge
- the bridge on cl1 sends a broadcast ARP request to learn test2's MAC address
- the broadcast frame is forwarded to the bridge on cl2 and then to test2
- test2 answers with a unicast ARP reply
- the bridge on cl2 receives the ARP reply and doesn't forward it, because the destination MAC matches its own
- the ip neighbour table on cl2 is updated
- the bridge on cl1 never receives the ARP reply, so it can't forward the ping
- cl1 generates an ICMP host unreachable message to the router
A very similar situation occurs for IPv6, but using ND instead of ARP.
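The drop can be confirmed with tcpdump; this is just a diagnostic sketch, with the interface name "vxlan" and the addresses taken from the example above:

```shell
# Run on both cl1 and cl2. On cl1 you should see the ARP request go out
# but no reply come back; on cl2 the reply arrives and is consumed by the
# bridge instead of being flooded back towards cl1.
tcpdump -eni vxlan arp and host 10.8.8.159

# Afterwards, cl2 has a neighbour entry for test2 while cl1 does not:
ip neigh show dev vxlan
```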
To work around the issue you have to manually add static ip neighbour entries on the cluster nodes:
ip neigh replace 10.8.8.213 lladdr 00:16:3e:6f:e6:74 dev vxlan
Or broadcast gratuitous ARPs from the containers every few seconds, but this is obviously a bad idea.
Is there any more elegant solution to achieve layer 2 connectivity without doing NAT, without losing the ability to manage DHCP and DNS records in LXD, and while keeping migration of containers/VMs possible without changing IP addresses? I mean, both the IP and MAC addresses are already in the shared LXD database, so why not feed them directly into the cluster members' ARP caches?
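A rough sketch of that last idea, assuming jq is available and the bridge is named "vxlan": read the MAC/IP pairs LXD already knows from `lxc list --format json` and emit the corresponding neighbour entries (run on each cluster member; review the output, then pipe it to `sh` as root):

```shell
# Generate `ip neigh replace` commands from the LXD database,
# skipping link-local and loopback addresses.
lxc list --format json \
  | jq -r '.[].state.network // {} | .[]
           | .hwaddr as $mac
           | .addresses[]
           | select(.scope == "global")
           | "ip neigh replace \(.address) lladdr \($mac) dev vxlan"'
```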
On the other hand, could you explain the rationale for forcing the same IP and MAC on all cluster nodes?
After manually inserting entries with
ip neighbour, there still remains the issue of communication between a cluster node and containers on other cluster nodes. I will try some policy routing to mitigate this and come back soon with my findings.
Have you considered using OVN networks for this? I think you may also have problems with DHCP otherwise, as it will potentially give out the same address to multiple instances.
In this setup you would define an uplink network that uses a physical connection to the shared L2 between cluster members and the router (10.0.1.0/24 in this case).
Then the OVN layer would create a virtual distributed router across all of the cluster members that uses a single IP on the uplink network. It will then automatically manage which physical LXD cluster member the uplink gateway port is active on (and fail over if one of them goes down).
Then the virtual OVN network would have the subnet 10.8.8.0/24, and the internal virtual OVN router would have the IP 10.8.8.1. It would provide distributed DHCP and DNS services to your instances.
Then LXD, using its BGP features, could announce the OVN network’s 10.8.8.0/24 subnet to your router with a next-hop address of the OVN’s virtual router address on the uplink port. Ingress traffic would then flow towards whichever LXD cluster member the OVN gateway port was active on.
For intra-network traffic, OVN will set up and manage Geneve tunnels between the LXD cluster members.
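For reference, a minimal sketch of that layout under assumed names (uplink network "UPLINK" on parent interface "eth1", cluster members cl1..cl3, OVN network "ovn0") might look like:

```shell
# Per-member step: bind the uplink to the physical interface on the shared L2.
lxc network create UPLINK --type=physical parent=eth1 --target=cl1
lxc network create UPLINK --type=physical parent=eth1 --target=cl2
lxc network create UPLINK --type=physical parent=eth1 --target=cl3

# Cluster-wide step: the gateway on the 10.0.1.0/24 L2, plus a small range
# of addresses that OVN may use for its virtual routers on the uplink.
lxc network create UPLINK --type=physical \
    ipv4.gateway=10.0.1.1/24 \
    ipv4.ovn.ranges=10.0.1.100-10.0.1.110

# The OVN network itself: 10.8.8.0/24 with an internal router at 10.8.8.1,
# NAT disabled so the subnet is routed rather than masqueraded.
lxc network create ovn0 --type=ovn network=UPLINK \
    ipv4.address=10.8.8.1/24 ipv4.nat=false
```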
DHCP shouldn’t be a problem. A DHCP server should perform conflict detection before it assigns an address to a client, but with the above-mentioned ARP problems that would also not work, unless it consulted the LXD database that holds the MAC and IP address information. I guess it does, because it is possible to add static leases. I am not sure how your implementation of DHCP works under the hood.
I considered OVN and was actively trying to avoid using it, as every tutorial I found on the web was using NAT, even the video above.
In the LXD documentation it is also stated that
"A high availability OVN cluster requires a shared layer 2 network", and not all my cluster members have a shared L2 network. I am using BGP in my example because creating a multihomed shared network, without NAT, for VMs/containers on cluster nodes located on different network segments was actually the goal of my setup.
Also, I have zero experience with OVN and I am not sure whether it will be suitable for my goal.
If you don’t mind, I have a few questions about OVN:
- Can I create an OVN cluster without a shared L2 and use router L3 instead?
- Does multicast work inside OVN networks?
- Can OVN work without NAT?
- Can OVN be connected to a VXLAN setup that is also used outside of LXD?
Hopefully OVN would be a solution for that.
I am very happy that LXD is growing into such a wide spectrum of solutions. Though it would be amazing if creating a cluster allowed us to set which parts of the cluster are shared and which are not, or even if cluster nodes allowed us to create one network that is shared and another that is node specific, something similar to the projects feature and the size property on cluster storage pools. Anyway, thank you for considering my problem; I understand that it may be a little niche.
Can I create an OVN cluster without a shared L2 and use router L3 instead?
OVN needs a shared L2 for the uplink network so that it can migrate each virtual network’s router external IP to a different chassis if one cluster member goes down, as OVN will select a single chassis (per OVN network) to use for ingress/egress traffic to/from the uplink.
Does multicast work inside OVN networks?
I’m not sure. I believe broadcast does, so it may.
Can OVN work without nat?
Absolutely. There are various options for using routed IPs inside an OVN network.
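As an illustration (the network and instance names here are assumed, not from the thread), NAT can be disabled on the OVN network, and additional external subnets can be routed to a specific instance NIC:

```shell
# Route the OVN subnet itself instead of masquerading it:
lxc network set ovn0 ipv4.nat=false

# Optionally route an extra external subnet directly to one instance's NIC:
lxc config device override test2 eth0 ipv4.routes.external=10.9.9.0/28
```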
Can OVN be connected to a VXLAN setup that is also used outside of LXD?
That depends on the purpose. But the uplink network can be any unused physical interface.