I am using LXD 5.11 with a 3-node cluster created via microcloud. The network is bridged and connected via multicast VXLAN, and the route is announced to the router via BGP. In my example I use VXLAN to simplify the configuration, but the nodes could also have bridges connected to the same physical switch and the effect would be the same.
lxc network show vxlan

config:
  bgp.peers.dom4.address: 10.0.1.1
  bgp.peers.dom4.asn: "65000"
  bgp.peers.dom6.address: 2001:0db8:1001::1
  bgp.peers.dom6.asn: "65000"
  bridge.mtu: "1300"
  ipv4.address: 10.8.8.1/24
  ipv4.dhcp: "true"
  ipv4.nat: "false"
  ipv4.routing: "true"
  ipv6.address: 2001:0db8:1088::1/64
  ipv6.nat: "false"
  tunnel.multicast.group: 18.104.22.168
  tunnel.multicast.id: "240"
  tunnel.multicast.protocol: vxlan
description: ""
name: vxlan
type: bridge
Because bridged networks on a cluster have the same MAC and IP addresses on every node, we get a redundant anycast gateway. The only problem is that ARP and Neighbour Discovery are broken, precisely because the IP and MAC are the same and there is no mechanism in place that is designed to work like that, for example vrrpd or keepalived.
The problem looks like this:
- the router (10.0.1.1) tries to ping test2 (10.8.8.159)
- the router checks its routing table and finds the route to 10.8.8.0/24 via cl1 (10.0.1.29), learned from BGP
- cl1 receives the packet; the destination IP is on a directly connected bridge
- the bridge on cl1 sends a broadcast ARP request to learn test2's MAC address
- the broadcast frame is forwarded to the bridge on cl2 and then to test2
- test2 answers with a unicast ARP reply
- the bridge on cl2 receives the ARP reply and doesn't forward it, because its destination MAC address matches the bridge's own
- the IP neighbour table on cl2 is updated
- the bridge on cl1 never receives the ARP reply, so it can't forward the ping
- cl1 generates an ICMP (host unreachable) message to the router
A very similar situation occurs for IPv6, but with ND instead of ARP.
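To make the sequence above easier to follow, here is a toy model of it in Python. It is only a sketch of the behaviour I described, not of how the Linux bridge is implemented. The `Node` class, the `resolve` function and all MAC addresses in it are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster member with its bridge MAC and the instances it hosts."""
    name: str
    bridge_mac: str
    local_hosts: dict  # instance IP -> instance MAC, for instances on this node
    neigh: dict = field(default_factory=dict)  # this node's neighbour table


def resolve(asker, nodes, target_ip):
    """Simulate an ARP request sent by `asker` for `target_ip`.

    Returns the name of the node whose neighbour table learned the
    answer, or None if no instance owns the address.
    """
    # The broadcast request is flooded to every node on the segment.
    for node in nodes:
        mac = node.local_hosts.get(target_ip)
        if mac is None:
            continue
        # The instance replies with a unicast frame addressed to the
        # asker's bridge MAC. The frame first hits the bridge on the
        # node hosting the instance.
        if node.bridge_mac == asker.bridge_mac:
            # Destination MAC is this bridge's own MAC, so the frame is
            # consumed locally and never forwarded to the asker.
            node.neigh[target_ip] = mac
            return node.name
        # Different MAC: the frame is forwarded and the asker learns it.
        asker.neigh[target_ip] = mac
        return asker.name
    return None


if __name__ == "__main__":
    shared = "00:16:3e:00:00:01"  # hypothetical shared gateway MAC
    cl1 = Node("cl1", shared, {})
    cl2 = Node("cl2", shared, {"10.8.8.159": "00:16:3e:00:00:9f"})
    print(resolve(cl1, [cl1, cl2], "10.8.8.159"))
    print(cl1.neigh)
```

With the same MAC on both nodes the reply ends up in the neighbour table of cl2, and cl1 stays empty. Giving cl1 a different `bridge_mac` makes the reply reach cl1 instead.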
To work around the issue, you have to manually add static entries to the IP neighbour table on the cluster nodes:
ip nei replace to 10.8.8.213 lladdr 00:16:3e:6f:e6:74 dev vxlan
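Adding these entries by hand does not scale, so here is a rough sketch of how they could be generated. It assumes the input is the JSON printed by `lxc list --format json`, and that each instance reports its NICs under `state.network` with an `hwaddr` and a list of `addresses`. Treat that layout as an assumption and check it against your own output. The function name `neigh_commands` is my own, and the script only prints commands rather than running them.

```python
import ipaddress
import json


def neigh_commands(instances_json, bridge, subnets):
    """Build `ip neigh replace` commands for instances inside `subnets`.

    instances_json: JSON text, assumed to look like `lxc list --format json`
    bridge:         name of the bridge device on the cluster node
    subnets:        list of CIDR strings served by that bridge
    """
    nets = [ipaddress.ip_network(s) for s in subnets]
    cmds = []
    for inst in json.loads(instances_json):
        # Stopped instances may have no state or no network section.
        nics = (inst.get("state") or {}).get("network") or {}
        for nic in nics.values():
            mac = nic.get("hwaddr")
            if not mac:
                continue  # e.g. the loopback interface
            for addr in nic.get("addresses") or []:
                ip = ipaddress.ip_address(addr["address"])
                # Keep only addresses that belong to the bridge subnets;
                # this also skips link-local and loopback addresses.
                if any(ip in net for net in nets):
                    cmds.append(
                        f"ip neigh replace {ip} lladdr {mac} "
                        f"dev {bridge} nud permanent"
                    )
    return cmds


if __name__ == "__main__":
    import sys

    # Usage idea: lxc list --format json | python3 this_script.py
    for cmd in neigh_commands(
        sys.stdin.read(), "vxlan", ["10.8.8.0/24", "2001:db8:1088::/64"]
    ):
        print(cmd)
```

The printed commands would have to be run as root on every cluster member, and re-run whenever instances are created or removed, so this is still only a workaround.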
Another option is broadcasting gratuitous ARPs from the containers every few seconds, but this is obviously a bad idea.
Is there a more elegant solution to achieve layer 2 connectivity without doing NAT, without losing the ability to manage DHCP and DNS records in LXD, and while keeping migration of containers/VMs possible without changing their IP addresses? I mean, both the IP and MAC addresses are already in the shared LXD database, so why not feed them directly into the ARP cache of the cluster members?
On the other hand, could you explain what the rationale was for forcing the same IP and MAC on all cluster nodes…