VLANs evaporating on container start

I have a new network model working on VLANs that seems to work well. Then I noticed Incus can also manage VLANs, so naturally I thought this would be better than hand-configuring my network setup in systemd-networkd (!)

I’m clearly doing something wrong here: as soon as I start a container attached to my new managed VLAN, the VLAN seems to disappear!

# incus network create dhcp --type=physical parent=eth1 --target=worf
# incus network create dhcp --type=physical parent=eth1 --target=p400
# incus network create dhcp --type=physical vlan=100
Network dhcp created
# ifconfig eth1.100
eth1.100: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::6e6e:7ff:fe16:a598  prefixlen 64  scopeid 0x20<link>
        ether 6c:6e:07:16:a5:98  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 19  bytes 3241 (3.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

All good, eth1.100 also shows up in the UI.
Now attach a container to the network.

# incus network attach dhcp demo
root@worf:~# ifconfig eth1.100
eth1.100: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::6e6e:7ff:fe16:a598  prefixlen 64  scopeid 0x20<link>
        ether 6c:6e:07:16:a5:98  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 19  bytes 3241 (3.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Looks OK, so now start the container …

# incus start demo
# ifconfig eth1.100
eth1.100: error fetching interface information: Device not found

What just happened?!
I’m seeing this in the logs:

Jun 21 11:32:26 worf systemd-networkd[334]: eth1.100: Link DOWN
Jun 21 11:32:26 worf avahi-daemon[776]: Interface eth1.100.IPv6 no longer relevant for mDNS.
Jun 21 11:32:26 worf systemd-networkd[334]: eth1.100: Lost carrier
Jun 21 11:32:26 worf avahi-daemon[776]: Leaving mDNS multicast group on interface eth1.100.IPv6 with address fe80::6e6e:7ff:fe16:a598.
Jun 21 11:32:26 worf avahi-daemon[776]: Withdrawing address record for fe80::6e6e:7ff:fe16:a598 on eth1.100.
Jun 21 11:32:26 worf avahi-daemon[776]: Withdrawing workstation service for eth1.100.
Jun 21 11:32:26 worf kernel: physwIhXug: renamed from eth1.100
Jun 21 11:32:26 worf kernel: eth0: renamed from physwIhXug

which doesn’t make a lot of sense to me … can anyone tell me what’s going on here and, even better, how the network still seems to be working without the VLAN present?

Interestingly, if I create the network with type “macvlan” as opposed to “physical” (which I’m guessing is the correct approach) it seems to behave as expected … so there would seem to be two problems: (1) user error, (2) allowing the user to create a physical device with a VLAN id …

Notably the new VLAN doesn’t actually work, but that will be another story …

Physical network devices (the equivalent of nictype=physical) are passed to the container on startup. That process makes them disappear from the host and limits them to a single container, as that container then owns the device until it stops.
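For illustration, this is what that passthrough looks like at the instance-device level (instance and device names here are made up):

```shell
# Pass the host's eth1.100 directly into the "demo" container.
# While demo is running, eth1.100 vanishes from the host's
# interface list, exactly as in the logs above.
incus config device add demo eth0 nic nictype=physical parent=eth1.100
```

On stop, Incus hands the device back to the host, which is why the interface reappears once the container shuts down.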

Ahhh, ok, tvm. Not really what I was looking for then in terms of shared networking … :slight_smile:

If you want to share the same NIC between multiple containers, then you’ll want a bridge. If you create a VLAN-aware bridge then you can connect any VLAN to any container, and trunk any subset of VLANs to any container.
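As a rough sketch of the systemd-networkd side of a VLAN-aware bridge (interface names and VLAN IDs are illustrative):

```ini
# br0.netdev -- create a bridge with VLAN filtering enabled
[NetDev]
Name=br0
Kind=bridge

[Bridge]
VLANFiltering=yes
```

```ini
# eth1.network -- enslave the physical NIC and allow VLAN 100 on it
[Match]
Name=eth1

[Network]
Bridge=br0

[BridgeVLAN]
VLAN=100
```

Instances would then attach with a bridged NIC on br0, optionally with the vlan (or vlan.tagged) property to pick which VLANs each container sees.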

I posted a sample config from my system recently here.

Not necessarily - MACVLAN works as well. That is the setup I have here: A number of VLANs terminated on the Incus host (or more precisely hosts, since I run a cluster) in unconfigured devices (i.e. the host has a device for the VLAN, but does not assign it any IP of its own), and those devices are passed to the instances (both containers and VMs) as MACVLAN.

The effect is much the same as a bridge, but without needing to configure a bridge interface on the host.
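Under that model, with a pre-created (and unconfigured) eth1.100 on the host, the per-instance device would look roughly like this (names illustrative):

```shell
# Give "demo" a MACVLAN child of the host's eth1.100. The host keeps
# the interface, and several instances can share the same parent.
incus config device add demo eth0 nic nictype=macvlan parent=eth1.100
```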

True. I never really saw the point of macvlan though, unless your workloads are untrusted and you don’t want them to see each other.

The workloads (within each VLAN) can see each other - they just need a roundtrip to the router…

The setup is mostly historical, though - each of those VLANs was routed to dedicated hardware before I consolidated all of that on a single Incus cluster (or more precisely at that time: LXD - that was years before the Incus fork) and it was the easiest way to keep the existing network structure and especially all of the network rules already consolidated on the (existing) routers.

The next time I modernize that setup, I’ll probably go with OVN.

The workloads (within each VLAN) can see each other - they just need a roundtrip to the router…

I think roundtrip-to-the-router would only work if you assigned a separate VLAN (and hence a separate subnet) to each container. Although MACVLAN apparently has 5 different modes of operation, one of which is bridging, and one of which supports inter-container layer 2 connectivity via “hairpinning” on an upstream switch. All very complicated.

I found Stéphane’s summary here: Macvlan vs network bridge - #2 by stgraber

It seems MACVLAN is faster, but relies on hardware support from the NIC, so the number of MACVLAN interfaces you can create is limited by your hardware.

The next time I modernize that setup, I’ll probably go with OVN.

That’s what I thought ~ 5 months ago. Just be aware that it’s not always an upgrade and almost certainly not the path for everyone.

It would be great to use Incus-managed VLANs, but they seem (for some reason) to be limited in how they work, and I’ve not found a way to make them behave in a manner that would facilitate what I want to do.

I’ve reverted to using host VLANs on host bridges. So for each network I have a dedicated VLAN on the host, then a dedicated local bridge on the host, and the Incus network treats the bridge as a physical device. For example:

eth0.10 -> br-private-net (node-local bridge) -> private-net (Incus network)

A little convoluted, but it works with systemd-networkd and seems stable and performant.
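A minimal systemd-networkd sketch of that chain (file and interface names are illustrative; the parent eth0.network also needs a VLAN=eth0.10 line to instantiate the VLAN):

```ini
# eth0-10.netdev -- the host VLAN
[NetDev]
Name=eth0.10
Kind=vlan

[VLAN]
Id=10
```

```ini
# br-private-net.netdev -- the node-local bridge
[NetDev]
Name=br-private-net
Kind=bridge
```

```ini
# eth0-10.network -- plug the VLAN into the bridge
[Match]
Name=eth0.10

[Network]
Bridge=br-private-net
```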

This gets all my nodes talking on the “private-net” network. I then have a container called “dnsmasq” which straddles all VLANs, with a very simple config that gives out a different /16 depending on the interface the request comes in on. Set the lease expiry to 1w and I no longer need to pin IPs. To my cost I have discovered that, with or without OVN, pinning IPs (overriding interfaces) can be problematic and from what I can see should be avoided.
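The dnsmasq side can stay very small because dnsmasq automatically hands out the dhcp-range whose subnet matches the address on the interface a request arrives on. A sketch with made-up interface names and ranges:

```conf
# listen on one interface per VLAN
interface=eth0
interface=eth1

# each range is matched to the interface holding an address in it
dhcp-range=10.10.0.10,10.10.255.250,255.255.0.0,1w
dhcp-range=10.20.0.10,10.20.255.250,255.255.0.0,1w
```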

I’m also running some iptables on the host which forward all default route traffic from containers to the host’s outgoing “gateway”, which means all containers on all machines can point at the same default route, but actual traffic leaving the cluster will always go through the local default gateway. (which means you don’t have to worry about dhcp tagging to set the default route to the local host)
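One way to express that per-host rule (a sketch, assuming the containers live somewhere in 10.0.0.0/8) is a masquerade so anything leaving the overlay exits via the local host’s own default route:

```shell
# NAT container traffic bound for anywhere outside 10.0.0.0/8
# out through this host's default gateway
iptables -t nat -A POSTROUTING -s 10.0.0.0/8 ! -d 10.0.0.0/8 -j MASQUERADE
```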

Not being able to manage this all from within Incus is a shame, but it’s relatively straight-forward and manageable for moderate numbers of networks. If you regularly create and delete networks on the other hand it might be a bit laborious.

It’ll be a completely new installation anyway since I want to change quite a bit of stuff on both the underlying physical network topology and how I set up the physical servers. Mostly go from single switches/routers with cold standby in case of failure (and the attendant downtime) to a load balanced redundant setup for the network hardware (where a failure of a single component “only” loses bandwidth, but not functionality), and go from a “traditional” host OS installation on a local disk to an “appliance-style” image based setup (ideally, I’ll have the complete host OS in a single Linux UKI image which I can either drop onto an EFI partition or boot over the administrative network, and verify with standard UEFI tools in both cases).

Using OVN for managing the virtual network that runs on top of that setup is not that much of a stretch…

In my case, that would be mode 2 (VEPA), but with a router on the other end instead of a switch (well, physically there is a switch involved, but it’s built into the router hardware and configured to act as a bunch of individual ports). And it’s not relevant anyway because everything within each VLAN actually may talk freely, but that is for the router to decide in my existing setup and not for the Incus host …

Mostly go from single switches/routers with cold standby in case of failure (and the attendant downtime) to a load balanced redundant setup for the network hardware

Mmm, I’d be interested to hear how it goes … my experience has been that when you cluster with OVN you increase the inter-dependency of nodes on each other, which would be the opposite of what you’re looking for in terms of redundancy. I think part of my problem was using OVN-IC, but the over-arching problem was that when it goes wrong, it’s pretty spectacular. Dropping one node and having redundancy take up the slack, that’s my aim, and clustering over VLANs looks to me like the most robust path for this atm.

I want that redundancy in part because I already have two different components that use a variant of RAFT for distributing their shared state - namely, Incus itself and Ceph. Actually, I already have that - all nodes on my small three node cluster already have redundant 10 GBit links to all other nodes for all intra-cluster traffic (mainly because I found out that the integrated 1Gbit ports on the server hardware really are too slow for a production Ceph instance, and it was cheaper to buy three 10Gbit NICs with two ports each than to buy three 10 GBit NICs with one port each and a 10GBit switch). And that part will stay as is.

The rest is “just” more redundancy for the upstream connection of the cluster, and the basis for fully automated rollout of new network firmware (if the system survives having one component drop out at any time without disrupting service, I can just as well take one component down for maintenance whenever I want to and don’t have to bother with planning maintenance windows in advance).

It’s still unclear to me how that works.

Are we talking about multiple containers on the same VLAN in the same subnet? If so, to talk between themselves, one container would ARP for the IP address of the other, which means they need layer 2 broadcast reachability. Unless you’re doing proxy-ARP on the router?

I have a new configuration and, while I have something else I’d like to try, this works amazingly well compared to the OVN setup. In my local cluster, I’m using host-based VLANs with host-based bridges on top, then creating a physical bridge network in Incus. This works great locally, and I run my own dnsmasq which straddles all networks and services DHCP requests.

Then for the remote Incus instance (my edge gateway) I’m using a GRE L2 tunnel running over a WireGuard VPN connection, onto a dummy bridge, then into the remote Incus. So I have the same /16 ARPing across all containers, local and remote. The dnsmasq at the clustered end runs in a container and services DHCP for both local and remote containers.

I keep looking behind me to check nothing has taken a chunk out of my behind (this just looks to be ‘too easy’), but so far it looks pretty good. Networking is effectively ‘all’ managed by systemd rather than Incus, which means my ZFS lockup problems on container restarts seem to have gone away. My MTU problems have also gone and performance seems excellent.

Container => Net (phy) => Bridge => VLAN => GRE => WG => GRE => Bridge => Net (phy) => Container

… Just repeat for as many networks as you need. It’s relatively easy to automate: just create the systemd-networkd files, systemctl reload systemd-networkd, and run an incus network create.
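The GRE leg of that chain can also live in systemd-networkd; since it carries layer 2, the tunnel kind is gretap rather than plain gre. A sketch, where the addresses are assumed WireGuard endpoint IPs:

```ini
# gre-private.netdev -- L2 GRE tunnel riding over the WireGuard link
[NetDev]
Name=gre-private
Kind=gretap

[Tunnel]
Local=10.99.0.1
Remote=10.99.0.2
```

```ini
# gre-private.network -- drop the tunnel into the local bridge
[Match]
Name=gre-private

[Network]
Bridge=br-private-net
```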

All the routing “just works”; the only quirk is that dnsmasq likes to hand out a single default route, and getting it to customise that based on the host is ‘problematic’. (Same issue as with OVN: all traffic wants to go out through a common gateway.) Purely by accident I found a great Incus-based tweak to handle this:

Create a standard Incus bridge with a DHCP range that encompasses the range(s) you’re going to route over your VLAN, then, once created, just ignore it. What it does in the background is create a forward for itself that will send any non-local traffic out via NAT (!) i.e. this is what it creates in nft:

chain pstrt.internal {
    type nat hook postrouting priority srcnat; policy accept;
    ip saddr 10.0.0.0/8 ip daddr != 10.0.0.0/8 masquerade
}

So traffic not destined for the local / shared ARPed network will always go out of the local default route, rather than a common network default route. (I think this should also work for OVN.) Whereas you could set up your own rules for this, it’s kinda neat to let Incus do it …

Yes. And they have it, because they are all in the same layer 2 network on the router, and the router will send appropriate ARP responses (and IP6 NDP responses), which will reach the appropriate client because each container has its own MAC address.

Are you saying that the router is doing proxy ARP, i.e. responding to ARP broadcasts for addresses which are not its own?