Networking doesn't seem to work correctly in a nested LXD hypervisor (all elements use the "routed" nictype)

I might be missing something, but I have a weird issue with networking in a nested hypervisor. I have three levels: 1. Ubuntu 20.04 (LXD hypervisor) -> 2. Ubuntu 20.04 in a container (nested LXD hypervisor) -> 3. Debian 10 in a container

The first two levels work just fine. But the last one (Debian 10) doesn't have access to the internet or the local network.

Additionally, when I start the container in the nested hypervisor, the nested hypervisor starts losing packets. I think some kind of contention with the container is going on, but I'm far from being a networking expert.

The first level actually has more Debian 10 containers, with networking working just fine, so I thought I had figured it out, but for some reason the same setup nested one level deeper doesn't work. I believe it has something to do with routing. I didn't find anything on the internet describing such a setup.

So first things first.

LXD 4.0.5

I will group logs and data by level.

1. Ubuntu 20.04 (LXD hypervisor)

luken@lxd-hypervisor:~$ lxc list
+-------------------+---------+----------------------+------+-----------+-----------+
|       NAME        |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS |
+-------------------+---------+----------------------+------+-----------+-----------+
| hypervisor-nested | RUNNING | 192.168.7.204 (eth0) |      | CONTAINER | 0         |
+-------------------+---------+----------------------+------+-----------+-----------+
| # ... some other irrelevant containers
+-------------------+---------+----------------------+------+-----------+-----------+

luken@lxd-hypervisor:~$ lxc profile show hypervisor-nested
config:
  user.network-config: |
    version: 2
    ethernets:
      eth0:
        addresses:
        - 192.168.7.204/32
        nameservers:
          addresses:
          - 8.8.8.8
          search: []
        routes:
        - to: 0.0.0.0/0
          via: 169.254.0.1
          on-link: true
  user.user-data: |
    #cloud-config
    users:
      - name: luken
        gecos: ''
        primary_group: luken
        groups: "sudo"
        shell: /bin/bash
        sudo: ALL=(ALL) NOPASSWD:ALL
        ssh_authorized_keys:
         - <<<redacted>>>
description: Hypervisor Nested
devices:
  eth0:
    ipv4.address: 192.168.7.204
    nictype: routed
    parent: eth1
    type: nic
name: hypervisor-nested
used_by:
- /1.0/instances/hypervisor-nested

luken@lxd-hypervisor:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:be:4a:e8 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic eth0
       valid_lft 52880sec preferred_lft 52880sec
    inet6 fe80::a00:27ff:febe:4ae8/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:dd:13:72 brd ff:ff:ff:ff:ff:ff
    inet 192.168.7.200/24 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 2002:5306:912c:0:a00:27ff:fedd:1372/64 scope global dynamic mngtmpaddr 
       valid_lft 86363sec preferred_lft 86363sec
    inet6 fe80::a00:27ff:fedd:1372/64 scope link 
       valid_lft forever preferred_lft forever

 # ... some other, irrelevant veths in between

15: vethe184de76@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fe:5c:38:07:24:2c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 169.254.0.1/32 scope global vethe184de76
       valid_lft forever preferred_lft forever
    inet6 fe80::fc5c:38ff:fe07:242c/64 scope link 
       valid_lft forever preferred_lft forever

luken@lxd-hypervisor:~$ ip r
default via 192.168.7.1 dev eth1 
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100 
192.168.7.0/24 dev eth1 proto kernel scope link src 192.168.7.200 
# ... Some other irrelevant routes related to other containers in between
192.168.7.204 dev vethe184de76 scope link

# Note: I changed the default gateway myself; the whole setup runs in Vagrant, so it pointed at Vagrant's network gateway by default. The 10.0.2.x addresses are Vagrant-related.

2. Ubuntu 20.04 in container (nested LXD hypervisor)

root@hypervisor-nested:~# lxc list
+----------------+---------+----------------------+------+-----------+-----------+
|      NAME      |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS |
+----------------+---------+----------------------+------+-----------+-----------+
| test-profile-1 | RUNNING | 192.168.7.240 (eth0) |      | CONTAINER | 0         |
+----------------+---------+----------------------+------+-----------+-----------+

root@hypervisor-nested:~# lxc info test-profile-1
Name: test-profile-1
Location: none
Remote: unix://
Architecture: x86_64
Created: 2021/02/25 21:48 UTC
Status: Running
Type: container
Profiles: default, test-profile-1
Pid: 5441
Ips:
  lo:	inet	127.0.0.1
  lo:	inet6	::1
  eth0:	inet	192.168.7.240	veth40a1b521
  eth0:	inet6	fe80::ecc9:a2ff:fea9:5f0	veth40a1b521
Resources:
  Processes: 6
  CPU usage:
    CPU usage (in seconds): 1
  Memory usage:
    Memory (current): 21.74MB
    Memory (peak): 69.40MB
  Network usage:
    eth0:
      Bytes received: 446B
      Bytes sent: 11.68kB
      Packets received: 5
      Packets sent: 45
    lo:
      Bytes received: 0B
      Bytes sent: 0B
      Packets received: 0
      Packets sent: 0

root@hypervisor-nested:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether a6:59:18:27:53:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.7.204/32 brd 255.255.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a459:18ff:fe27:53a9/64 scope link 
       valid_lft forever preferred_lft forever
4: veth40a1b521@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fe:fd:66:72:a5:81 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 169.254.0.1/32 scope global veth40a1b521
       valid_lft forever preferred_lft forever
    inet6 fe80::fcfd:66ff:fe72:a581/64 scope link 
       valid_lft forever preferred_lft forever
root@hypervisor-nested:~# ip r
default via 169.254.0.1 dev eth0 proto static onlink 
192.168.7.240 dev veth40a1b521 scope link 

root@hypervisor-nested:~# lxc profile show test-profile-1
config: {}
description: 'Test profile #1'
devices:
  eth0:
    ipv4.address: 192.168.7.240
    nictype: routed
    parent: eth0
    type: nic
name: test-profile-1
used_by:
- /1.0/instances/test-profile-1

root@hypervisor-nested:~# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=116 time=43.6 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=116 time=43.2 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=116 time=43.3 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=116 time=43.0 ms
64 bytes from 8.8.8.8: icmp_seq=5 ttl=116 time=43.5 ms
64 bytes from 8.8.8.8: icmp_seq=6 ttl=116 time=43.0 ms
64 bytes from 8.8.8.8: icmp_seq=7 ttl=116 time=43.2 ms
64 bytes from 8.8.8.8: icmp_seq=8 ttl=116 time=43.3 ms
64 bytes from 8.8.8.8: icmp_seq=9 ttl=116 time=43.0 ms
64 bytes from 8.8.8.8: icmp_seq=49 ttl=116 time=1056 ms
64 bytes from 8.8.8.8: icmp_seq=50 ttl=116 time=43.4 ms
64 bytes from 8.8.8.8: icmp_seq=51 ttl=116 time=44.5 ms
^C
--- 8.8.8.8 ping statistics ---
51 packets transmitted, 12 received, 76.4706% packet loss, time 50970ms
rtt min/avg/max/mdev = 42.955/127.738/1056.017/279.886 ms, pipe 2

# ^ losing packets when Debian 10 container (test-profile-1) is active

root@hypervisor-nested:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether a6:59:18:27:53:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.7.204/32 brd 255.255.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a459:18ff:fe27:53a9/64 scope link 
       valid_lft forever preferred_lft forever
4: veth40a1b521@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fe:fd:66:72:a5:81 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 169.254.0.1/32 scope global veth40a1b521
       valid_lft forever preferred_lft forever
    inet6 fe80::fcfd:66ff:fe72:a581/64 scope link 
       valid_lft forever preferred_lft forever

root@hypervisor-nested:~# ip r
default via 169.254.0.1 dev eth0 proto static onlink 
192.168.7.240 dev veth40a1b521 scope link

3. Debian 10 in container (test-profile-1)

root@test-profile-1:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ee:c9:a2:a9:05:f0 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.7.240/32 brd 255.255.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ecc9:a2ff:fea9:5f0/64 scope link 
       valid_lft forever preferred_lft forever

root@test-profile-1:~# ip r
default via 169.254.0.1 dev eth0 
169.254.0.1 dev eth0 scope link

root@test-profile-1:~# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
^C
--- 8.8.8.8 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 89ms

root@test-profile-1:~# ping 192.168.7.204
PING 192.168.7.204 (192.168.7.204) 56(84) bytes of data.
64 bytes from 192.168.7.204: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 192.168.7.204: icmp_seq=2 ttl=64 time=0.049 ms
64 bytes from 192.168.7.204: icmp_seq=3 ttl=64 time=0.041 ms
^C
--- 192.168.7.204 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 9ms
rtt min/avg/max/mdev = 0.038/0.042/0.049/0.008 ms

# ^ I can ping the hypervisor directly above though.

IP forwarding is of course enabled on both hypervisors.
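
In other words, on both hypervisors:

sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1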

What would be the correct setup in this case that would allow the most deeply nested container to access the network and the internet?

Thanks for your well-structured explanation of the problem.

So the issue here is that you're using the routed NIC type inside the nested hypervisor containers.
There's nothing inherently wrong with this, but it won't work automatically. Here's why…

When adding a routed NIC with a specific IP (e.g. 192.168.7.204) to a container, LXD performs the following actions (a rough sketch of the equivalent commands is shown after this list):

  1. Creates a veth-pair of interfaces, one interface for the host and one for the container.
  2. On the host side of the interface it adds a special link-local IP address of 169.254.0.1 (it adds this same IP to the host-side interface of every routed NIC; as it is link-local, they don't conflict).
  3. Adds a static device route on the host to the static IP 192.168.7.204 pointing to the veth interface on the host (this causes packets that arrive at the LXD host for the target IP to be forwarded down the veth interface so they arrive inside the container).
  4. Adds a proxy ARP entry on the parent interface for the static IP (more on this later).
  5. Inside the container it configures the static IP on the veth interface with a /32 subnet (meaning there are no other IPs directly reachable on that link without using the container’s routing table).
  6. Inside the container it also adds a default route pointing to 169.254.0.1 on the container's veth interface. This causes packets destined for any IP other than its own to be forwarded to the veth interface with a next-hop of 169.254.0.1, which causes them to arrive back at the host.
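
Roughly speaking, the net effect is similar to running the following by hand (this is just an illustrative sketch; the interface names are made up and are not what LXD actually uses):

# on the host
ip link add veth-host type veth peer name veth-cont      # 1. create the veth pair (veth-cont is then moved into the container's network namespace)
ip link set veth-host up
ip addr add 169.254.0.1/32 dev veth-host                 # 2. link-local next-hop address on the host side
ip route add 192.168.7.204/32 dev veth-host              # 3. device route so traffic for the container's IP goes down the veth
ip neigh add proxy 192.168.7.204 dev eth1                # 4. proxy ARP entry on the parent interface

# inside the container
ip addr add 192.168.7.204/32 dev eth0                    # 5. the /32 address on the container side
ip route add default via 169.254.0.1 dev eth0 onlink     # 6. default route back towards the host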

When the packets arrive at the host, the host uses its own routing table to make a routing decision as to which interface to route the packets out of (e.g. if pinging 8.8.8.8 it will likely use its default route to forward the packets onto the external gateway).

Now let's think about the return packets (coming back from 8.8.8.8, destined for 192.168.7.204).
The upstream gateway is unlikely to know that 192.168.7.204 is specifically hosted by the LXD hypervisor (unless you've also manually added a static route to your gateway for that IP), so your gateway will not know which host to send the reply packets for 192.168.7.204 to.

To solve this, a protocol called ARP is used: your gateway will broadcast a packet onto the internal network asking "who has 192.168.7.204?", expecting the LXD node to reply with its MAC address if it has that IP.

However, the LXD host doesn't actually have that IP bound to it (it's inside the container), so normally it wouldn't reply to that ARP request. But LXD also sets up a so-called proxy ARP entry for 192.168.7.204 on the host's external interface (the one specified by parent when adding a routed NIC). This causes Linux to reply to the ARP request with the host's own MAC address.

Now the gateway knows where to send the packets (onto the LXD host) and when they arrive, the LXD host uses the static route added to forward the reply packets into the container.
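
On the working first level you can see both pieces on the top-level host; for example (illustrative commands, not output you've posted):

ip neigh show proxy            # should list 192.168.7.204 on dev eth1
ip route get 192.168.7.204     # should resolve to the vethe184de76 device route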

So with this understanding of how it works at one level, let's now think about what happens if inside the container we nest another LXD hypervisor with another container.

A routed NIC is added to the nested container, so a static route, a proxy ARP entry on the parent, and the IP address 169.254.0.1 on the veth interface are added on the container hypervisor, and the nested container is set up with its IP address and a default route back to 169.254.0.1.

When the nested container pings 8.8.8.8 from its IP 192.168.7.240, the packets go to the container hypervisor via the default route to 169.254.0.1. The container hypervisor in turn uses its own routing table to forward them, via its own default route to 169.254.0.1 on the nested hypervisor's veth, back up to the top-level hypervisor host. The packets arrive at the top-level hypervisor host, which uses its routing table to forward them onto the external network towards the gateway. So far so good.

Now for the reply packets: they arrive at your gateway, and your gateway performs ARP to find out which host has 192.168.7.240. This is where the problems start. Your top-level LXD host doesn't have a proxy ARP entry for 192.168.7.240 (only the nested hypervisor has that), so it doesn't reply, so the gateway doesn't know where to send the reply packets and they will be dropped.

Additionally, assuming the packets did arrive at the LXD host (perhaps via a static route on the gateway), the top-level LXD host doesn’t know how to route packets to 192.168.7.240 down the veth interface for 192.168.7.204, and so they will be dropped at the LXD host as well.

This explains why the nested container can ping the nested hypervisor but not the internet or the top-level hypervisor.
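
You can confirm the gap on the top-level host; with the routing table you posted, something like this (illustrative) would show it:

ip neigh show proxy | grep 192.168.7.240   # nothing: no proxy ARP entry for the nested container's IP
ip route get 192.168.7.240                 # resolves onto eth1 via the 192.168.7.0/24 connected route, not down the hypervisor-nested veth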

There are various ways to solve this, with or without using routed NICs. However, it would be useful to understand what you're trying to achieve and why you chose routed NICs for it.

One way around this is to “publish” multiple IPs on the top level container, e.g.

lxc config device set hypervisor-nested eth0 ipv4.address=192.168.7.204,192.168.7.240

This will set up the static routes and proxy ARP entries on the top-level host for both addresses, so that the top-level host will reply to ARP for both and will know how to route both addresses.

Then inside the hypervisor-nested container, remove the IP address 192.168.7.240 and use it with the nested container as you were doing already.
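
After that change you'd expect the top-level host to show something along these lines (illustrative; the veth name is taken from your earlier output and will differ after a restart):

luken@lxd-hypervisor:~$ ip r
# ... other routes ...
192.168.7.204 dev vethe184de76 scope link
192.168.7.240 dev vethe184de76 scope link
luken@lxd-hypervisor:~$ ip neigh show proxy
192.168.7.204 dev eth1 proxy
192.168.7.240 dev eth1 proxy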

One thing I've noticed with this setup is that the default next-hop address 169.254.0.1 causes problems when nesting, because it is also set up on the container hypervisor's veth-side interface, and this stops it replying to ARP requests from the physical host.

A way around this is to specify a different host-side address for the nested containers, e.g.

lxc config device set test-profile-1 eth0 ipv4.host_address=169.254.0.2
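
With that set, the routes inside test-profile-1 should end up looking something like this instead (a sketch, not actual output):

root@test-profile-1:~# ip r
default via 169.254.0.2 dev eth0
169.254.0.2 dev eth0 scope link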

@tomp Wow, thank you for this amazing explanation. The reason I went for the "routed" nictype everywhere is simply that it seemed the simplest and cleanest option to me (multiple IPs and a single MAC address; ipvlan tutorials were nowhere to be found yet when I started). Now I think it's not a good solution for nested hypervisors, because the configuration of both hypervisors becomes tightly coupled: the top hypervisor has to be aware of the IPs used by the nested hypervisor, which is pretty ugly.

I haven't tried the "bridged" nictype yet, but if it works like it sounds (by bridging everything), it could actually be the cleanest solution, though not ideal because of the multiple MAC addresses, which make the setup less portable (wifi issues). ipvlan seems to have issues with Debian ( https://blog.simos.info/how-to-get-lxd-containers-obtain-ip-from-the-lan-with-ipvlan-networking/ ). I may settle for bridging if nothing else works, but I can see I need to try some other things.

If you could share what your favorite setup for nested hypervisors would be, it would be very helpful, as there is still very little information around on this topic.

You didn't mention why you are using nested containers; I feel this is important to understand before proposing a solution.

@tomp I'm prototyping a CI/CD setup for web development. Each nested hypervisor will represent a single project and will be responsible for spawning environments for integration testing (periodically, for each active feature branch). That's why I want it to be an independent unit that can spawn new containers at will, without the top hypervisor knowing what's going on.

And these independent nested CI/CD environments need to support inbound connections from the external network (i.e. they need to be represented on the external network with their own IPs rather than being source NATted)?

The reason I ask is that if inbound connections into each nested container aren't required, then you could use routed on the top-level instances to provide an external IP for each nested hypervisor (so the source of the traffic from each CI/CD environment is easily identifiable), and then inside the nested hypervisor use an LXD-managed bridge network with source NAT enabled to mask all outbound connections from the nested instances behind the nested hypervisor's IP (see the sketch below).

This would also prevent nested instances from stealing the IPs of other nodes on the external network (as they would be taking advantage of the routed NIC's ability to prevent layer 2 broadcasts onto the external network).
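
As a rough sketch of the nested hypervisor side (the bridge name and the 10.10.10.0/24 subnet are just example values):

# inside the nested hypervisor: a NATted managed bridge for its instances
lxc network create lxdbr0 ipv4.address=10.10.10.1/24 ipv4.nat=true ipv6.address=none
# then have the nested instances use it (via whichever profile they use, e.g. default)
lxc profile device add default eth0 nic name=eth0 network=lxdbr0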

Yes, they need to support inbound connections. I'm creating DNS records for each container created in the nested hypervisor. The nested hypervisor will probably have a webserver installed, listing all currently running containers with some metadata (like when they were created). The point is that all feature branches should also be manually tested before merging, so they should be readily available for testing.

Would ipvlan even help in this case, or would we have the same situation as with the routed nictype?

The ipvlan NIC type has the same restriction as routed, in that the LXD host needs to know which IPs are being routed to the container (and thus nested hypervisors won't be able to communicate their own instances' IPs up to the top-level hypervisor). In addition, ipvlan (like macvlan) doesn't allow the instances to communicate with their immediate hypervisor host (which may or may not be required in your situation).

If the instances you are running inside the nested hypervisors are trusted, and you truly need your nested containers to have full access to the external network, with the ability to hijack IPs (by way of ARP spoofing etc.), then your best bet is to use bridging all the way down and to disable LXD's built-in DHCP and DNS server (leaving it up to the external network's DHCP and DNS server).

First, create an unmanaged bridge (e.g. br0) and move your top-level hypervisor's external IP config onto it (and off the eth1 interface), then connect eth1 to br0, e.g. using:

https://netplan.io/examples/#configuring-network-bridges

Then in your first level container, use a bridged NIC type with parent set to br0.

Next, inside your first-level container, repeat the process for the eth0 interface (i.e. create a br0 interface inside the container, with either statically defined IP config or using DHCP) and then connect eth0 to br0.

Finally, for the nested containers, use a NIC type of bridged and a parent of br0, and that should join them to the external network.
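
As a minimal sketch of the top-level part (addresses taken from your earlier output; the file name and details are just examples), the netplan config on the physical host could look something like:

# e.g. /etc/netplan/99-bridge.yaml on the top-level host
network:
  version: 2
  ethernets:
    eth1:
      dhcp4: false
  bridges:
    br0:
      interfaces: [eth1]
      addresses: [192.168.7.200/24]
      gateway4: 192.168.7.1
      nameservers:
        addresses: [8.8.8.8]

Then switch the first-level container's NIC over to it (and adjust the cloud-init network config in its profile accordingly):

lxc profile device remove hypervisor-nested eth0
lxc profile device add hypervisor-nested eth0 nic nictype=bridged parent=br0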

This won’t work if the external interface on the top-level host is a wireless card, as the MAC address is used as part of the authentication.

Thank you for your help, I really appreciate it. You've helped me understand much better how Linux networking works. I think I will live with the fact that the top hypervisor needs to be aware of the list of IPs used by the nested one; it's not a showstopper, and it avoids the issues with multiple MAC addresses, especially since I want to generate containers dynamically. I tried to run nested ipvlan just to test it, but it didn't work at all: the nested hypervisor didn't even get created. So I will settle for nested routed, which works just fine with your hints. Thank you once more :slight_smile:.
