Use case for multiple interfaces in a single bridge

This is going to be a long ride, so please get your favorite beverage and relax.

We run our product’s CI in containers on top of LXD. The product is tested on clusters of different sizes and, because of some features of the product, different numbers of NICs. This means we can have 1, 3, 6, 12 or 60 nodes with 1, 2 or 3 NICs configured. Technically we could have any number, but I think we never used more than 3. All these node interfaces live in separate IP networks. These environments are created, some tests are run, and then they are destroyed, until the next CI run comes along.

We need a lot of flexibility because this same setup is used to reproduce setups at clients, which can have quite arbitrary network layouts, and sometimes we chase bugs related to that.

All this runs on a single host. We know about LXD clustering, but because of the way we lean on dnsmasq for internal DNS (our product is quite sensitive to this, and it’s needed to run the tests), it’s currently not possible to use it.

We also don’t drive LXD directly; we use Terraform on top of it. We did this because we thought we could create a single description of the clusters and run it on different backends, like LXD or Amazon EC2. We never got to the point where that happened, so currently we only run on LXD. We could remove Terraform, but since its interface is quite simple (just a text file that we can generate from a template), we’re keeping it. I’m in favor of lean stacks, but I still think we can get to the point where we run on EC2.

Anyways…

For each cluster we define at least one LXD bridge network. All the interfaces of the nodes are connected to it. We do it this way because so far we could, and it requires fewer definitions.
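
To make this concrete: the per-cluster bridge is just an ordinary managed bridge. We actually generate the definition with Terraform, but in lxc terms creating one would look roughly like this (a sketch; the name and addressing match the c1r1dc2n example shown further down):

$ lxc network create c1r1dc2n \
    dns.domain=1r1dc2n.cloudian.eu \
    ipv4.address=10.21.1.254/24 \
    ipv4.nat=false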

The LXD hosts also have another LXD bridge network that we use to provision the nodes of the clusters through an extra interface. We also add an interface to each node where we run IPMI. That means that if we create a cluster with 2 declared interfaces per node, we add an extra one for provisioning and another one for IPMI, ending with 4 interfaces in total. These two extra interfaces are attached to the host’s bridge. We use the provisioning interface to run Ansible, but that’s almost irrelevant here :slight_smile:
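
In lxc terms, attaching those two extra interfaces to a node boils down to something like this (a sketch; we actually declare them in Terraform, as shown below):

$ lxc config device add <instance> provision nic nictype=bridged parent=lxd-provision name=provision
$ lxc config device add <instance> ipmi nic nictype=bridged parent=lxd-provision name=ipmi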

That per-cluster bridge is also the one handling the cluster’s internal DNS. We configure this via raw.dnsmasq like this:

        "raw.dnsmasq" = <<EOF
no-hosts
address=/s3-admin.singlenode.cloudian.eu/10.11.1.151
address=/cmc.singlenode.cloudian.eu/10.11.1.151
address=/iam.singlenode.cloudian.eu/10.11.1.151
address=/s3-sqs.singlenode.cloudian.eu/10.11.1.151
address=/s3-eu-1.singlenode.cloudian.eu/10.11.1.151
address=/the-one.singlenode.cloudian.eu/10.11.1.151
address=/s3-website-eu-1.singlenode.cloudian.eu/10.11.1.151
address=/singlenode/10.11.1.151
EOF

That’s for a single node cluster; if we had more nodes, there would be an address for each. If the nodes have more than one interface, we only list one address for them.

This morning LXD 5.7 landed on our CI machines and we started getting this error:

Error: failed to start container (singlenode): Failed start validation for device "ipmi": Instance DNS name "singlenode" already used on network

The node’s interfaces are declared like this in Terraform:

      + device {
          + name       = "eth0"
          + properties = {
              + "name"    = "eth0"
              + "nictype" = "bridged"
              + "parent"  = "singlenode"
            }
          + type       = "nic"
        }
      + device {
          + name       = "ipmi"
          + properties = {
              + "name"    = "ipmi"
              + "nictype" = "bridged"
              + "parent"  = "lxd-provision"
            }
          + type       = "nic"
        }
      + device {
          + name       = "provision"
          + properties = {
              + "name"    = "provision"
              + "nictype" = "bridged"
              + "parent"  = "lxd-provision"
            }
          + type       = "nic"
        }

Here we see a single interface, eth0, linked to the per-cluster network, and the ipmi and provision interfaces linked to the per-host bridge. LXD 5.7 is complaining about these last two. It’s nice that we can name them whatever we want and more or less make sure they won’t clash with real-world interface names.

Maybe the solution for us would be to start building one bridge per cluster interface, but @tomp asked me to post this, so here we are.

Yes using one LXD managed bridge per instance NIC type would work.

I am interested in seeing the output of lxc network show lxdbr0 and an example of lxc config show <instance> --expanded for an instance, as well as in understanding how the NICs get configured inside the instance.

Perhaps there is a combination of settings that would allow us to safely relax the new check.

This is the per-host LXD bridge network we use for provisioning, where both the ipmi and provision interfaces are attached:

22:49 $ lxc network show lxd-provision
config:
  dns.mode: none
  ipv4.address: 10.46.239.1/24
  ipv4.nat: "true"
  ipv6.address: fd42:a318:d7ae:13de::1/64
  ipv6.nat: "true"
  raw.dnsmasq: |-
    dhcp-ignore-names
    local-ttl=4938790
description: ""
name: lxd-provision
type: bridge
used_by:
- /1.0/instances/c1r1dc2n-node1
- /1.0/instances/c1r1dc2n-node1
- /1.0/instances/c1r1dc2n-node2
- /1.0/instances/c1r1dc2n-node2
managed: true
status: Created
locations:
- none

This is the per-cluster LXD bridge network where we attach the interfaces we use to test our product. In this case we have 3 interfaces per node and 2 nodes:

22:51 $ lxc network show c1r1dc2n
config:
  dns.domain: 1r1dc2n.cloudian.eu
  dns.search: 1r1dc2n.cloudian.eu
  ipv4.address: 10.21.1.254/24
  ipv4.nat: "false"
  ipv6.address: fd42:5944:b8ba:cfe9::1/64
  ipv6.nat: "true"
  raw.dnsmasq: |
    no-hosts
    address=/admin.1r1dc2n.cloudian.eu/10.21.1.151
    address=/cmc.1r1dc2n.cloudian.eu/10.21.1.151
    address=/iam.1r1dc2n.cloudian.eu/10.21.1.151
    address=/s3-sqs.1r1dc2n.cloudian.eu/10.21.1.151
    address=/s3-eu-1.1r1dc2n.cloudian.eu/10.21.1.151
    address=/s3-website-eu-1.1r1dc2n.cloudian.eu/10.21.1.151
    address=/c1r1dc2n-node1/10.21.1.151
    address=/c1r1dc2n-node2/10.21.1.152
description: ""
name: c1r1dc2n
type: bridge
used_by:
- /1.0/instances/c1r1dc2n-node1
- /1.0/instances/c1r1dc2n-node1
- /1.0/instances/c1r1dc2n-node1
- /1.0/instances/c1r1dc2n-node2
- /1.0/instances/c1r1dc2n-node2
- /1.0/instances/c1r1dc2n-node2
managed: true
status: Created
locations:
- none

Here’s one of the instances:

22:54 $ lxc config show c1r1dc2n-node1 --expanded
architecture: x86_64
config:
  boot.autostart: "true"
  image.architecture: x86_64
  image.description: Centos 7 x86_64 (20190325_07:08)
  image.name: centos-7-x86_64-default-20190325_07:08
  image.os: centos
  image.release: "7"
  image.serial: "20190325_07:08"
  image.variant: default
  limits.cpu: "4"
  limits.memory: 8GB
  raw.lxc: |
    lxc.cgroup.devices.allow = c 10:137 rwm
    lxc.cgroup.devices.allow = b 7:* rwm
    lxc.cgroup.devices.allow = c 10:237 rwm
  security.nesting: "false"
  security.privileged: "true"
  user.access_interface: provision
  user.cloudian.installer: "true"
  user.network-config: |+
    version: 1
    config:
    - name: provision
      type: physical
      subnets:
      - type: dhcp

    - name: eth0
      type: physical

    - name: eth1
      type: physical

    - name: eth2
      type: physical

    - name: ipmi
      type: physical

  user.user-data: |
    #cloud-config

    runcmd:
    - sed -i 's/session.*required.*pam_loginuid.so/#session\trequired\tpam_loginuid.so/' /etc/pam.d/*
    - sed -i 's/session.*required.*pam_limits.so/#session\trequired\tpam_limits.so/' /etc/pam.d/*

    locale: en_US.UTF-8
    timezone: Europe/Amsterdam

    users:
    - name: root
  volatile.cloud-init.instance-id: 29db7179-6461-44bc-9d0a-4b377ba230a6
  volatile.eth0.host_name: vethdc771332
  volatile.eth0.hwaddr: 00:16:3e:73:80:78
  volatile.eth1.host_name: veth8c5bad8d
  volatile.eth1.hwaddr: 00:16:3e:10:8c:17
  volatile.eth2.host_name: vethda7e79cf
  volatile.eth2.hwaddr: 00:16:3e:05:98:f6
  volatile.idmap.base: "0"
  volatile.idmap.current: '[]'
  volatile.idmap.next: '[]'
  volatile.ipmi.host_name: vethce19297e
  volatile.ipmi.hwaddr: 00:16:3e:a9:4b:9d
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.provision.host_name: veth350cc1bf
  volatile.provision.hwaddr: 00:16:3e:57:ee:f8
  volatile.uuid: 229ca474-74f8-41bf-86e5-a40f1bc0e44c
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: c1r1dc2n
    type: nic
  eth1:
    name: eth1
    nictype: bridged
    parent: c1r1dc2n
    type: nic
  eth2:
    name: eth2
    nictype: bridged
    parent: c1r1dc2n
    type: nic
  ipmi:
    name: ipmi
    nictype: bridged
    parent: lxd-provision
    type: nic
  provision:
    name: provision
    nictype: bridged
    parent: lxd-provision
    type: nic
  root:
    path: /
    pool: local
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

Good thing I mentioned Ansible, then :slight_smile: The provision interface is configured to do DHCP via cloud-init; ipmi is currently left unconfigured. All the rest are configured with Ansible over the provision interface. The IPs are important to our product, so we need to control them as much as possible, especially when replicating client clusters.
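
Roughly speaking, the end result on the CentOS 7 nodes is plain static network-scripts configuration, something like this (an illustrative sketch using node1’s eth0 values from above, not the literal output of our roles):

$ cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
IPADDR=10.21.1.151
NETMASK=255.255.255.0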

We said this on IRC:

[16:38:31] question would be: why couldn’t a DNS name point to different interfaces of the same machine?
[16:45:53] StyXman: this isnt something the underlying dnsmasq server supports AFAIK
[16:46:16] As the DNS name is configured on a per dhcp reservation
[16:46:36] hmmmm, I think not quite
[16:47:38] yes, either the client or the dhcp server can define a name for the lease, but I think dnsmasq merges it with whatever else you define with address and I think other settings
[16:52:05] StyXman: by default the instance name, not dhcp name is configured in [dnsmasq]
[16:52:51] but you mentioned dhcp reservation, maybe I misunderstood what you meant?

Unfortunately that last question was lost in the exchange, I guess. Meanwhile, I read dnsmasq’s manpage:

To give multiple addresses or both IPv4 and IPv6 addresses for a domain, use repeated --address flags

… which is the option we’re using via raw.dnsmasq up there (but we’re not using it to configure all of a node’s interfaces under the same name).
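
If we ever wanted to try that, I suppose it would just mean repeating the address= line for the same name, something like this (a sketch using node1’s addresses from the lxc list output below; not something we actually do):

$ lxc network set c1r1dc2n raw.dnsmasq "no-hosts
address=/c1r1dc2n-node1/10.21.1.151
address=/c1r1dc2n-node1/10.21.2.151
address=/c1r1dc2n-node1/10.21.3.1"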

Ah, so we could relax the check if dns.mode is none, as there’s no risk of a DNS name conflict then.

This is how the nodes look after being fully configured:

23:16 $ lxc list
+----------------+---------+---------------------------+------+-----------+-----------+
|      NAME      |  STATE  |           IPV4            | IPV6 |   TYPE    | SNAPSHOTS |
+----------------+---------+---------------------------+------+-----------+-----------+
| c1r1dc2n-node1 | RUNNING | 10.46.239.142 (provision) |      | CONTAINER | 0         |
|                |         | 10.21.3.1 (eth2)          |      |           |           |
|                |         | 10.21.2.151 (eth1)        |      |           |           |
|                |         | 10.21.1.151 (eth0)        |      |           |           |
+----------------+---------+---------------------------+------+-----------+-----------+
| c1r1dc2n-node2 | RUNNING | 10.46.239.143 (provision) |      | CONTAINER | 0         |
|                |         | 10.21.3.2 (eth2)          |      |           |           |
|                |         | 10.21.2.152 (eth1)        |      |           |           |
|                |         | 10.21.1.152 (eth0)        |      |           |           |
+----------------+---------+---------------------------+------+-----------+-----------+

All functional interfaces can ping the ones in their respective networks.

Out of interest, what does the ip r output look like inside the container?

[root@c1r1dc2n-node1 ~]# ip r
default via 10.46.239.1 dev provision 
10.21.1.0/24 dev eth0 proto kernel scope link src 10.21.1.151 
10.21.2.0/24 dev eth1 proto kernel scope link src 10.21.2.151 
10.21.3.0/24 dev eth2 proto kernel scope link src 10.21.3.1 
10.46.239.0/24 dev provision proto kernel scope link src 10.46.239.142 
169.254.0.0/16 dev ipmi scope link metric 1045 
169.254.0.0/16 dev provision scope link metric 1049 

I wonder why there is a route on ipmi.

I see so you’re running multiple subnets on the same bridge. I suppose that works :wink:

The 169.254.0.0/16 route is the default link-local subnet for unconfigured interfaces.

Anyway, I think if we relax the name check so it doesn’t apply to networks that have dns.mode set to none, then that would be fine.

Could you use that setting for the other network as well?
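
That is, something along these lines on the per-cluster bridge (using the network name from your output above):

$ lxc network set c1r1dc2n dns.mode none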

Also:

Note that /etc/hosts and DHCP leases override this for individual names

We’re not using the former (no-hosts in raw.dnsmasq) but cloud-init sets up dhclient with the instance name:

/sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient--provision.lease -pf /var/run/dhclient-provision.pid -H c1r1dc2n-node1 provision

Still, the name stays associated with our interface:

[root@c1r1dc2n-node1 ~]# ping -c 1 c1r1dc2n-node1
PING c1r1dc2n-node1 (10.21.1.151) 56(84) bytes of data.
64 bytes from c1r1dc2n-node1 (10.21.1.151): icmp_seq=1 ttl=64 time=0.038 ms

“hsbr”?

That’s because we use dhcp-ignore-names in raw.dnsmasq.

I guess so, yes.

Sorry, I meant “have”. I’m on my phone, as it’s the middle of the night here.

I’m prepared to relax the check for networks with dns.mode set to none, as in that case there can be no DNS name conflict.

But on the other network you have 3 NICs connected to the same bridge (all of which could potentially do DHCP using the same name, though in your case 2 out of the 3 are statically configured with additional subnets that LXD doesn’t know about).

I don’t think we should explore the route of returning multiple IPs (of the same protocol) for a single DNS name, as that would be confusing when reaching services by those names; the services may be listening on a specific IP.

The proper way to set this up is to have 3 managed networks. Then you wouldn’t get name conflicts and you wouldn’t be running multiple subnets over the same bridge (which is rather unorthodox, as you don’t have layer-2 separation between the subnets).

The other approach is to use the vlan settings on the bridged NIC, as we support running multiple VLANs over the same bridge (with no DHCP or gateway).

We could then relax the DNS name check so that it only applies to NICs in the default VLAN, which are the ones that can use the network’s DNS.

See vlan filtering in LXD 4.2 has been released
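
For example, something like this on the NICs that shouldn’t use the network’s DNS/DHCP (a sketch; the VLAN IDs are arbitrary):

$ lxc config device set c1r1dc2n-node1 eth1 vlan=101
$ lxc config device set c1r1dc2n-node1 eth2 vlan=102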

Yes, all 3 are static: one is in the same IP network as the bridge they connect to (so it can use the DNS at that IP), and the rest (1, 2, any number) are in other networks (or not; you know, clients… like me :^)

But I get your point that if you open this door, then you have to consider whether DHCP is in play or not. Technically, if any interface is getting DHCP on this bridge, it’s most probably from the dnsmasq running there, so there is no possibility that it belongs to a network different from the one dnsmasq is listening on, right? If it’s getting it from some other DHCP server on the same bridge, then the user is probably (ab)using the system (just like I am, I think) and should know what they’re doing.

I think I’ll go the ‘one switch per interface’ route, but thanks for trying to accommodate us.
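
Roughly, instead of one shared c1r1dc2n bridge we’d generate one bridge per node interface and point each NIC’s parent at the matching one; in lxc terms something like this (a sketch, with gateway addresses chosen by analogy to the existing ones):

$ lxc network create c1r1dc2n-eth0 ipv4.address=10.21.1.254/24 ipv4.nat=false
$ lxc network create c1r1dc2n-eth1 ipv4.address=10.21.2.254/24 ipv4.nat=false
$ lxc network create c1r1dc2n-eth2 ipv4.address=10.21.3.254/24 ipv4.nat=false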

This PR implements those changes: