"Set up an Incus cluster on OVN"

Hi,

I’m just wondering if someone has gone through “Set up an Incus cluster on OVN” lately and, if so, whether you succeeded in setting it up.

I find the text to be somewhat confusing as it leaves a lot of things implicit:

  • Does this only work if the OVN network is set up before the initial clustering is done, or does it also work for an existing cluster?
  • Should the OVN IPs be used to tie the cluster together, or the “external” IPs that are normally suggested as the default?
  • The documentation states that you should use a “manually created unmanaged bridge” but only refers to Netplan. I’m running Incus on Debian and I assume this can be done with systemd-networkd. Regardless, some examples would be nice to illustrate the expected interface configurations for the UPLINK and for another interface (if one is expected). How should or could the network configuration on the hosts look in general?
  • The documentation states that you should use “suitable IP ranges” when defining the UPLINK network. An example really wouldn’t hurt.
  • When defining this network, the formats of ipv4.gateway and dns.nameservers are left completely implicit. It wouldn’t hurt to mention that the former must be in CIDR notation and the latter must not be (again, examples would be nice).
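
To make the last two points concrete, this is roughly the kind of example I mean. It is only a sketch: the addresses are made up for a 172.17.0.0/16 uplink, and the parent interface name (br0) and the member names (incus1 to incus3) are assumptions. The script prints the commands instead of running them, so it can be reviewed safely first.

```shell
#!/bin/sh
# Sketch only: prints the UPLINK commands instead of running them.
# Every value below is an assumption chosen for illustration.
PARENT="br0"                               # unmanaged bridge (assumed name)
OVN_RANGES="172.17.200.1-172.17.200.254"   # FIRST-LAST range, no prefix length
GATEWAY="172.17.0.1/16"                    # CIDR: address plus prefix length
NAMESERVERS="172.17.0.1"                   # plain address, no prefix length

uplink_commands() {
    # One per-member definition, then the cluster-wide one with the ranges.
    for member in incus1 incus2 incus3; do
        echo "incus network create UPLINK --type=physical parent=$PARENT --target=$member"
    done
    echo "incus network create UPLINK --type=physical ipv4.ovn.ranges=$OVN_RANGES ipv4.gateway=$GATEWAY dns.nameservers=$NAMESERVERS"
}

uplink_commands
```

The point of the sketch is the three value formats side by side: a dash-separated range, a gateway with a prefix length, and a nameserver without one.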

If I could help out and just fix the documentation I would, but I’m on my third or so iteration of trying to get this to work now and at this point I’m not even sure I remember what’s working and what’s not between tries.

I would appreciate it hugely if someone who has time could spin up three VMs, install Incus, cluster it, follow this guide, and let me know whether they manage and, if so, with what interface and network configurations.

Cheers.

I should have time next week.

Have you tried using incus-deploy?

I have looked at it briefly, and I haven’t found anything that clearly mismatches what I’m trying to do (apart from it using TLS), but obviously I’m missing something.

Here’s a short description of what I’m doing, if it’s any help:

# incus ls -c n,4,t
+--------+----------------------+-----------------+
|  NAME  |         IPV4         |      TYPE       |
+--------+----------------------+-----------------+
| incus1 | 172.17.1.10 (enp6s0) | VIRTUAL-MACHINE |
|        | 10.14.91.71 (enp5s0) |                 |
+--------+----------------------+-----------------+
| incus2 | 172.17.1.20 (enp6s0) | VIRTUAL-MACHINE |
|        | 10.14.91.98 (enp5s0) |                 |
+--------+----------------------+-----------------+
| incus3 | 172.17.1.30 (enp6s0) | VIRTUAL-MACHINE |
|        | 10.14.91.80 (enp5s0) |                 |
+--------+----------------------+-----------------+

These are using incusbr0 and incusbr1, where the latter carries the 172.17.1.x addresses on enp6s0 that I’m aiming to use for OVN. That interface has a static IP definition via systemd-networkd, while the 10.14.91.x address comes from DHCP on incusbr0.

So I set up the cluster (communicating over the 10.14.91.x addresses), then fill out /etc/default/ovn-central on all nodes according to the docs. I then start the ovn-central service on all nodes and configure open_vswitch, including the encap IP, on each node.
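
As a rough sketch of the open_vswitch step (not a transcript of what was actually run): the node addresses are the 172.17.1.x ones from the listing, port 6642 is assumed to be the OVN southbound port, and LOCAL_IP has to be changed on each node. The ovn_remotes helper is a hypothetical name, and the ovs-vsctl call is only printed.

```shell
#!/bin/sh
# Sketch only: builds the connection string and prints the ovs-vsctl call.
# LOCAL_IP must be set to the node's own address on each cluster member.
NODES="172.17.1.10 172.17.1.20 172.17.1.30"
LOCAL_IP="172.17.1.10"

# Join all nodes into "tcp:IP:PORT,tcp:IP:PORT,..." for the given port.
ovn_remotes() {
    port="$1"
    out=""
    for ip in $NODES; do
        out="${out:+$out,}tcp:$ip:$port"
    done
    echo "$out"
}

printf '%s\n' "ovs-vsctl set open_vswitch . external_ids:ovn-remote=$(ovn_remotes 6642) external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=$LOCAL_IP"
```

The same helper with port 6641 would give the northbound connection string.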

I then create the UPLINK network targeting all nodes, followed by setting the ovn.ranges etc. This does not error, so I then set the northbound_connection value for Incus and run incus network create my-ovn --type=ovn.
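
The last two steps, sketched under the assumption that the northbound database listens on port 6641 on each of the three 172.17.1.x nodes. The commands are printed rather than executed, and final_steps is just a hypothetical wrapper name.

```shell
#!/bin/sh
# Sketch only: prints the final two steps instead of running them.
# Assumes the UPLINK network already exists on all cluster members.
NB_CONNECTION="tcp:172.17.1.10:6641,tcp:172.17.1.20:6641,tcp:172.17.1.30:6641"

final_steps() {
    # Point Incus at the OVN northbound database, then create the OVN network.
    echo "incus config set network.ovn.northbound_connection $NB_CONNECTION"
    echo "incus network create my-ovn --type=ovn"
}

final_steps
```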

At this point, enp6s0, which previously had a static 172.17.1.x address, loses its IP and goes down, and the Incus command hangs. These are the last 50 log lines:

# journalctl --no-pager --since=-20m | tail -n 50
Oct 29 11:13:01 incus1 ovsdb-server[1811]: ovs|00028|reconnect|ERR|tcp:172.17.1.30:6644: no response to inactivity probe after 2 seconds, disconnecting
Oct 29 11:13:01 incus1 ovsdb-server[1811]: ovs|00029|reconnect|INFO|tcp:172.17.1.30:6644: connection dropped
Oct 29 11:13:01 incus1 ovsdb-server[1811]: ovs|00030|reconnect|ERR|tcp:172.17.1.20:50054: no response to inactivity probe after 2 seconds, disconnecting
Oct 29 11:13:01 incus1 ovsdb-server[1811]: ovs|00031|reconnect|ERR|tcp:172.17.1.30:48302: no response to inactivity probe after 2 seconds, disconnecting
Oct 29 11:13:01 incus1 ovsdb-server[1806]: ovs|00029|reconnect|ERR|tcp:172.17.1.20:48970: no response to inactivity probe after 2 seconds, disconnecting
Oct 29 11:13:01 incus1 ovsdb-server[1806]: ovs|00030|reconnect|ERR|tcp:172.17.1.30:43052: no response to inactivity probe after 2 seconds, disconnecting
Oct 29 11:13:01 incus1 ovsdb-server[1806]: ovs|00031|raft|INFO|term 3: 1957 ms timeout expired, starting election
Oct 29 11:13:02 incus1 ovsdb-server[1806]: ovs|00032|reconnect|INFO|tcp:172.17.1.20:6643: connecting...
Oct 29 11:13:02 incus1 ovsdb-server[1806]: ovs|00033|reconnect|INFO|tcp:172.17.1.20:6643: connected
Oct 29 11:13:02 incus1 ovsdb-server[1811]: ovs|00032|reconnect|INFO|tcp:172.17.1.20:6644: connecting...
Oct 29 11:13:02 incus1 ovsdb-server[1811]: ovs|00033|reconnect|INFO|tcp:172.17.1.20:6644: connected
Oct 29 11:13:02 incus1 ovsdb-server[1811]: ovs|00034|raft|INFO|rejecting term 2 < current term 3 received in append_request message from server 69d7
Oct 29 11:13:02 incus1 ovsdb-server[1811]: ovs|00035|reconnect|INFO|tcp:172.17.1.30:6644: connecting...
Oct 29 11:13:02 incus1 ovsdb-server[1806]: ovs|00034|reconnect|INFO|tcp:172.17.1.30:6643: connecting...
Oct 29 11:13:02 incus1 ovsdb-server[1811]: ovs|00036|reconnect|INFO|tcp:172.17.1.30:6644: connected
Oct 29 11:13:02 incus1 ovsdb-server[1806]: ovs|00035|reconnect|INFO|tcp:172.17.1.30:6643: connected
Oct 29 11:13:02 incus1 ovsdb-server[1806]: ovs|00036|raft|INFO|rejecting term 2 < current term 3 received in append_request message from server 91e9
Oct 29 11:13:02 incus1 ovsdb-server[1811]: ovs|00037|raft|INFO|term 4: 1478 ms timeout expired, starting election
Oct 29 11:13:02 incus1 ovsdb-server[1811]: ovs|00038|raft|INFO|rejecting term 2 < current term 4 received in vote_reply message from server 9d42
Oct 29 11:13:03 incus1 ovsdb-server[1806]: ovs|00037|reconnect|ERR|tcp:172.17.1.10:59148: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:03 incus1 ovsdb-server[1806]: ovs|00038|raft|INFO|term 4: 1215 ms timeout expired, starting election
Oct 29 11:13:03 incus1 ovsdb-server[1806]: ovs|00039|raft|INFO|rejecting term 2 < current term 4 received in vote_reply message from server 1bf7
Oct 29 11:13:03 incus1 ovsdb-server[1811]: ovs|00039|raft|INFO|rejecting term 3 < current term 4 received in vote_request message from server 9d42
Oct 29 11:13:03 incus1 ovsdb-server[1811]: ovs|00040|raft|INFO|rejecting term 3 < current term 4 received in append_request message from server 9d42
Oct 29 11:13:03 incus1 ovsdb-server[1806]: ovs|00040|reconnect|ERR|tcp:172.17.1.30:43422: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:03 incus1 ovsdb-server[1811]: ovs|00041|raft|INFO|term 5: 1007 ms timeout expired, starting election
Oct 29 11:13:03 incus1 ovsdb-server[1811]: ovs|00042|raft|INFO|rejecting term 3 < current term 5 received in vote_reply message from server 69d7
Oct 29 11:13:03 incus1 ovsdb-server[1806]: ovs|00041|raft|INFO|rejecting term 3 < current term 4 received in vote_request message from server 1bf7
Oct 29 11:13:03 incus1 ovsdb-server[1806]: ovs|00042|raft|INFO|rejecting term 3 < current term 4 received in append_request message from server 1bf7
Oct 29 11:13:04 incus1 ovsdb-server[1811]: ovs|00043|reconnect|ERR|tcp:172.17.1.10:58580: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:04 incus1 ovsdb-server[1806]: ovs|00043|raft|INFO|term 5: 1098 ms timeout expired, starting election
Oct 29 11:13:04 incus1 ovsdb-server[1806]: ovs|00044|raft|INFO|rejecting term 3 < current term 5 received in vote_reply message from server 91e9
Oct 29 11:13:04 incus1 ovsdb-server[1811]: ovs|00044|raft|INFO|server 9d42 is leader for term 5
Oct 29 11:13:04 incus1 ovsdb-server[1811]: ovs|00045|raft|INFO|rejecting append_request because previous entry 3,26 not in local log (mismatch past end of log)
Oct 29 11:13:05 incus1 ovsdb-server[1811]: ovs|00046|reconnect|ERR|tcp:172.17.1.30:50286: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:05 incus1 ovsdb-server[1811]: ovs|00047|reconnect|ERR|tcp:172.17.1.20:54780: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:05 incus1 ovsdb-server[1806]: ovs|00045|reconnect|ERR|tcp:172.17.1.20:34860: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:05 incus1 ovsdb-server[1806]: ovs|00046|reconnect|ERR|tcp:172.17.1.30:43428: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:05 incus1 ovsdb-server[1811]: ovs|00048|reconnect|ERR|tcp:172.17.1.20:57364: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:05 incus1 ovsdb-server[1811]: ovs|00049|reconnect|ERR|tcp:172.17.1.30:49656: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:05 incus1 ovsdb-server[1811]: ovs|00050|reconnect|ERR|tcp:172.17.1.10:47574: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:05 incus1 ovsdb-server[1806]: ovs|00047|raft|INFO|server 1bf7 is leader for term 5
Oct 29 11:13:05 incus1 ovsdb-server[1806]: ovs|00048|raft|INFO|rejecting append_request because previous entry 3,36 not in local log (mismatch past end of log)
Oct 29 11:13:06 incus1 ovsdb-server[1811]: ovs|00051|reconnect|ERR|tcp:172.17.1.30:58166: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:07 incus1 ovsdb-server[1806]: ovs|00049|reconnect|ERR|tcp:172.17.1.20:46220: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:07 incus1 ovsdb-server[1811]: ovs|00052|reconnect|ERR|tcp:172.17.1.20:54774: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:07 incus1 ovn-northd[1784]: ovs|00046|reconnect|ERR|tcp:172.17.1.10:6641: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:07 incus1 ovn-northd[1784]: ovs|00048|reconnect|ERR|tcp:172.17.1.10:6642: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:07 incus1 ovsdb-server[1806]: ovs|00050|reconnect|ERR|tcp:172.17.1.10:59154: no response to inactivity probe after 5 seconds, disconnecting
Oct 29 11:13:07 incus1 ovn-controller[1235]: ovs|00470|reconnect|ERR|tcp:172.17.1.20:6642: no response to inactivity probe after 5 seconds, disconnecting

After 10 minutes, no further log entries have appeared and the incus command is still hung.

Despite enp6s0 being DOWN and not having an IP at this point, I can still ping those 172.17.1.x addresses, maybe because the traffic reaches them via the 10.14.91.1 gateway on the other interface. I’m not sure.

Nothing comes to mind at the moment. I will have time next week to dig into this topic in detail. It is something I have been wanting to do anyway.

In the meantime, you might find some hints in this YouTube video. It is a bit old, but the important details are still there, and it still worked the last time I went through it.

I’ll have a look and see if there’s anything that doesn’t match my notes, thanks.

I can also mention that I tried changing the in-VM interface configuration for enp6s0 (used by OVN) from a static IP to DHCP, setting the address with ipv4.address on the interface config for the VMs instead, but this didn’t help. At least from looking at the logs, OVN looks like it’s working up until I run incus network create my-ovn --type=ovn, which leads to the interface going down and the command hanging. If I run incus network list on another node in the cluster, I can see the network listed as managed with IP ranges, but its state is “errored”. The logs indicate connection issues (unsurprisingly, since the interface is down).

Interestingly, enp6s0 is only down and lacking an IP on the VM where I try to create the network (incus1). It’s still up on the other two nodes, despite them also showing the network as errored.

I think I finally managed to get it to work, and probably the key point is this from the doc regarding the incus network create UPLINK command, emphasis mine:

Uplink interface

A high availability OVN cluster requires a shared layer 2 network,
so that the active OVN chassis can move between cluster members
(which effectively allows the OVN router’s external IP to be
reachable from a different host).

Therefore, you must specify either an unmanaged bridge interface or
an unused physical interface as the parent for the physical network
that is used for OVN uplink. The instructions assume that you are
using a manually created unmanaged bridge.

So in the end, what actually worked was using only one “regular” interface, creating a bridge on top of it with systemd-networkd, and setting a static IP on the bridge interface, instead of using separate NICs for Incus and OVN:

2: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP group default qlen 1000
    link/ether 00:16:3e:27:f7:89 brd ff:ff:ff:ff:ff:ff
3: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c6:35:89:92:48:8a brd ff:ff:ff:ff:ff:ff
    inet 172.17.1.10/16 brd 172.17.255.255 scope global br0
       valid_lft forever preferred_lft forever
    inet6 fe80::c435:89ff:fe92:488a/64 scope link
       valid_lft forever preferred_lft forever

I then use br0 as the parent interface for the UPLINK network, which is also used by the Incus cluster.
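
For anyone wanting to reproduce this, a minimal sketch of systemd-networkd units that would produce a layout like the one above. The file names are assumptions, the interface names and the address are taken from the ip output, and no Gateway= line is included because the gateway is not stated here. The script only writes the files to a scratch directory for review; nothing is installed into /etc/systemd/network.

```shell
#!/bin/sh
# Sketch only: writes example systemd-networkd units to a scratch directory.
# File names are assumptions; adjust interface names and address per host.
OUT_DIR="${OUT_DIR:-./networkd-sketch}"
mkdir -p "$OUT_DIR"

# The bridge device itself.
cat > "$OUT_DIR/10-br0.netdev" <<'EOF'
[NetDev]
Name=br0
Kind=bridge
EOF

# Attach the physical NIC to the bridge; it carries no address of its own.
cat > "$OUT_DIR/20-enp5s0.network" <<'EOF'
[Match]
Name=enp5s0

[Network]
Bridge=br0
EOF

# The static address lives on the bridge.
cat > "$OUT_DIR/30-br0.network" <<'EOF'
[Match]
Name=br0

[Network]
Address=172.17.1.10/16
EOF

ls "$OUT_DIR"
```

After reviewing the files, they would be copied to /etc/systemd/network on each host (with that host's own address) and systemd-networkd restarted.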

I still don’t quite understand though what an “unused physical interface” actually means in terms of interface configuration, if one would want to do that instead.
