One of the cluster members gets stuck when I execute incus network create aovn --type=ovn

Hello everyone,
Creating a cluster with an OVN network is giving me a headache. I followed the OVN-with-clustering documentation, but something is wrong. After running the UPLINK commands below, I entered the following command:
incus network create aovn --type=ovn
and the tnode1 member got stuck and the cluster can't be reached anymore. Can someone enlighten me?
Regards.

incus network create UPLINK --type=physical parent=end0 --target=tnode[1-3]
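
(The tnode[1-3] above is just shorthand; --target takes a single member, so I ran it once per node, roughly like this, with end0 as the parent interface on all three:)

incus network create UPLINK --type=physical parent=end0 --target=tnode1
incus network create UPLINK --type=physical parent=end0 --target=tnode2
incus network create UPLINK --type=physical parent=end0 --target=tnode3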

incus network create UPLINK --type=physical \
ipv4.ovn.ranges=172.16.10.100-172.16.10.200 \
ipv6.ovn.ranges=fd42:4242:4242:1000::100-fd42:4242:4242:1000::200 \
ipv4.gateway=172.16.10.1/24 \
ipv6.gateway=fd42:4242:4242:1000::1/64 \
dns.nameservers=8.8.8.8,1.1.1.1
root@tnode3:/etc/netplan# incus network show UPLINK
config:
  dns.nameservers: 8.8.8.8,1.1.1.1
  ipv4.gateway: 172.16.10.1/24
  ipv4.ovn.ranges: 172.16.10.100-172.16.10.200
  ipv6.gateway: fd42:4242:4242:1000::1/64
  ipv6.ovn.ranges: fd42:4242:4242:1000::100-fd42:4242:4242:1000::200
  volatile.last_state.created: "false"
description: ""
name: UPLINK
type: physical
used_by: []
managed: true
status: Created
locations:
- tnode2
- tnode3
- tnode1
project: default
ubuntu@tnode2:~$ incus network ls
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
|      NAME      |   TYPE   | MANAGED |      IPV4      |           IPV6           | DESCRIPTION | USED BY |  STATE  |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| UPLINK         | physical | YES     |                |                          |             | 0       | CREATED |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| aovn           | ovn      | YES     | 10.18.106.1/24 | fd42:64a5:c64:63eb::1/64 |             | 0       | ERRORED |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| br-int         | bridge   | NO      |                |                          |             | 0       |         |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| end0           | physical | NO      |                |                          |             | 1       |         |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| genev_sys_6081 | unknown  | NO      |                |                          |             | 0       |         |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| lo             | loopback | NO      |                |                          |             | 0       |         |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| ovs-system     | unknown  | NO      |                |                          |             | 0       |         |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
root@tnode3:/etc/netplan# incus network show aovn
config:
  bridge.mtu: "1442"
  ipv4.address: 10.18.106.1/24
  ipv4.nat: "true"
  ipv6.address: fd42:64a5:c64:63eb::1/64
  ipv6.nat: "true"
  network: UPLINK
  volatile.network.ipv4.address: 172.16.10.100
  volatile.network.ipv6.address: fd42:4242:4242:1000::100
description: ""
name: aovn
type: ovn
used_by: []
managed: true
status: Errored
locations:
- tnode2
- tnode3
- tnode1
project: default

Cluster status

ubuntu@tnode2:~$ incus cluster ls
+--------+----------------------------+-----------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
|  NAME  |            URL             |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATUS  |                                   MESSAGE                                    |
+--------+----------------------------+-----------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| tnode1 | https://192.168.1.201:8443 | database        | aarch64      | default        |             | OFFLINE | No heartbeat for 19m26.037627423s (2025-06-29 07:31:25.81357484 +0300 +0300) |
+--------+----------------------------+-----------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| tnode2 | https://192.168.1.202:8443 | database        | aarch64      | default        |             | ONLINE  | Fully operational                                                            |
+--------+----------------------------+-----------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| tnode3 | https://192.168.1.203:8443 | database-leader | aarch64      | default        |             | ONLINE  | Fully operational                                                            |
|        |                            | database        |              |                |             |         |                                                                              |
+--------+----------------------------+-----------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+

Here is the monitor message.

WARNING[2025-06-29T07:40:03+03:00] [tnode3] Cluster member isn't responding      name=tnode1
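
(For reference, I was watching the events with something like this; the exact flags are from memory:)

incus monitor --type=logging --pretty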

P.S.
The tnode1 network interface switched to another name (end0 → end1). I suppose I fixed that issue, but my newly created network still looks like this.

+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
|      NAME      |   TYPE   | MANAGED |      IPV4      |           IPV6           | DESCRIPTION | USED BY |  STATE  |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| UPLINK         | physical | YES     |                |                          |             | 0       | CREATED |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| aovn           | ovn      | YES     | 10.18.106.1/24 | fd42:64a5:c64:63eb::1/64 |             | 0       | ERRORED |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| br-int         | bridge   | NO      |                |                          |             | 0       |         |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| end0           | physical | NO      |                |                          |             | 1       |         |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| genev_sys_6081 | unknown  | NO      |                |                          |             | 0       |         |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| lo             | loopback | NO      |                |                          |             | 0       |         |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
| ovs-system     | unknown  | NO      |                |                          |             | 0       |         |
+----------------+----------+---------+----------------+--------------------------+-------------+---------+---------+
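
(For the rename itself, I believe the member-specific parent can be checked and, if needed, repointed with something like this; I'm not sure it's still necessary once the interface is named end0 again:)

incus network show UPLINK --target=tnode1
incus network set UPLINK parent=end1 --target=tnode1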

I have looked almost everywhere but can't find anything wrong; I could use some help.
Node spec:
Linux tnode1 6.1.0-1025-rockchip #25-Ubuntu SMP Mon Aug 26 23:01:14 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
Regards.

You should check that (a quick way to verify each item is sketched after the list):

  • Each server can ping the two others
  • The MTU is the same on all network interfaces
  • All servers have their clocks in sync (within 1s of each other)
  • None of their disks are full
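
Something along these lines on each member would cover those checks; the address, interface name, and path below are just the ones from this thread:

ping -c 3 192.168.1.202                           # can reach the other members
ip link show end0 | grep -o 'mtu [0-9]*'          # MTU should match on every node
timedatectl | grep 'System clock synchronized'    # clocks within ~1s of each other
df -h /var/lib/incus                              # make sure the disk isn't full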

Thanks for the post. There were some errors on the incus service recently, so I restarted the incus service on each host (tnode1, tnode2, tnode3). But now the error message has changed to the following.

ubuntu@tnode1:~$ incus network create aovn --type=ovn
Error: Failed loading network: Failed to connect to OVN: failed to connect to 192.168.1.201: failed to open connection: unknown network protocol
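
(For what it's worth, the "unknown network protocol" part makes me suspect the northbound connection string is missing its tcp: or ssl: prefix. As I understand it, it should be set with a protocol prefix roughly like this; the IPs are my three members and 6641 is the default northbound port:)

incus config set network.ovn.northbound_connection tcp:192.168.1.201:6641,tcp:192.168.1.202:6641,tcp:192.168.1.203:6641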

I checked all the items on your list, and everything seems alright.
With sudo I can ping, but as the ubuntu user, ping prints a strange error.

ubuntu@tnode1:~$ ping 192.168.1.202
ping: socktype: SOCK_RAW
ping: socket: Operation not permitted
ping: => missing cap_net_raw+p capability or setuid?
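
(It looks like the ping binary lost its file capability; I believe it can be put back with something like this, though I doubt it is related to the cluster problem:)

sudo setcap cap_net_raw+ep /usr/bin/ping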

And I have been getting this message recently:

Jul 02 23:31:59 tnode1 incusd[1860]: time="2025-07-02T23:31:59+03:00" level=warning msg="Cluster member isn't responding" name=tnode3

I think I have messed up the cluster. I removed tnode3 from the cluster with this command.

incus cluster remove tnode3 --force

Now I want to add it again, but I get this reply:

Error: Failed to join cluster: This server is already clustered
ubuntu@tnode1:~$ incus cluster ls
+--------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
|  NAME  |            URL             |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATUS |      MESSAGE      |
+--------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| tnode1 | https://192.168.1.201:8443 | database-leader | aarch64      | default        |             | ONLINE | Fully operational |
|        |                            | database        |              |                |             |        |                   |
+--------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| tnode2 | https://192.168.1.202:8443 |                 | aarch64      | default        |             | ONLINE | Fully operational |
+--------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+

I could be wrong, but I think if you want to re-add a node after forcibly removing it, you need to wipe it down first, i.e. remove /var/lib/incus and restart incus (obviously you lose anything that was on that node) … then do incus admin init to re-join it.
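
Roughly, on the removed node, assuming the native package with its default unit names and paths (this destroys all local Incus state on that node):

sudo systemctl stop incus.service incus.socket
sudo rm -rf /var/lib/incus
sudo systemctl start incus.socket incus.service
sudo incus admin init     # choose to join the existing cluster when prompted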

I have installed a clustered OVN setup on Incus VMs on another host and the same thing occurred. I wonder whether this is the right output or not. On node1, the sudo ovn-nbctl show command outputs this:

switch eeec5605-e7eb-40d6-8ba6-955609a9b559 (incus-net3-ls-ext)
    port incus-net3-ls-ext-lsp-router
        type: router
        router-port: incus-net3-lr-lrp-ext
    port incus-net3-ls-ext-lsp-provider
        type: localnet
        addresses: ["unknown"]
switch 27307a0b-f930-44b3-8ef6-dbb100b91a7e (incus-net3-ls-int)
    port incus-net3-ls-int-lsp-router
        type: router
        router-port: incus-net3-lr-lrp-int
router a79ef8a2-3c7b-49aa-908f-8ddbee5fd5e9 (incus-net3-lr)
    port incus-net3-lr-lrp-int
        mac: "10:66:6a:e8:05:c8"
        networks: ["10.116.223.1/24", "fd42:68ce:ec0c:f649::1/64"]
    port incus-net3-lr-lrp-ext
        mac: "10:66:6a:e8:05:c8"
        networks: ["192.0.2.100/24"]
    nat 423a89fc-d052-4d2c-b44d-a10d793562d9
        external ip: "192.0.2.100"
        logical ip: "10.116.223.0/24"
        type: "snat"

But on the other nodes, tovn2 and tovn3, the output looks like this:

root@tovn2:~# sudo ovn-nbctl show
ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed ()
root@tovn3:~# sudo ovn-nbctl show
ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed ()

Is this normal, or did I miss something while configuring?
Regards.

OK, so what does your raft look like for a start? Are all the nodes synced?

ovs-appctl -t /run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
ovs-appctl -t /run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

It's actually a bit confusing, but you can only directly query the database leader, so that query will only work on one node. To have ovn-nbctl work generically you need to pass it --db and your connection string. To simplify things on my setup, I add this to my .bashrc:

alias ovn-north="ovn-nbctl --db tcp:192.168.2.1:6641,tcp:192.168.2.2:6641,tcp:192.168.2.3:6641"
alias ovn-south="ovn-sbctl --db tcp:192.168.2.1:6642,tcp:192.168.2.2:6642,tcp:192.168.2.3:6642"

Then use "ovn-north show", for example. As I understand it, ovn-nbctl will take the connection string, work out which node is the leader, then automatically query that particular node. (You can see which one is the leader from the ovs-appctl commands.)