Setting up IncusOS cluster with a separate internal network

I am trying to configure an IncusOS cluster where the internal cluster address (`cluster.https_address`) is on a different network than the client-facing API address (`core.https_address`). I am following this tutorial, but I am running into x509 certificate validation problems. All servers are configured with one public IPv4 network (10.0.0.0/24) that is used by Incus clients and another private network (fd00:10::/64) that I want to use for internal cluster communication. (The server names and IP addresses have been modified.)

% incus cluster join my-cluster: server2:
What IP address or DNS name should be used to reach this server? [default=10.0.0.12]: fd00:10::12
What member name should be used to identify this server in the cluster? [default=4c4c4544-0043-5410-8033-c8c04f503034]: server2
All existing data is lost when joining a cluster, continue? (yes/no) [default=no] yes
Error connecting to existing cluster member "[fd00:10::11]:8443": Get "https://[fd00:10::11]:8443": Unable to connect to: [fd00:10::11]:8443 ([dial tcp [fd00:10::11]:8443: i/o timeout])
Error: Failed to join cluster: Failed to setup cluster trust: Failed to add server cert to cluster: Post "https://[fd00:10::11]:8443/1.0/certificates": tls: failed to verify certificate: x509: cannot validate certificate for fd00:10::11 because it doesn't contain any IP SANs

How do I resolve this problem? Do I need to provide certificates for the internal addresses with the installation seed?
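In case a custom certificate would help here, this is how I would generate one whose IP SANs cover both networks (purely a sketch with the addresses from my setup; I don't know whether the IncusOS seed actually accepts a custom server certificate):

```shell
# Self-signed server certificate with IP SANs for both the public and
# the internal address (requires OpenSSL 1.1.1+ for -addext):
openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
  -keyout server.key -out server.crt \
  -subj "/CN=server1" \
  -addext "subjectAltName=IP:10.0.0.11,IP:fd00:10::11"
```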

Initially, I tried to create a cluster using Operations Center, but this also fails when I use a network with role `cluster` that is different from the network with role `management`. I am not sure how else to specify which network should be used for `cluster.https_address`.

Since I cannot change `cluster.https_address` after the cluster is created, I need to provide the final settings during cluster creation.

So, on my first node server1 I changed `core.https_address` from the default value `:8443` to the node's IP, 10.0.0.11:8443, and now I get a different error when trying to join the cluster:

Error: Failed to join cluster: Failed to setup cluster trust: Failed to add server cert to cluster: Post "https://[fd00:10::11]:8443/1.0/certificates": Unable to connect to: [fd00:10::11]:8443 ([dial tcp [fd00:10::11]:8443: connect: connection refused])

OK, I restarted the incus application on the first node:

incus admin os application restart incus

and I now get the original error message:

tls: failed to verify certificate: x509: cannot validate certificate for fd00:10::11 because it doesn't contain any IP SANs

Can you run incus cluster list my-cluster:?

I’ve seen that join error before when the CLI doesn’t have a direct route to the joining server’s address, but it didn’t actually prevent the join for me.

I see. Does this mean that `cluster.https_address` of server1 must be reachable from the client from which I run the incus commands? In my case, the fd00:10::/64 network is completely isolated. (It is managed by a switch without external internet connectivity.)
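For reference, this is how I checked reachability from the client (addresses are the ones from my setup; any HTTP status code means the port answered, while a curl failure means the address is unreachable):

```shell
# Check from the client which of the two addresses answers on port 8443.
# curl prints the HTTP status code on success; on a timeout or refused
# connection it exits non-zero and the fallback message is printed.
for addr in "10.0.0.11:8443" "[fd00:10::11]:8443"; do
    curl -ks -m 5 -o /dev/null -w "%{http_code}\n" "https://$addr/1.0" \
        || echo "unreachable: $addr"
done
```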

I already wiped my cluster. I repeated the installation from scratch and I think I managed to get the cluster formed, despite the final error message:

% incus config set server1: cluster.https_address=[fd00:10::11]:8443 # internal network
% incus cluster enable server1: server1
Clustering enabled
% incus remote add my-cluster 10.0.0.11:8443 # use address reachable from the client
Certificate fingerprint: 198b620cb1b2f3b6aae5085c9e83bd8204ca110ab55091b9e496d55c32514866
ok (y/n/[fingerprint])? y
% incus remote rm server1
% incus cluster list my-cluster:
+---------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| NAME    |            URL             |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATUS |      MESSAGE      |
+---------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| server1 | https://[fd00:10::11]:8443 | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|         |                            | database        |              |                |             |        |                   |
+---------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
% incus cluster join my-cluster: server2:
What IP address or DNS name should be used to reach this server? [default=10.0.0.12]: fd00:10::12 # !! This will be set as `cluster.https_address` of `server2` !! 
What member name should be used to identify this server in the cluster? [default=4c4c4544-0044-5410-8033-b2c04f503034]: server2
All existing data is lost when joining a cluster, continue? (yes/no) [default=no] yes
Error connecting to existing cluster member "[fd00:10::11]:8443": Get "https://[fd00:10::11]:8443": Unable to connect to: [fd00:10::11]:8443 ([dial tcp [fd00:10::11]:8443: i/o timeout])
% incus cluster list my-cluster:
+---------+----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| NAME    |            URL             |      ROLES       | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATUS |      MESSAGE      |
+---------+----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| server1 | https://[fd00:10::11]:8443 | database-leader  | x86_64       | default        |             | ONLINE | Fully operational |
|         |                            | database         |              |                |             |        |                   |
+---------+----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| server2 | https://[fd00:10::12]:8443 | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+---------+----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
% incus cluster join my-cluster: server3:
# ...

Despite the error, the cluster appears to be operational: I was able to launch containers after adding the local ZFS pool and the bridge network.

Last time, I set `cluster.https_address` on server2 directly before joining the cluster, which probably caused the certificate problems.

Would it still be possible to make `incus cluster join` work without errors when `cluster.https_address` is not reachable from the incus client? Also, I guess setting up a cluster from Operations Center fails because this address is not reachable?

Yeah, that lines up with what I’ve seen before.

Basically, the CLI attempts to confirm the cluster is functional at the end, but it can’t connect to the new server anymore because its certificate has changed to the cluster-wide one.

But that happens after everything else has succeeded, so the cluster is still perfectly fine.

We don’t include any names or IP addresses in our certificates; we perform exact certificate matching instead and ignore all fields. You get that kind of weird error when the certificates aren’t a perfect match, which here is likely because the server you joined is now responding with the cluster-wide certificate.
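As a quick check, the fingerprint shown by `incus remote add` is the SHA-256 digest of the DER-encoded certificate, so you can compute it from any PEM file (`server.crt` is a placeholder path) and compare it with what the client recorded:

```shell
# SHA-256 fingerprint of a PEM certificate, in the same lowercase,
# colon-free form that `incus remote add` displays:
openssl x509 -in server.crt -outform der | sha256sum | cut -d' ' -f1
```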

I think that in my last setup the error is not due to the certificates, but occurs because the client tries to connect to server1 using its `cluster.https_address`, which is unreachable from the client since the internal network is isolated.

It is not clear to me why Incus is trying to contact individual cluster members directly after the cluster is formed. Wouldn’t it make more sense to confirm that the cluster is functional using the cluster remote address?
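For example, the final health check could go through the remote the client already uses, something like this (a sketch with the remote and member names from my setup; every request here targets the cluster remote's `core.https_address`, which the client can reach):

```shell
# Both commands talk only to the my-cluster remote, not to the members'
# cluster.https_address, so they work from an isolated client network:
incus cluster list my-cluster:
incus cluster show my-cluster: server2
```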

I think it’s basically a race condition. Part of joining the cluster is sending a request and then waiting for an operation to complete. If the request makes enough progress before we manage to attach to the operation, we get the connection error.

I’ve tried to reproduce it locally with some VMs and haven’t been able to, likely because network latency is low enough to hide the problem.

I repeated the steps a few times and the error is consistently triggered in my case, also when using VMs. Are you sure that in your test the internal network was not reachable from the client? If it is reachable, there is no error.

I described the detailed steps of my setup here:

Also, even if it is a race condition, I do not think the client should ever send a request to remote hosts using cluster.https_address.