I have a 7-node Incus cluster with Ceph (MicroCeph) storage running on top of RPi 4B SBCs.
A month ago I implemented rotation of short-lived Incus cluster certificates, which worked well until two days ago. However, the cluster certificate was not renewed a couple of days ago, and as a result the Incus cluster lost the connection between nodes. The error was: Failed to send heartbeat request: Put "https://192.168.82.6:8443/internal/database": tls: failed to verify certificate: x509: certificate has expired or is not yet valid.
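For reference, the expiry can be confirmed directly on a member by inspecting the cluster certificate with plain openssl (standard Incus path):

# print the notBefore/notAfter dates of the current cluster certificate
openssl x509 -in /var/lib/incus/cluster.crt -noout -dates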
So the cluster API partially stopped working and I was not able to install the refreshed certificate using incus cluster update-certificate, so I uploaded the refreshed cluster certificate to /var/lib/incus/cluster.* and restarted the incus service on 3 nodes to check whether it would help. However, the cluster stopped communicating after that. All nodes show the same error: level=error msg="Failed connecting to global database" attempt=25 err="failed to create cowsql connection: no available cowsql leader server found"
I’m interested in whether I can restore the cluster without shutting down the running instances.
If that’s not possible, how do I reinstall the cluster and import the instances from the Ceph and CephFS storage pools?
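(I assume the interactive incus admin recover tool would be the way to re-import instances from the existing pools after a reinstall; roughly:

# run on a cluster member once the Ceph/CephFS pools are reachable again;
# the tool interactively scans existing pools and re-imports the volumes/instances it finds
incus admin recover

but I'd prefer not to go that route.)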
You should be able to manually go and replace cluster.crt and cluster.key on all your servers, then run systemctl restart incus to restart Incus and have it pick up the new cert.
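A minimal sketch of that, assuming the refreshed certificate and key are available locally and the member names below are illustrative:

# push the refreshed cluster certificate and key to every member, then restart Incus
for host in member1 member2 member3; do
    scp cluster.crt cluster.key root@${host}:/var/lib/incus/
    ssh root@${host} systemctl restart incus
done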
This was the first step I took. The cluster certificate and key were updated on all nodes. However, after restarting the incus service on them, the cluster nodes lost connection to each other.
Now all local and remote command-line calls end with a timeout. For example:
Call from a remote client:
incus list -v --debug
DEBUG [2024-11-28T11:33:22-08:00] Connecting to a remote Incus over HTTPS url="https://picl:8443"
DEBUG [2024-11-28T11:33:22-08:00] Sending request to Incus etag= method=GET url="https://picl:8443/1.0"
Error: Get "https://picl:8443/1.0": EOF
Call from the node’s console:
incus cluster list
Error: Get "http://unix.socket/1.0": EOF
All nodes log the same error: level=error msg="Failed connecting to global database" attempt=6 err="failed to create cowsql connection: no available cowsql leader server found"
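These messages can be followed on each member via the daemon journal (unit name as used with systemctl above):

# follow the Incus daemon log on a member
journalctl -u incus -n 50 -f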
P.S.
I upgraded Incus from 6.5 to 6.7 a day before the issue happened, so I suspect the issue may be related to changes between 6.5 and 6.7. Were any changes to cluster certificate validation made between 6.5 and 6.7? I use a DNS alias for the Incus cluster, and the cluster certificate includes SANs for this alias (short and fully qualified DNS names), but not for the individual host names.
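To double-check which names the certificate actually covers, the SANs can be listed with openssl:

# show the Subject Alternative Name extension of the cluster certificate
openssl x509 -in /var/lib/incus/cluster.crt -noout -text | grep -A1 'Subject Alternative Name'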
As all members lost quorum, I had to execute the "Recover from quorum loss" procedure.
I had thought about this from the beginning of the investigation, but the procedure's warning made me postpone executing it while I looked for other options:
You should run this command only if you are *absolutely* certain that this is
the only database node left in your cluster AND that other database nodes will
never come back (i.e. their daemon won't ever be started again).
This will make this server the only member of the cluster, and it won't
be possible to perform operations on former cluster members anymore.
However, everything went well, and all nodes came back online after I executed this procedure on one node.
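For anyone hitting the same situation, the core of the documented procedure is a single command run on one member while Incus is stopped everywhere else; the exact subcommand below is my best guess at the current CLI spelling, so verify it against the official "Recover from quorum loss" documentation first:

# stop the Incus daemon on every member
systemctl stop incus
# on one chosen member only, recover the global database
# (subcommand name is an assumption - check the Incus cluster recovery docs)
incus admin cluster recover-from-quorum-loss
# start Incus again on that member, then on the remaining members
systemctl start incus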
Now the cluster looks healthy and all nodes are online.