I have a 7-node Incus cluster with Ceph (MicroCeph) storage running on top of RPi 4B SBCs.
A month ago I implemented rotation of short-lived Incus cluster certificates, which worked well until two days ago. However, the cluster certificate was not renewed a couple of days ago, and as a result the Incus cluster lost the connection between nodes. The error was: Failed to send heartbeat request: Put "https://192.168.82.6:8443/internal/database": tls: failed to verify certificate: x509: certificate has expired or is not yet valid.
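For reference, the expiry can be confirmed directly on a member by inspecting the cluster certificate with plain openssl (standard Incus path):

# print the notBefore/notAfter dates of the current cluster certificate
openssl x509 -in /var/lib/incus/cluster.crt -noout -dates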
So the cluster API partially stopped working and I was not able to install the refreshed certificate using incus cluster update-certificate, so I uploaded the refreshed cluster certificate to /var/lib/incus/cluster.* and restarted the incus service on 3 nodes to check whether it would help. However, the cluster stopped communicating after that. All nodes show the same error: level=error msg="Failed connecting to global database" attempt=25 err="failed to create cowsql connection: no available cowsql leader server found"
I’m interested in whether I can restore the cluster without shutting down the running instances.
If that’s not possible, how do I reinstall the cluster and import the instances from the Ceph and CephFS storage pools?
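(I assume the interactive incus admin recover tool would be the way to re-import instances from the existing pools after a reinstall; roughly:

# run on a cluster member once the Ceph/CephFS pools are reachable again;
# the tool interactively scans existing pools and re-imports the volumes/instances it finds
incus admin recover

but I'd prefer not to go that route.)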
You should be able to manually go and replace cluster.crt and cluster.key on all your servers, then run systemctl restart incus to restart Incus and have it pick up the new cert.
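A minimal sketch of that, assuming the refreshed certificate and key are available locally and the member names below are illustrative:

# push the refreshed cluster certificate and key to every member, then restart Incus
for host in member1 member2 member3; do
    scp cluster.crt cluster.key root@${host}:/var/lib/incus/
    ssh root@${host} systemctl restart incus
done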
This was the first step I took. The cluster certificate and key were updated on all nodes. However, after restarting the incus service on them, the cluster nodes lost connection to each other.
Now all local and remote command-line calls end with a timeout. For example:
Call from a remote client:
incus list -v --debug
DEBUG [2024-11-28T11:33:22-08:00] Connecting to a remote Incus over HTTPS url="https://picl:8443"
DEBUG [2024-11-28T11:33:22-08:00] Sending request to Incus etag= method=GET url="https://picl:8443/1.0"
Error: Get "https://picl:8443/1.0": EOF
Call from the node’s console:
incus cluster list
Error: Get "http://unix.socket/1.0": EOF
All nodes log the same error: level=error msg="Failed connecting to global database" attempt=6 err="failed to create cowsql connection: no available cowsql leader server found"
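These messages can be followed on each member via the daemon journal (unit name as used with systemctl above):

# follow the Incus daemon log on a member
journalctl -u incus -n 50 -f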
P.S.
I upgraded Incus from 6.5 to 6.7 a day before the issue happened, so I suspect the issue may be related to changes between 6.5 and 6.7. Were any changes to cluster certificate validation made between 6.5 and 6.7? I use a DNS alias for the Incus cluster, and the cluster certificate includes SANs for this alias (short and fully qualified DNS names), but not for the individual host names.
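To double-check which names the certificate actually covers, the SANs can be listed with openssl:

# show the Subject Alternative Name extension of the cluster certificate
openssl x509 -in /var/lib/incus/cluster.crt -noout -text | grep -A1 'Subject Alternative Name'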
As all members lost quorum, I had to execute the "Recover from quorum loss" procedure.
I had thought about this from the beginning of the investigation, but the procedure's warning made me postpone executing it while I looked for other options:
You should run this command only if you are *absolutely* certain that this is
the only database node left in your cluster AND that other database nodes will
never come back (i.e. their daemon won't ever be started again).
This will make this server the only member of the cluster, and it won't
be possible to perform operations on former cluster members anymore.
However, everything went well, and all nodes came back online after I executed this procedure on one node.
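For anyone hitting the same situation, the core of the documented procedure is a single command run on one member while Incus is stopped everywhere else; the exact subcommand below is my best guess at the current CLI spelling, so verify it against the official "Recover from quorum loss" documentation first:

# stop the Incus daemon on every member
systemctl stop incus
# on one chosen member only, recover the global database
# (subcommand name is an assumption - check the Incus cluster recovery docs)
incus admin cluster recover-from-quorum-loss
# start Incus again on that member, then on the remaining members
systemctl start incus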
Now the cluster looks healthy and all nodes are online.