Incus upgrade didn't complete successfully and quorum recovery attempt failed in this scenario

I ran an apt dist-upgrade on all Incus hosts at near enough the same time for the recent OpenSSH vulnerability, and the cluster is no longer accessible. Fortunately the instances themselves seem to still be running. How can I manipulate the database to get the cluster back up and let hosts 5 & 6 complete their upgrade successfully?

There are 6 physical servers, but #2 isn't in the cluster yet, so it is missing from the Incus cluster numbering.

From the apt log:

  • incus:amd64 (1:6.2-202406131848-ubuntu22.04, 1:6.2-202406290147-ubuntu22.04)

All using this repo:

  • https://pkgs.zabbly.com/incus/stable

Hosts that failed to upgrade, timing out at 20% during the --configure stage (resuming this step is sketched after the list):

  • #5
  • #6
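
Restarting the upgrade on those hosts amounts to resuming the stuck configure step, roughly along these lines (generic apt/dpkg commands, shown for context rather than copied from a transcript):

# On host5 / host6: finish configuring any half-installed packages,
# then let apt retry the failed upgrade step
dpkg --configure -a
apt-get dist-upgrade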

I followed the quorum recovery procedure for this on host3, but it didn't help, so I tried host4, which also didn't help.
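
From memory, the attempt looked something like this; the exact subcommand name is as I recall it from the Incus cluster recovery docs, so treat it as an assumption rather than a verified transcript:

# On the single member being recovered (host3, then later host4),
# with the daemon stopped while the command runs
systemctl stop incus.service incus.socket
incus admin cluster recover-from-quorum-loss
systemctl start incus.socket incus.service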

I did try restarting:

  • incus.service
  • incus.socket

and the apt processes on hosts 5 & 6, all at the same time, in the hope they would agree on a new leader.
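
i.e. on each of those hosts, at roughly the same moment:

# Restart the daemon and its socket together
systemctl restart incus.socket incus.service
# (on hosts 5 & 6, the stalled apt upgrade was kicked again instead)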

To summarise…

Upgraded hosts:

  • host1
  • host3
  • host4

Hosts failed to complete upgrade:

  • host5
  • host6

Hosts with only themselves in local raft_nodes:

  • host3
  • host4

Hosts where incus cluster list hangs after the service starts (or fails to start):

  • host1
  • host5
  • host6

Hosts that show themselves as database-leader and online in the incus cluster list table below:

  • host3
  • host4
+-------+-----------------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| NAME  |              URL                  |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATUS  |                                    MESSAGE                                    |
+-------+-----------------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| host1 | https://host1.somedomain.net:8443 |                 | x86_64       | default        |             | OFFLINE | No heartbeat for 3h58m31.746395297s (2024-07-02 04:45:37.050266109 +0000 UTC) |
+-------+-----------------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| host3 | https://host3.somedomain.net:8443 |                 | x86_64       | default        |             | OFFLINE | No heartbeat for 4h0m42.353291518s (2024-07-02 04:43:26.443367025 +0000 UTC)  |
+-------+-----------------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| host4 | https://host4.somedomain.net:8443 | database-leader | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
|       |                                   | database        |              |                |             |         |                                                                               |
+-------+-----------------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| host5 | https://host5.somedomain.net:8443 |                 | x86_64       | default        |             | OFFLINE | No heartbeat for 3h57m23.767476486s (2024-07-02 04:46:45.029172297 +0000 UTC) |
+-------+-----------------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| host6 | https://host6.somedomain.net:8443 |                 | x86_64       | default        |             | OFFLINE | No heartbeat for 3h57m23.760268303s (2024-07-02 04:46:45.036397443 +0000 UTC) |
+-------+-----------------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+

The following hosts now have an empty local raft_nodes table as a result of the quorum recovery attempt:

  • host3
  • host4

whereas all the remaining hosts show the same values:

+----+---------------------------+------+-------+
| id |               address     | role | name  |
+----+---------------------------+------+-------+
| 5  | host5.somedomain.net:8443 | 0    | host5 |
| 6  | host3.somedomain.net:8443 | 1    | host3 |
| 7  | host1.somedomain.net:8443 | 2    | host1 |
| 8  | host4.somedomain.net:8443 | 0    | host4 |
| 9  | host6.somedomain.net:8443 | 0    | host6 |
+----+---------------------------+------+-------+
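
For reference, these values come from querying the local database on each member, e.g.:

# Show the local raft_nodes table on a member
incus admin sql local "SELECT * FROM raft_nodes"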

/etc/hosts entries are in place on all of these hosts (in case the DNS instances running on them are down), so name lookup isn't an issue, and the addresses are static, so nothing has changed there.
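
For illustration, the entries look roughly like this on each host (addresses as they appear in the logs below, host2 omitted; reconstructed rather than copied verbatim):

# /etc/hosts (excerpt)
192.168.123.5    host1.somedomain.net host1
192.168.123.9    host3.somedomain.net host3
192.168.123.11   host4.somedomain.net host4
192.168.123.13   host5.somedomain.net host5
192.168.123.17   host6.somedomain.net host6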

host1 incusd.log:

time="2024-07-02T09:08:26Z" level=warning msg="Dqlite: attempt 5: server host1.somedomain.net:8443: no known leader"
time="2024-07-02T09:08:26Z" level=warning msg="Dqlite: attempt 5: server host3.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host3.somedomain.net:8443\": dial tcp 192.168.123.9:8443: connect: connection refused"
time="2024-07-02T09:08:26Z" level=warning msg="Dqlite: attempt 5: server host4.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host4.somedomain.net:8443\": dial tcp 192.168.123.11:8443: connect: connection refused"
time="2024-07-02T09:08:26Z" level=warning msg="Dqlite: attempt 5: server host5.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host5.somedomain.net:8443\": dial tcp 192.168.123.13:8443: connect: connection refused"
time="2024-07-02T09:08:26Z" level=warning msg="Dqlite: attempt 5: server host6.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host6.somedomain.net:8443\": dial tcp 192.168.123.17:8443: connect: connection refused"

host3 incusd.log:

time="2024-07-02T09:11:06Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host1.somedomain.net:8443/internal/database\": dial tcp 192.168.123.5:8443: connect: connection refused" remote="host1.somedomain.net:8443"
time="2024-07-02T09:11:07Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host5.somedomain.net:8443/internal/database\": dial tcp 192.168.123.13:8443: connect: connection refused" remote="host5.somedomain.net:8443"
time="2024-07-02T09:11:10Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host5.somedomain.net:8443/internal/database\": dial tcp 192.168.123.13:8443: connect: connection refused" remote="host5.somedomain.net:8443"
time="2024-07-02T09:11:15Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host4.somedomain.net:8443/internal/database\": dial tcp 192.168.123.11:8443: connect: connection refused" remote="host4.somedomain.net:8443"
time="2024-07-02T09:11:15Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host6.somedomain.net:8443/internal/database\": dial tcp 192.168.123.17:8443: connect: connection refused" remote="host6.somedomain.net:8443"
time="2024-07-02T09:11:16Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host1.somedomain.net:8443/internal/database\": dial tcp 192.168.123.5:8443: connect: connection refused" remote="host1.somedomain.net:8443"
time="2024-07-02T09:11:20Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host4.somedomain.net:8443/internal/database\": dial tcp 192.168.123.11:8443: connect: connection refused" remote="host4.somedomain.net:8443"

host4 incusd.log:

time="2024-07-02T09:14:03Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host3.somedomain.net:8443/internal/database\": dial tcp 192.168.123.9:8443: connect: connection refused" remote="host3.somedomain.net:8443"
time="2024-07-02T09:14:06Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host1.somedomain.net:8443/internal/database\": dial tcp 192.168.123.5:8443: connect: connection refused" remote="host1.somedomain.net:8443"
time="2024-07-02T09:14:11Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host6.somedomain.net:8443/internal/database\": dial tcp 192.168.123.17:8443: connect: connection refused" remote="host6.somedomain.net:8443"
time="2024-07-02T09:14:12Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host5.somedomain.net:8443/internal/database\": dial tcp 192.168.123.13:8443: connect: connection refused" remote="host5.somedomain.net:8443"
time="2024-07-02T09:14:13Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://host3.somedomain.net:8443/internal/database\": dial tcp 192.168.123.9:8443: connect: connection refused" remote="host3.somedomain.net:8443"

host5 incusd.log:

time="2024-07-02T09:15:02Z" level=warning msg="Dqlite: attempt 7: server host1.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host1.somedomain.net:8443\": dial tcp 192.168.123.5:8443: connect: connection refused"
time="2024-07-02T09:15:02Z" level=warning msg="Dqlite: attempt 7: server host3.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host3.somedomain.net:8443\": dial tcp 192.168.123.9:8443: connect: connection refused"
time="2024-07-02T09:15:02Z" level=warning msg="Dqlite: attempt 7: server host4.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host4.somedomain.net:8443\": dial tcp 192.168.123.11:8443: connect: connection refused"
time="2024-07-02T09:15:02Z" level=warning msg="Dqlite: attempt 7: server host5.somedomain.net:8443: no known leader"
time="2024-07-02T09:15:02Z" level=warning msg="Dqlite: attempt 7: server host6.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host6.somedomain.net:8443\": dial tcp 192.168.123.17:8443: connect: connection refused"

host6 incusd.log:

time="2024-07-02T09:15:54Z" level=warning msg="Dqlite: attempt 4: server host1.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host1.somedomain.net:8443\": dial tcp 192.168.123.5:8443: connect: connection refused"
time="2024-07-02T09:15:54Z" level=warning msg="Dqlite: attempt 4: server host3.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host3.somedomain.net:8443\": dial tcp 192.168.123.9:8443: connect: connection refused"
time="2024-07-02T09:15:54Z" level=warning msg="Dqlite: attempt 4: server host4.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host4.somedomain.net:8443\": dial tcp 192.168.123.11:8443: connect: connection refused"
time="2024-07-02T09:15:54Z" level=warning msg="Dqlite: attempt 4: server host5.somedomain.net:8443: dial: Failed connecting to HTTP endpoint \"host5.somedomain.net:8443\": dial tcp 192.168.123.13:8443: connect: connection refused"
time="2024-07-02T09:15:54Z" level=warning msg="Dqlite: attempt 4: server host6.somedomain.net:8443: no known leader"

UPDATE:
I've tried repopulating the local raft_nodes table on host3 and host4 with what was present in the other hosts' local.db, based on the output of incus admin sql local .dump (roughly as sketched after this list), and:

  • restarting incus.service & incus.socket on hosts 1, 3 & 4
  • restarting the upgrades on hosts 5 & 6

but this hasn’t helped.
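
The repopulation itself was done with statements along these lines (values as in the raft_nodes table shown earlier; the SQL is reconstructed from memory rather than a transcript):

# On host3 / host4: re-insert the members into the emptied local raft_nodes table
incus admin sql local "INSERT INTO raft_nodes (id, address, role, name) VALUES
  (5, 'host5.somedomain.net:8443', 0, 'host5'),
  (6, 'host3.somedomain.net:8443', 1, 'host3'),
  (7, 'host1.somedomain.net:8443', 2, 'host1'),
  (8, 'host4.somedomain.net:8443', 0, 'host4'),
  (9, 'host6.somedomain.net:8443', 0, 'host6');"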

Thanks

I'm not sure why this was in our /etc/hosts, but it was the cause: