Incus Cluster non-responsive

I have a three-node cluster. After upgrading to the latest Incus 6.2 components, all incus commands hang and never return; I tried "incus list" and "incus cluster show". My cluster servers all run Ubuntu 24.04, and I had just applied all pending OS updates, which seems to be what triggered the issue. I shut down all three nodes and then rebooted them one at a time. Everything comes up fine: Incus is started and the incus daemon is running without error, but no containers start and all incus commands hang. Here's all I can find:

sudo systemctl status incus*
Warning: The unit file, source configuration file or drop-ins of incus-startup.service changed on disk. Run 'systemctl >
Warning: The unit file, source configuration file or drop-ins of incus-lxcfs.service changed on disk. Run 'systemctl da>
Warning: The unit file, source configuration file or drop-ins of incus.socket changed on disk. Run 'systemctl daemon-re>
Warning: The unit file, source configuration file or drop-ins of incus-user.socket changed on disk. Run 'systemctl daem>
× incus-startup.service - Incus - Startup check
     Loaded: loaded (/usr/lib/systemd/system/incus-startup.service; enabled; preset: enabled)
     Active: failed (Result: timeout) since Mon 2024-07-08 05:25:11 UTC; 23min ago
   Main PID: 1139 (code=killed, signal=TERM)
        CPU: 200ms

Jul 08 05:15:11 vmscloud-incus systemd[1]: Starting incus-startup.service - Incus - Startup check...
Jul 08 05:25:11 vmscloud-incus systemd[1]: incus-startup.service: start operation timed out. Terminating.
Jul 08 05:25:11 vmscloud-incus systemd[1]: incus-startup.service: Main process exited, code=killed, status=15/TERM
Jul 08 05:25:11 vmscloud-incus systemd[1]: incus-startup.service: Failed with result 'timeout'.
Jul 08 05:25:11 vmscloud-incus systemd[1]: Failed to start incus-startup.service - Incus - Startup check.

● incus-lxcfs.service - Incus - LXCFS daemon
     Loaded: loaded (/usr/lib/systemd/system/incus-lxcfs.service; enabled; preset: enabled)
     Active: active (running) since Mon 2024-07-08 05:15:11 UTC; 33min ago
   Main PID: 1170 (lxcfs)
      Tasks: 3 (limit: 19044)
     Memory: 1.1M (peak: 1.5M)
        CPU: 27ms
     CGroup: /system.slice/incus-lxcfs.service
             └─1170 /opt/incus/bin/lxcfs /var/lib/incus-lxcfs

Jul 08 05:15:11 vmscloud-incus lxcfs[1170]: - proc_loadavg
Jul 08 05:15:11 vmscloud-incus lxcfs[1170]: - proc_meminfo
Jul 08 05:15:11 vmscloud-incus lxcfs[1170]: - proc_stat


What’s in /var/log/incus/incusd.log and can you also show ps fauxww | grep incusd for good measure?

Sorry, lost power for 2 days with the hurricane.

Here's /var/log/incus/incusd.log:

TTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-09T20:37:58Z" level=warning msg="Dqlite: attempt 2: server 172.16.1.219:8443: no known leader"
time="2024-07-09T20:38:01Z" level=warning msg="Dqlite: attempt 2: server incus-member:8443: dial: Failed connecting to HTTP endpoint \"incus-member:8443\": dial tcp 172.16.1.74:8443: connect: no route to host"
time="2024-07-09T20:38:01Z" level=warning msg="Dqlite: attempt 2: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-09T20:38:01Z" level=warning msg="Dqlite: attempt 3: server 172.16.1.219:8443: no known leader"
time="2024-07-09T20:38:04Z" level=warning msg="Dqlite: attempt 3: server incus-member:8443: dial: Failed connecting to HTTP endpoint \"incus-member:8443\": dial tcp 172.16.1.74:8443: connect: no route to host"
time="2024-07-09T20:38:04Z" level=warning msg="Dqlite: attempt 3: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-09T20:38:05Z" level=warning msg="Dqlite: attempt 4: server 172.16.1.219:8443: no known leader"
time="2024-07-09T20:38:06Z" level=warning msg="Dqlite: attempt 4: server incus-member:8443: dial: Failed connecting to HTTP endpoint \"incus-member:8443\": dial tcp 172.16.1.74:8443: i/o timeout"
time="2024-07-09T20:38:06Z" level=warning msg="Dqlite: attempt 4: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp: lookup vmsfog-incus: i/o timeout"
time="2024-07-09T20:38:07Z" level=error msg="Failed connecting to global database" attempt=8 err="failed to create cowsql connection: no available cowsql leader server found"
time="2024-07-09T20:38:09Z" level=warning msg="Dqlite: attempt 1: server 172.16.1.219:8443: no known leader"
time="2024-07-09T20:38:12Z" level=warning msg="Dqlite: attempt 1: server incus-member:8443: dial: Failed connecting to HTTP endpoint \"incus-member:8443\": dial tcp 172.16.1.74:8443: connect: no route to host"
time="2024-07-09T20:38:12Z" level=warning msg="Dqlite: attempt 1: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-09T20:38:13Z" level=warning msg="Dqlite: attempt 2: server 172.16.1.219:8443: no known leader"
time="2024-07-09T20:38:16Z" level=warning msg="Dqlite: attempt 2: server incus-member:8443: dial: Failed connecting to HTTP endpoint \"incus-member:8443\": dial tcp 172.16.1.74:8443: connect: no route to host"
time="2024-07-09T20:38:16Z" level=warning msg="Dqlite: attempt 2: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-09T20:38:16Z" level=warning msg="Dqlite: attempt 3: server 172.16.1.219:8443: no known leader"
time="2024-07-09T20:38:19Z" level=warning msg="Dqlite: attempt 3: server incus-member:8443: dial: Failed connecting to HTTP endpoint \"incus-member:8443\": dial tcp 172.16.1.74:8443: connect: no route to host"
time="2024-07-09T20:38:19Z" level=warning msg="Dqlite: attempt 3: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-09T20:38:20Z" level=error msg="Failed connecting to global database" attempt=9 err="failed to create cowsql connection: no available cowsql leader server found"
time="2024-07-09T20:38:22Z" level=warning msg="Dqlite: attempt 1: server 172.16.1.219:8443: no known leader"
time="2024-07-09T20:38:23Z" level=warning msg="Dqlite: attempt 1: server incus-member:8443: dial: Failed connecting to HTTP endpoint \"incus-member:8443\": dial tcp 172.16.1.74:8443: connect: no route to host"
time="2024-07-09T20:38:23Z" level=warning msg="Dqlite: attempt 1: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-09T20:38:23Z" level=warning msg="Dqlite: attempt 2: server 172.16.1.219:8443: no known leader"
time="2024-07-09T20:38:26Z" level=warning msg="Dqlite: attempt 2: server incus-member:8443: dial: Failed connecting to HTTP endpoint \"incus-member:8443\": dial tcp 172.16.1.74:8443: connect: no route to host"
time="2024-07-09T20:38:26Z" level=warning msg="Dqlite: attempt 2: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-09T20:38:27Z" level=warning msg="Dqlite: attempt 3: server 172.16.1.219:8443: no known leader"
root@vmscloud-incus:/home/scott# 

Also here’s the ps…

root@vmscloud-incus:/home/scott# ps fauxww | grep incusd
root        5569  0.4  0.6 6642776 112380 ?      Ssl  20:36   0:01 incusd --group incus-admin --logfile /var/log/incus/incusd.log
root        5570  0.0  0.2 6086696 42880 ?       Ssl  20:36   0:00 incusd waitready --timeout=600
root        5791  0.0  0.0   6544  2304 pts/1    S+   20:42   0:00                              \_ grep --color=auto incusd

Okay, so it’s trying to connect to both 172.16.1.74 and 172.16.1.65 as the other two systems.
The first isn't responding at all, while the second appears to either not have Incus running or not have it reachable over the network.

I’d probably focus on getting 172.16.1.74 back online first if possible.
On 172.16.1.65, it'd be useful to check whether the incusd process is running at all and, if it is, look at its log.
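
A quick way to check that on 172.16.1.65 would be something along these lines (unit name and log path as on your first node; adjust if yours differ):

systemctl status incus
ps fauxww | grep incusd
tail -n 50 /var/log/incus/incusd.log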

Over on 172.16.1.74, here's the log:

TTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-08T11:14:23Z" level=warning msg="Dqlite: attempt 9: server 172.16.1.219:8443: no known leader"
time="2024-07-08T11:14:23Z" level=warning msg="Dqlite: attempt 9: server incus-member:8443: no known leader"
time="2024-07-08T11:14:23Z" level=warning msg="Dqlite: attempt 9: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-08T11:14:24Z" level=warning msg="Dqlite: attempt 10: server 172.16.1.219:8443: no known leader"
time="2024-07-08T11:14:24Z" level=warning msg="Dqlite: attempt 10: server incus-member:8443: no known leader"
time="2024-07-08T11:14:24Z" level=warning msg="Dqlite: attempt 10: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-08T11:14:25Z" level=warning msg="Dqlite: attempt 11: server 172.16.1.219:8443: no known leader"
time="2024-07-08T11:14:25Z" level=warning msg="Dqlite: attempt 11: server incus-member:8443: no known leader"
time="2024-07-08T11:14:25Z" level=warning msg="Dqlite: attempt 11: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-08T11:14:26Z" level=warning msg="Dqlite: attempt 12: server 172.16.1.219:8443: no known leader"
time="2024-07-08T11:14:26Z" level=warning msg="Dqlite: attempt 12: server incus-member:8443: no known leader"
time="2024-07-08T11:14:26Z" level=warning msg="Dqlite: attempt 12: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-08T11:14:27Z" level=error msg="Failed connecting to global database" attempt=25 err="failed to create cowsql connection: no available cowsql leader server found"

Also on 172.16.1.74:

scott@incus-member:~$ sudo systemctl status incus*
[sudo] password for scott: 
● incus.service - Incus - Daemon
     Loaded: loaded (/usr/lib/systemd/system/incus.service; indirect; preset: enabled)
     Active: activating (start-post) since Mon 2024-07-08 12:33:05 UTC; 3min 56s ago
TriggeredBy: ● incus.socket
   Main PID: 3665 (incusd); Control PID: 3666 (incusd)
      Tasks: 29
     Memory: 51.9M (peak: 53.2M)
        CPU: 3.062s
     CGroup: /system.slice/incus.service
             ├─3665 incusd --group incus-admin --logfile /var/log/incus/incusd.log
             └─.control
               └─3666 incusd waitready --timeout=600

Jul 08 12:36:58 incus-member incusd[3665]: time="2024-07-08T12:36:58Z" level=warning msg="Dqlite: attempt 9: server vms>
Jul 08 12:36:59 incus-member incusd[3665]: time="2024-07-08T12:36:59Z" level=warning msg="Dqlite: attempt 10: server 17>
Jul 08 12:36:59 incus-member incusd[3665]: time="2024-07-08T12:36:59Z" level=warning msg="Dqlite: attempt 10: server in>
Jul 08 12:36:59 incus-member incusd[3665]: time="2024-07-08T12:36:59Z" level=warning msg="Dqlite: attempt 10: server vm>
Jul 08 12:37:00 incus-member incusd[3665]: time="2024-07-08T12:37:00Z" level=warning msg="Dqlite: attempt 11: server 17>
Jul 08 12:37:00 incus-member incusd[3665]: time="2024-07-08T12:37:00Z" level=warning msg="Dqlite: attempt 11: server in>
Jul 08 12:37:00 incus-member incusd[3665]: time="2024-07-08T12:37:00Z" level=warning msg="Dqlite: attempt 11: server vm>
Jul 08 12:37:01 incus-member incusd[3665]: time="2024-07-08T12:37:01Z" level=warning msg="Dqlite: attempt 12: server 17>
Jul 08 12:37:01 incus-member incusd[3665]: time="2024-07-08T12:37:01Z" level=warning msg="Dqlite: attempt 12: server in>
Jul 08 12:37:01 incus-member incusd[3665]: time="2024-07-08T12:37:01Z" level=warning msg="Dqlite: attempt 12: server vm>

● incus.socket - Incus - Daemon (unix socket)
     Loaded: loaded (/usr/lib/systemd/system/incus.socket; enabled; preset: enabled)
     Active: active (running) since Mon 2024-07-08 06:18:16 UTC; 6h ago
   Triggers: ● incus.service

On 172.16.1.65, here's the log:

TTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-08T11:14:23Z" level=warning msg="Dqlite: attempt 9: server 172.16.1.219:8443: no known leader"
time="2024-07-08T11:14:23Z" level=warning msg="Dqlite: attempt 9: server incus-member:8443: no known leader"
time="2024-07-08T11:14:23Z" level=warning msg="Dqlite: attempt 9: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-08T11:14:24Z" level=warning msg="Dqlite: attempt 10: server 172.16.1.219:8443: no known leader"
time="2024-07-08T11:14:24Z" level=warning msg="Dqlite: attempt 10: server incus-member:8443: no known leader"
time="2024-07-08T11:14:24Z" level=warning msg="Dqlite: attempt 10: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-08T11:14:25Z" level=warning msg="Dqlite: attempt 11: server 172.16.1.219:8443: no known leader"
time="2024-07-08T11:14:25Z" level=warning msg="Dqlite: attempt 11: server incus-member:8443: no known leader"
time="2024-07-08T11:14:25Z" level=warning msg="Dqlite: attempt 11: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-08T11:14:26Z" level=warning msg="Dqlite: attempt 12: server 172.16.1.219:8443: no known leader"
time="2024-07-08T11:14:26Z" level=warning msg="Dqlite: attempt 12: server incus-member:8443: no known leader"
time="2024-07-08T11:14:26Z" level=warning msg="Dqlite: attempt 12: server vmsfog-incus:8443: dial: Failed connecting to HTTP endpoint \"vmsfog-incus:8443\": dial tcp 172.16.1.65:8443: connect: connection refused"
time="2024-07-08T11:14:27Z" level=error msg="Failed connecting to global database" attempt=25 err="failed to create cowsql connection: no available cowsql leader server found"

And also on 172.16.1.65:

scott@vmsfog-incus:~$ sudo systemctl status incus*
[sudo] password for scott: 
● incus-lxcfs.service - Incus - LXCFS daemon
     Loaded: loaded (/usr/lib/systemd/system/incus-lxcfs.service; enabled; preset: enabled)
     Active: active (running) since Mon 2024-07-08 06:20:58 UTC; 1 day 19h ago
   Main PID: 1161 (lxcfs)
      Tasks: 3 (limit: 19044)
     Memory: 1.1M (peak: 1.5M)
        CPU: 21ms
     CGroup: /system.slice/incus-lxcfs.service
             └─1161 /opt/incus/bin/lxcfs /var/lib/incus-lxcfs

Jul 08 06:20:58 vmsfog-incus lxcfs[1161]: - proc_loadavg
Jul 08 06:20:58 vmsfog-incus lxcfs[1161]: - proc_meminfo
Jul 08 06:20:58 vmsfog-incus lxcfs[1161]: - proc_stat
Jul 08 06:20:58 vmsfog-incus lxcfs[1161]: - proc_swaps
Jul 08 06:20:58 vmsfog-incus lxcfs[1161]: - proc_uptime
Jul 08 06:20:58 vmsfog-incus lxcfs[1161]: - proc_slabinfo
Jul 08 06:20:58 vmsfog-incus lxcfs[1161]: - shared_pidns
Jul 08 06:20:58 vmsfog-incus lxcfs[1161]: - cpuview_daemon
Jul 08 06:20:58 vmsfog-incus lxcfs[1161]: - loadavg_daemon
Jul 08 06:20:58 vmsfog-incus lxcfs[1161]: - pidfds

● incus-user.socket - Incus - Daemon (user unix socket)
     Loaded: loaded (/usr/lib/systemd/system/incus-user.socket; enabled; preset: enabled)
     Active: active (listening) since Mon 2024-07-08 06:20:57 UTC; 1 day 19h ago
   Triggers: ● incus-user.service
     Listen: /var/lib/incus/unix.socket.user (Stream)
      Tasks: 0 (limit: 19044)
     Memory: 0B (peak: 256.0K)

I’m pretty confused as to why 172.16.1.65 is trying to connect to itself here.

Can you show netstat -lnp | grep 8443 on 172.16.1.65?
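
If netstat isn't installed on that machine, the iproute2 equivalent should show the same information:

ss -lntp | grep 8443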

The kicker is that it just broke one day without a reboot or anything I could identify.

root@vmsfog-incus:/home/scott# netstat -lnp | grep 8443
tcp        0      0 127.0.1.1:8443          0.0.0.0:*               LISTEN      8525/incusd       

Yeah, so that’s your problem right there.

I suspect you've configured your cluster to use DNS names for the various machines, which isn't really a problem in itself. The problem is that on this particular machine, when you do getent hosts vmsfog-incus, it will most likely return 127.0.1.1.

Now what's happening is that when Incus starts up, it looks at its config and sees that it's supposed to listen on vmsfog-incus:8443. It then resolves vmsfog-incus, which gives it 127.0.1.1, and so it listens on that address. As that's a loopback address, nothing else in your cluster can connect to it, causing your current mess.
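
For illustration, on a stock Ubuntu install where /etc/hosts maps the hostname to 127.0.1.1, the lookup would look roughly like this (hypothetical output matching your setup):

$ getent hosts vmsfog-incus
127.0.1.1       vmsfog-incus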

I had to solve a similar issue for a customer this morning, and the easiest fix should basically be:

echo "UPDATE config SET value=':8443' WHERE key='core.https_address'" | sqlite3 /var/lib/incus/database/local.db

This assumes that you had a local core.https_address key set. It will then set it to :8443 instead of vmsfog-incus:8443, which will cause Incus to actually listen on non-loopback addresses.

You should run the above, then kill -9 the incusd process to force it to restart, and then check with netstat -lnp that it's indeed listening on more than just 127.0.1.1.
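
Put together, a rough sequence on vmsfog-incus would look something like this (the SELECT is just to confirm the current value first, and the pgrep pattern matches the main daemon from your ps output):

echo "SELECT value FROM config WHERE key='core.https_address';" | sqlite3 /var/lib/incus/database/local.db
echo "UPDATE config SET value=':8443' WHERE key='core.https_address';" | sqlite3 /var/lib/incus/database/local.db
kill -9 $(pgrep -f 'incusd --group')
netstat -lnp | grep 8443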

Correct. I did configure my cluster with DNS names, and I even put entries in the /etc/hosts files of all cluster members as a backup in case DNS failed. I did what you said and I was able to issue incus commands again.

I had to perform the same operation on the other node as well to get past the following. The obvious questions are: how did this occur on an operational cluster, and is using DNS names ill-advised?

scott@vmsfog-incus:~$ incus cluster list
+----------------+---------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
|      NAME      |            URL            |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATUS  |                                    MESSAGE                                    |
+----------------+---------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| incus-member   | https://incus-member:8443 | database        | x86_64       | default        |             | OFFLINE | No heartbeat for 47h4m46.693248198s (2024-07-08 04:19:56.497102283 +0000 UTC) |
+----------------+---------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| vmscloud-incus | https://172.16.1.219:8443 | database-leader | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
|                |                           | database        |              |                |             |         |                                                                               |
+----------------+---------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+
| vmsfog-incus   | https://vmsfog-incus:8443 | database        | x86_64       | default        |             | ONLINE  | Fully operational                                                             |
+----------------+---------------------------+-----------------+--------------+----------------+-------------+---------+-------------------------------------------------------------------------------+

I personally tend to stay away from having critical services like Incus, OVN or Ceph depend on DNS being functional, but that's also because my DNS infrastructure usually runs on top of Incus and I need to avoid that kind of circular dependency.

Having /etc/hosts entries is fine and does avoid that issue, but my recommendation there would be to have separate entries for the hostname and for the FQDN and use the latter for the cluster.

So say you have a server with hostname foo and FQDN foo.example.net, then you’d have 127.0.1.1 foo and 192.0.2.10 foo.example.net.
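
As a concrete sketch, /etc/hosts on that server would then contain something like:

127.0.1.1    foo
192.0.2.10   foo.example.net

and the cluster would be configured to use foo.example.net:8443 rather than foo:8443.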

That way you still have the normal behavior of having the hostname resolve to 127.0.1.1 while being able to use the FQDN in situations where you do need the externally reachable address.

Those are helpful points that I had not considered.