Upgrading an Incus Cluster While a Member Node Is Down Causes All Incus Commands to Hang

kojiwell · August 9, 2024, 5:16am

After upgrading Incus from 6.0 to 6.3 while a member node was dead, the cluster became unable to start. It seems that the available nodes are waiting for the dead node to come back. I see the log message below repeated.

time="2024-08-09T12:45:28+09:00" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://172.16.40.24:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="172.16.40.24:8443"

There are also logs like these:

time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.17:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.17:8443\": dial tcp 172.16.40.17:8443: connect: connection refused"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.18:8443: no known leader"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.19:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.19:8443\": dial tcp 172.16.40.19:8443: connect: connection refused"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.20:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.20:8443\": dial tcp 172.16.40.20:8443: connect: connection refused"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.21:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.21:8443\": dial tcp 172.16.40.21:8443: connect: connection refused"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.22:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.22:8443\": dial tcp 172.16.40.22:8443: connect: connection refused"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.23:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.23:8443\": dial tcp 172.16.40.23:8443: connect: connection refused"
time="2024-08-09T14:27:20+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.24:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.24:8443\": dial tcp 172.16.40.24:8443: connect: no route to host"

The incusd service on the available nodes won’t start, so I cannot perform incus cluster remove --force NODE. Does anyone know how to remove the node and get the cluster to start in this situation?

stgraber · August 9, 2024, 6:07am

incus admin sql global "UPDATE nodes SET schema=73, api_extensions=406"

That will make the database report all servers as being on 6.3, so the upgraded members stop waiting for the dead one.
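For anyone adapting this, here is a minimal sketch of how the same fix could be wrapped with a before/after check. It is not part of the original reply. The values 73 and 406 are the ones given above for 6.3 and will differ for other releases. The `name` and `address` columns in the SELECT are an assumption about the `nodes` table layout, and the `APPLY` variable is a hypothetical guard added here so the script only prints the statements unless explicitly told to run them.

```shell
#!/bin/sh
# Sketch only: inspect the nodes table, force the version columns, inspect again.

# Values taken from the reply above (Incus 6.3). Other releases use other numbers.
SCHEMA=73
API_EXTENSIONS=406

# Assumption: the nodes table also exposes name and address columns.
select_sql="SELECT name, address, schema, api_extensions FROM nodes"
update_sql="UPDATE nodes SET schema=${SCHEMA}, api_extensions=${API_EXTENSIONS}"

# APPLY is a hypothetical safety switch, not an Incus option.
if [ "${APPLY:-0}" = "1" ] && command -v incus >/dev/null 2>&1; then
  incus admin sql global "$select_sql"   # show what each member currently reports
  incus admin sql global "$update_sql"   # mark every member as upgraded
  incus admin sql global "$select_sql"   # confirm the rows now match
else
  # Dry run: print the statements so they can be reviewed first.
  echo "$select_sql"
  echo "$update_sql"
fi
```

Run it once without `APPLY` to review the statements, then with `APPLY=1` on one of the upgraded members. Once the cluster is back up, the dead node can be dropped with incus cluster remove --force NODE as originally intended.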

kojiwell · August 9, 2024, 6:18am

Thank you so much! That command solved the problem.