After upgrading Incus from 6.0 to 6.3 while one member node was dead, the cluster became unable to start. It seems that the available nodes are waiting for the dead node to come back. The following log entry repeats:
time="2024-08-09T12:45:28+09:00" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://172.16.40.24:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="172.16.40.24:8443"
There are also logs like these:
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.17:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.17:8443\": dial tcp 172.16.40.17:8443: connect: connection refused"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.18:8443: no known leader"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.19:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.19:8443\": dial tcp 172.16.40.19:8443: connect: connection refused"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.20:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.20:8443\": dial tcp 172.16.40.20:8443: connect: connection refused"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.21:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.21:8443\": dial tcp 172.16.40.21:8443: connect: connection refused"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.22:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.22:8443\": dial tcp 172.16.40.22:8443: connect: connection refused"
time="2024-08-09T14:27:16+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.23:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.23:8443\": dial tcp 172.16.40.23:8443: connect: connection refused"
time="2024-08-09T14:27:20+09:00" level=warning msg="Dqlite: attempt 1: server 172.16.40.24:8443: dial: Failed connecting to HTTP endpoint \"172.16.40.24:8443\": dial tcp 172.16.40.24:8443: connect: no route to host"
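To make the Dqlite warnings above easier to read, here is a rough sketch that tallies the failure reason per server. It assumes the exact log format shown above. The names DQLITE_RE, classify and summarize are hypothetical helpers of my own, not part of Incus. Going by the messages, only 172.16.40.18 answers (but reports no known leader), 172.16.40.24 is the dead node (no route to host), and the others refuse connections, presumably because incusd is not running on them.

```python
import re

# Matches the "Dqlite: attempt N: server ADDR: DETAIL" warnings shown above.
# This pattern is an assumption based only on the log lines in this post.
DQLITE_RE = re.compile(
    r"Dqlite: attempt \d+: server (?P<addr>\d+(?:\.\d+){3}:\d+): (?P<detail>.*)"
)

# Failure reasons that appear in the logs above.
KNOWN_REASONS = ("connection refused", "no route to host", "no known leader")


def classify(detail):
    """Reduce a long warning detail to one of the known short reasons."""
    for reason in KNOWN_REASONS:
        if reason in detail:
            return reason
    return "other"


def summarize(lines):
    """Return {server address: last failure reason seen} for Dqlite warnings."""
    status = {}
    for line in lines:
        match = DQLITE_RE.search(line)
        if match:
            status[match.group("addr")] = classify(match.group("detail"))
    return status
```

Passing the log lines to summarize() (for example, the lines of a saved log file) returns one entry per server address, so the three kinds of failure are visible at a glance.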
The incusd service on the available nodes won’t start, so I cannot run incus cluster remove --force NODE. Does anyone know how to remove the dead node and get the cluster to start in this situation?