[Solved] Snap auto update lxd 3.20 to 3.21 hangs on waiting for other cluster nodes to upgrade their versions

The possible edge case here is that the host is a cluster of one. `sudo lxd cluster list-database` correctly returns:

```
+----------------+
|    ADDRESS     |
+----------------+
| 10.41.0.3:8443 |
+----------------+
```

`snap changes` returns:

```
ID   Status   Spawn                    Ready   Summary
18   Doing    yesterday at 08:42 UTC   -       Auto-refresh snaps "core", "lxd"
```

`sudo tail /var/snap/lxd/common/lxd/logs/lxd.log` returns:

```
t=2020-02-20T22:08:40+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@00834: use of closed network connection"
t=2020-02-20T22:08:40+0000 lvl=info msg="Wait for other cluster nodes to upgrade their versions"
t=2020-02-20T22:09:40+0000 lvl=info msg="Initializing global database"
t=2020-02-20T22:09:40+0000 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 10.41.0.3:45902->10.41.0.3:8443: use of closed network connection"
t=2020-02-20T22:09:40+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@00834: use of closed network connection"
t=2020-02-20T22:09:41+0000 lvl=info msg="Wait for other cluster nodes to upgrade their versions"
t=2020-02-20T22:10:41+0000 lvl=info msg="Initializing global database"
t=2020-02-20T22:10:41+0000 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 10.41.0.3:45930->10.41.0.3:8443: use of closed network connection"
t=2020-02-20T22:10:41+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@00834: use of closed network connection"
t=2020-02-20T22:10:41+0000 lvl=info msg="Wait for other cluster nodes to upgrade their versions"
```

which has been repeating like this, roughly once a minute, since the snap 3.20-to-3.21 update.

`lxc list` and `lxc cluster list` hang.

Attempts to revert finally worked and LXD is back on 3.20. `snap changes` returns:

```
ID   Status   Spawn                      Ready                    Summary
18   Undone   2 days ago, at 08:42 UTC   yesterday at 23:28 UTC   Auto-refresh snaps "core", "lxd"
19   Done     yesterday at 23:28 UTC     yesterday at 23:31 UTC   Auto-refresh snaps "core", "lxd"
20   Done     yesterday at 23:54 UTC     yesterday at 23:54 UTC   Refresh "lxd" snap from "3.20/stable" channel
```

`lxc list` and `lxc cluster list` now work.

@freeekanayaka, any ideas?

What does:

```
lxd sql global "SELECT id, schema, api_extensions FROM nodes"
```

return?

`lxd sql global "SELECT id, schema, api_extensions FROM nodes"` returns:

```
+----+--------+----------------+
| id | schema | api_extensions |
+----+--------+----------------+
| 1  | 24     | 165            |
| 4  | 24     | 165            |
+----+--------+----------------+
```

Note: this is after reverting back to 3.20, if that matters.

id 4 is a leftover entry from a previous node that died (actually, it was reprovisioned in MAAS). It was removed from the cluster with `lxc cluster remove <name> --force`. All of this was done on 3.20, before the snap update to 3.21.

There are two nodes in that table; any idea why? You mentioned this being a single-node cluster, with `lxc cluster list` returning only one node.

I think you may have replied while I was editing my last post. This is likely related to a previous issue I posted:
https://discuss.linuxcontainers.org/t/lxd-fails-to-connect-to-global-database/6792/2

To clear that situation, I needed to perform a recovery as described in the LXD docs.
https://linuxcontainers.org/lxd/docs/master/clustering#disaster-recovery

Did you do this too?

> In order to permanently delete the cluster members that you have lost, you can run the command:

```
lxc cluster remove <name> --force
```

If yes, there might be a bug in the recovery process.

Yes. That exact command.

Are you absolutely sure that lxc cluster list returns only one node? That doesn’t seem possible if that table has more than one entry.

Note that `lxc cluster list` != `lxd cluster list-database` (the former lists all cluster members; the latter lists only the members holding a copy of the distributed database).

`lxc cluster list` returns:

```
+------------+------------------------+----------+--------+-------------------+--------------+
|    NAME    |          URL           | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+------------+------------------------+----------+--------+-------------------+--------------+
| lxd110maas | https://10.41.0.3:8443 | YES      | ONLINE | fully operational | x86_64       |
+------------+------------------------+----------+--------+-------------------+--------------+
```

Also, `lxd sql local "SELECT * FROM raft_nodes"` returns:

```
+----+----------------+------+
| id |    address     | role |
+----+----------------+------+
| 1  | 10.41.0.3:8443 | 0    |
+----+----------------+------+
```

What does

```
lxd sql global "SELECT * FROM nodes"
```

return?

I’m wondering if the extra node is pending.

Yep. `lxd sql global "SELECT * FROM nodes"` returns:

```
+----+------------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
| id |    name    | description |    address     | schema | api_extensions |           heartbeat            | pending | arch |
+----+------------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
| 1  | lxd110maas |             | 10.41.0.3:8443 | 24     | 165            | 2020-02-21T16:25:24.260859041Z | 0       | 2    |
| 4  | lxd110h01  |             | :8443          | 24     | 165            | 2020-02-12T18:01:07Z           | 1       | 2    |
+----+------------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
```

Okay, you have a pending node. Do you recall trying to add a node to the cluster and it failing midway?

I think we don’t handle this situation quite well. You’ll probably need to run this command:

```
lxd sql global "DELETE FROM nodes WHERE id=4"
```

in order to get rid of that leftover node.

We’ll need a fix on our part to ignore pending nodes at startup.
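To make the failure mode concrete: the "wait for other cluster nodes" loop can be pictured as a check that every row in the global `nodes` table has reached the new schema and API-extension levels, and a dead, pending row can never get there. Here is a minimal sketch of that idea and of the suggested cleanup, using an in-memory SQLite table shaped like the output above. This is an illustration, not LXD's actual code, and the 3.21 schema/API numbers used are hypothetical.

```python
import sqlite3

# Toy model of the relevant columns of LXD's global "nodes" table.
# Imagine the state mid-refresh: the live node has already bumped its
# row to the new levels, while the dead pending node is stuck behind.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes (id INTEGER, name TEXT, schema INTEGER,"
           " api_extensions INTEGER, pending INTEGER)")
db.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?)", [
    (1, "lxd110maas", 25, 170, 0),  # this host, already upgraded
    (4, "lxd110h01",  24, 165, 1),  # leftover pending node, never upgrades
])

TARGET_SCHEMA, TARGET_API = 25, 170  # hypothetical 3.21 levels

def all_nodes_upgraded(db):
    """Sketch of the startup wait: every row must reach the new levels."""
    behind = db.execute(
        "SELECT COUNT(*) FROM nodes WHERE schema < ? OR api_extensions < ?",
        (TARGET_SCHEMA, TARGET_API)).fetchone()[0]
    return behind == 0

print(all_nodes_upgraded(db))  # row 4 keeps the upgrade waiting forever

db.execute("DELETE FROM nodes WHERE id=4")  # the suggested cleanup
print(all_nodes_upgraded(db))  # startup can now proceed
```

With the stale row present the check loops forever, which matches the repeating "Wait for other cluster nodes to upgrade their versions" log lines; once it is deleted, only rows that can actually upgrade remain.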

I’d have to say ‘yes’ with high probability. I’ve been playing with MAAS to provision nodes which will join a cluster. There have been ‘many’ trials where the provisioning has failed.

```
lxd sql global "DELETE FROM nodes WHERE id=4"
```

followed by:

```
sudo snap refresh --channel=stable lxd
```

successfully upgraded to 3.21.

Thanks for the help.