[Solved] Snap auto update lxd 3.20 to 3.21 hangs on waiting for other cluster nodes to upgrade their versions

The possible edge case here is that the host is a cluster of one. `sudo lxd cluster list-database` correctly returns:

```
+----------------+
|    ADDRESS     |
+----------------+
| 10.41.0.3:8443 |
+----------------+
```

`snap changes` returns:

```
ID   Status   Spawn                    Ready   Summary
18   Doing    yesterday at 08:42 UTC   -       Auto-refresh snaps "core", "lxd"
```

`sudo tail /var/snap/lxd/common/lxd/logs/lxd.log` returns:

```
t=2020-02-20T22:08:40+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@00834: use of closed network connection"
t=2020-02-20T22:08:40+0000 lvl=info msg="Wait for other cluster nodes to upgrade their versions"
t=2020-02-20T22:09:40+0000 lvl=info msg="Initializing global database"
t=2020-02-20T22:09:40+0000 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 10.41.0.3:45902->10.41.0.3:8443: use of closed network connection"
t=2020-02-20T22:09:40+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@00834: use of closed network connection"
t=2020-02-20T22:09:41+0000 lvl=info msg="Wait for other cluster nodes to upgrade their versions"
t=2020-02-20T22:10:41+0000 lvl=info msg="Initializing global database"
t=2020-02-20T22:10:41+0000 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 10.41.0.3:45930->10.41.0.3:8443: use of closed network connection"
t=2020-02-20T22:10:41+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@00834: use of closed network connection"
t=2020-02-20T22:10:41+0000 lvl=info msg="Wait for other cluster nodes to upgrade their versions"
```

which has been repeating like this, roughly once a minute, since the snap 3.20-to-3.21 update.

`lxc list` and `lxc cluster list` hang.

Attempts to revert finally worked and LXD is back on 3.20. `snap changes` returns:

```
ID   Status   Spawn                      Ready                    Summary
18   Undone   2 days ago, at 08:42 UTC   yesterday at 23:28 UTC   Auto-refresh snaps "core", "lxd"
19   Done     yesterday at 23:28 UTC     yesterday at 23:31 UTC   Auto-refresh snaps "core", "lxd"
20   Done     yesterday at 23:54 UTC     yesterday at 23:54 UTC   Refresh "lxd" snap from "3.20/stable" channel
```

`lxc list` and `lxc cluster list` now work.

@freeekanayaka, any ideas?

What does:

```
lxd sql global "SELECT id, schema, api_extensions FROM nodes"
```

return?

`lxd sql global "SELECT id, schema, api_extensions FROM nodes"` returns:

```
+----+--------+----------------+
| id | schema | api_extensions |
+----+--------+----------------+
| 1  | 24     | 165            |
| 4  | 24     | 165            |
+----+--------+----------------+
```

Note: this is after reverting back to 3.20, if that matters.

id 4 is a leftover entry from a previous node that died (actually, it was reprovisioned in MAAS). It was removed from the cluster with `lxc cluster remove <name> --force`. All of this was done on 3.20, before the snap update to 3.21.

There are two nodes in that table; any idea why? You mentioned this being a single-node cluster, with `lxc cluster list` returning only one node.

I think you may have replied while I was editing my last post. This is likely related to a previous issue I posted:
https://discuss.linuxcontainers.org/t/lxd-fails-to-connect-to-global-database/6792/2

To clear that situation, I needed to perform a recovery as described in the LXD docs.
https://linuxcontainers.org/lxd/docs/master/clustering#disaster-recovery

Did you do this too?

> In order to permanently delete the cluster members that you have lost, you can run the command:

```
lxc cluster remove <name> --force
```

If yes, there might be a bug in the recovery process.

Yes. That exact command.

Are you absolutely sure that lxc cluster list returns only one node? That doesn’t seem possible if that table has more than one entry.

Note that `lxc cluster list` != `lxd cluster list-database` (the former lists all cluster members; the latter lists only the members holding a copy of the distributed database).

`lxc cluster list` returns:

```
+------------+------------------------+----------+--------+-------------------+--------------+
|    NAME    |          URL           | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+------------+------------------------+----------+--------+-------------------+--------------+
| lxd110maas | https://10.41.0.3:8443 | YES      | ONLINE | fully operational | x86_64       |
+------------+------------------------+----------+--------+-------------------+--------------+
```

Also, `lxd sql local "SELECT * FROM raft_nodes"` returns:

```
+----+----------------+------+
| id |    address     | role |
+----+----------------+------+
| 1  | 10.41.0.3:8443 | 0    |
+----+----------------+------+
```

What does

```
lxd sql global "SELECT * FROM nodes"
```

return?

I’m wondering if the extra node is pending.

Yep. `lxd sql global "SELECT * FROM nodes"` returns:

```
+----+------------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
| id |    name    | description |    address     | schema | api_extensions |           heartbeat            | pending | arch |
+----+------------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
| 1  | lxd110maas |             | 10.41.0.3:8443 | 24     | 165            | 2020-02-21T16:25:24.260859041Z | 0       | 2    |
| 4  | lxd110h01  |             | :8443          | 24     | 165            | 2020-02-12T18:01:07Z           | 1       | 2    |
+----+------------+-------------+----------------+--------+----------------+--------------------------------+---------+------+
```

Okay, you have a pending node. Do you recall trying to add a node to the cluster and it failing midway?

I think we don’t handle this situation quite well. You’ll probably need to run this command:

```
lxd sql global "DELETE FROM nodes WHERE id=4"
```

in order to get rid of that leftover node.

We’ll need a fix on our part to ignore pending nodes at startup.
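To make the failure mode concrete: the "wait for other cluster nodes" loop can be pictured as a check that every row in the global `nodes` table has reached the new schema and API-extension levels, and a dead, pending row can never get there. Here is a minimal sketch of that idea and of the suggested cleanup, using an in-memory SQLite table shaped like the output above. This is an illustration, not LXD's actual code, and the 3.21 schema/API numbers used are hypothetical.

```python
import sqlite3

# Toy model of the relevant columns of LXD's global "nodes" table.
# Imagine the state mid-refresh: the live node has already bumped its
# row to the new levels, while the dead pending node is stuck behind.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes (id INTEGER, name TEXT, schema INTEGER,"
           " api_extensions INTEGER, pending INTEGER)")
db.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?)", [
    (1, "lxd110maas", 25, 170, 0),  # this host, already upgraded
    (4, "lxd110h01",  24, 165, 1),  # leftover pending node, never upgrades
])

TARGET_SCHEMA, TARGET_API = 25, 170  # hypothetical 3.21 levels

def all_nodes_upgraded(db):
    """Sketch of the startup wait: every row must reach the new levels."""
    behind = db.execute(
        "SELECT COUNT(*) FROM nodes WHERE schema < ? OR api_extensions < ?",
        (TARGET_SCHEMA, TARGET_API)).fetchone()[0]
    return behind == 0

print(all_nodes_upgraded(db))  # row 4 keeps the upgrade waiting forever

db.execute("DELETE FROM nodes WHERE id=4")  # the suggested cleanup
print(all_nodes_upgraded(db))  # startup can now proceed
```

With the stale row present the check loops forever, which matches the repeating "Wait for other cluster nodes to upgrade their versions" log lines; once it is deleted, only rows that can actually upgrade remain.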

I’d have to say ‘yes’ with high probability. I’ve been playing with MAAS to provision nodes which will join a cluster. There have been ‘many’ trials where the provisioning has failed.

```
lxd sql global "DELETE FROM nodes WHERE id=4"
```

followed by:

```
sudo snap refresh --channel=stable lxd
```

successfully upgraded to 3.21.

Thanks for the help.