Last snap refresh has left my LXD cluster barely functioning again - "unix.socket: connect: connection refused"

It is trying. It seems to have updated to 4.8 but just can't start.

What’s in that lxd.log file? Maybe it’s failing to start somehow.

Looks like they have different versions of the database.

No other server is complaining about the database.

Is there a way to query each about what version of database they are using?

Okay, so we were looking at things the wrong way, I think. Q2 is now on 4.8 and waiting for the others to confirm that they are too.

So it’s likely one of the other 3 that’s failed to update the DB with its new version.
Can you bounce all 3 others with systemctl reload snap.lxd.daemon?

You can also check lxd.log on all 4. You should see them all start properly as 4.8 and then hang for a bit in the DB stage as they send heartbeats around to confirm the versions are consistent, then things should unblock.

If not, then either one of them still isn’t on the right revision or there’s some kind of DB connectivity issue preventing one of them from updating its record.

Catch-22, so what can I do?

They say they are on 18402

Ok, can you show ps aux | grep lxd.*logfile and the content of /var/snap/lxd/common/lxd/logs/lxd.log from all 4 at this point?
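
If it helps, the same two checks can be gathered from every member in one pass. This is only a rough sketch: the host names `q1` to `q4` and the use of `ssh` to reach them are assumptions, not details from this thread, so adjust both to your setup.

```shell
#!/bin/sh
# Hypothetical host names; replace with the real cluster members.
HOSTS="${HOSTS:-q1 q2 q3 q4}"
# Command used to reach each host (assumed to be ssh with key-based login).
REMOTE="${REMOTE:-ssh}"

# Print the LXD daemon process line and the full lxd.log for each host.
collect_lxd_state() {
    for host in $HOSTS; do
        echo "===== $host ====="
        $REMOTE "$host" 'ps aux | grep "lxd.*logfile"; cat /var/snap/lxd/common/lxd/logs/lxd.log'
    done
}
```

Running `collect_lxd_state > lxd-state.txt` would then leave one file with a section per host that can be pasted into a reply.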

mm

They are running

You’ve been showing me 4 systems but the log indicates 5 systems.

The 4 you have online now appear generally happy and talking to each other, one of them got elected leader but they’re all apparently waiting for a 5th server to come online with version 4.8.

Looks like 84.17.40.59.

That was a server that is no longer available.
I forgot all about it. Can we kill it?

Any lxc command just hangs.

Can you try lxd sql global "SELECT * FROM nodes;" on one of the machines?

lxd sql global "SELECT * FROM nodes;"
+----+------+-------------+----------+--------+----------------+-------------------------------------+---------+------+-------------------+
| id | name | description | address  | schema | api_extensions | heartbeat                           | pending | arch | failure_domain_id |
+----+------+-------------+----------+--------+----------------+-------------------------------------+---------+------+-------------------+
| 1  | Q1   |             | 84.:8443 | 39     | 215            | 2020-11-30T14:27:07.028999416-05:00 | 0       | 2    |                   |
| 2  | Q3   |             | 848443   | 39     | 215            | 2020-11-30T14:27:07.0292753-05:00   | 0       | 2    |                   |
| 3  | Q2   |             | 84:8443  | 39     | 215            | 2020-11-30T14:27:07.02952663-05:00  | 0       | 2    |                   |
| 4  | Q4   |             | 84.443   | 39     | 215            | 2020-11-30T14:27:07.028601672-05:00 | 0       | 2    |                   |
| 5  | q5   |             | 8:8443   | 39     | 212            | 2020-11-16T08:04:47.013113435-05:00 | 0       | 2    |                   |
+----+------+-------------+----------+--------+----------------+-------------------------------------+---------+------+-------------------+

The user API will hang until the database schema is supported by all servers, otherwise things could break in very unpredictable ways. The internal API is immune to that, so you can use it for this kind of recovery.
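
In the nodes table above, q5 is the only row with a lower api_extensions value (212 against 215) and a heartbeat two weeks old. On a larger cluster that is easy to miss by eye, so here is a small sketch that picks such rows out of the table text. The function name `lagging_members` is made up for illustration, and the field positions are assumed from the table layout shown in this thread.

```shell
#!/bin/sh
# Reads the ASCII table printed by `lxd sql global "SELECT * FROM nodes;"`
# on stdin and prints "name api_extensions" for every member whose
# api_extensions value is below the highest one seen.
lagging_members() {
    awk -F'|' '
        # Data rows have a numeric id in the first column; the header and
        # the +----+ border lines do not match and are skipped.
        # Splitting on "|" makes $3 the name and $7 api_extensions.
        $2 ~ /^ *[0-9]+ *$/ {
            name = $3; ext = $7
            gsub(/ /, "", name); gsub(/ /, "", ext)
            names[++n] = name; exts[n] = ext + 0
            if (exts[n] > max) max = exts[n]
        }
        END {
            for (i = 1; i <= n; i++)
                if (exts[i] < max) print names[i], exts[i]
        }
    '
}
```

It would be used as `lxd sql global "SELECT * FROM nodes;" | lagging_members`. It only reports which members lag behind; it does not change anything in the database.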