Last Snap Refresh has left my LXD cluster barely functioning again - "unix.socket: connect: connection refused"

Tony_Anytime · November 30, 2020, 6:47pm

It is trying; it seems to have updated to 4.8 but just can't start.

stgraber · November 30, 2020, 6:47pm

What’s in that lxd.log file? Maybe it’s failing to start somehow.

Tony_Anytime · November 30, 2020, 6:50pm

Looks like they have different versions of the database.

Tony_Anytime · November 30, 2020, 6:53pm

No other server is complaining about database

Tony_Anytime · November 30, 2020, 7:00pm

Is there a way to query each server about which version of the database it is using?

stgraber · November 30, 2020, 7:01pm

Okay, so we were looking at things the wrong way, I think. Q2 is now on 4.8 and waiting for the others to confirm that they are too.

So it’s likely one of the other 3 that’s failed to update the DB with its new version.
Can you bounce all 3 others with systemctl reload snap.lxd.daemon?

stgraber · November 30, 2020, 7:02pm

You can also check lxd.log on all 4. You should see them all start properly as 4.8 and then hang for a bit in the DB stage as they send heartbeats around to confirm the versions are consistent, then things should unblock.

If not, then either one of them still isn’t on the right revision or there’s some kind of DB connectivity issue preventing one of them from updating its record.

Tony_Anytime · November 30, 2020, 7:06pm

Catch-22, so what can I do?

Tony_Anytime · November 30, 2020, 7:08pm

They say they are on 18402

stgraber · November 30, 2020, 7:09pm

Ok, can you show ps aux | grep lxd.*logfile and the content of /var/snap/lxd/common/lxd/logs/lxd.log from all 4 at this point?

Tony_Anytime · November 30, 2020, 7:12pm

mm

Tony_Anytime · November 30, 2020, 7:13pm

They are running

stgraber · November 30, 2020, 7:20pm

You’ve been showing me 4 systems but the log indicates 5 systems.

stgraber · November 30, 2020, 7:21pm

The 4 you have online now appear generally happy and talking to each other, one of them got elected leader but they’re all apparently waiting for a 5th server to come online with version 4.8.

stgraber · November 30, 2020, 7:23pm

Looks like it's 84.17.40.59.

Tony_Anytime · November 30, 2020, 7:25pm

That was a server that is no longer available.
I forgot all about it. Can we kill it?

Tony_Anytime · November 30, 2020, 7:26pm

Any lxc command just hangs.

stgraber · November 30, 2020, 7:26pm

Can you try lxd sql global "SELECT * FROM nodes;" on one of the machines?

Tony_Anytime · November 30, 2020, 7:27pm

lxd sql global "SELECT * FROM nodes;"
+----+------+-------------+----------+--------+----------------+-------------------------------------+---------+------+-------------------+
| id | name | description | address  | schema | api_extensions | heartbeat                           | pending | arch | failure_domain_id |
+----+------+-------------+----------+--------+----------------+-------------------------------------+---------+------+-------------------+
| 1  | Q1   |             | 84.:8443 | 39     | 215            | 2020-11-30T14:27:07.028999416-05:00 | 0       | 2    |                   |
| 2  | Q3   |             | 848443   | 39     | 215            | 2020-11-30T14:27:07.0292753-05:00   | 0       | 2    |                   |
| 3  | Q2   |             | 84:8443  | 39     | 215            | 2020-11-30T14:27:07.02952663-05:00  | 0       | 2    |                   |
| 4  | Q4   |             | 84.443   | 39     | 215            | 2020-11-30T14:27:07.028601672-05:00 | 0       | 2    |                   |
| 5  | q5   |             | 8:8443   | 39     | 212            | 2020-11-16T08:04:47.013113435-05:00 | 0       | 2    |                   |
+----+------+-------------+----------+--------+----------------+-------------------------------------+---------+------+-------------------+

stgraber · November 30, 2020, 7:27pm

The user API will hang until the database schema is supported by all servers, otherwise things could break in very unpredictable ways. The internal API is immune to that, so you can use it for this kind of recovery.
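
For anyone hitting the same hang, the stale member can be picked out of the nodes table mechanically. Below is a minimal Python sketch, assuming the table layout printed by lxd sql global in this thread. The helper names (parse_nodes, lagging_nodes) and the trimmed SAMPLE table are made up for illustration, and comparing (schema, api_extensions) pairs is only an approximation of the check, not LXD's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    schema: int
    api_extensions: int
    heartbeat: str


def parse_nodes(table: str) -> list[Node]:
    """Parse an ASCII table like the one printed for `SELECT * FROM nodes;`."""
    # Keep only header/data rows; border lines start with "+", not "|".
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.splitlines()
        if line.strip().startswith("|")
    ]
    header, *body = rows
    col = {name: i for i, name in enumerate(header)}
    return [
        Node(
            name=r[col["name"]],
            schema=int(r[col["schema"]]),
            api_extensions=int(r[col["api_extensions"]]),
            heartbeat=r[col["heartbeat"]],
        )
        for r in body
    ]


def lagging_nodes(nodes: list[Node]) -> list[Node]:
    """Return members whose recorded versions are behind the newest member."""
    newest = max((n.schema, n.api_extensions) for n in nodes)
    return [n for n in nodes if (n.schema, n.api_extensions) < newest]


# Hypothetical sample: the thread's output with some columns trimmed.
SAMPLE = """
+----+------+--------+----------------+-------------------------------------+
| id | name | schema | api_extensions | heartbeat                           |
+----+------+--------+----------------+-------------------------------------+
| 1  | Q1   | 39     | 215            | 2020-11-30T14:27:07.028999416-05:00 |
| 3  | Q2   | 39     | 215            | 2020-11-30T14:27:07.02952663-05:00  |
| 5  | q5   | 39     | 212            | 2020-11-16T08:04:47.013113435-05:00 |
+----+------+--------+----------------+-------------------------------------+
"""

if __name__ == "__main__":
    for node in lagging_nodes(parse_nodes(SAMPLE)):
        print(node.name, node.api_extensions, node.heartbeat)
```

On the sample this flags only q5, the member with api_extensions 212 and a heartbeat from November 16, which matches the server the other four are waiting on.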