Cluster node appears offline after upgrade to 3.15

TomvB · July 18, 2019, 6:57am

It seems to work again! Thanks

It is still strange. I haven’t touched the cluster for a long time.

TomvB · July 18, 2019, 6:58am

Wait.
https://192.168.1.1:8443 | YES | OFFLINE | no heartbeat since 2m20.939295585s |

It’s automatically changing it?

stgraber · July 18, 2019, 7:03am

Can you do the usual round of:

lxd sql global "SELECT * FROM nodes;" (on any one of the nodes)
lxd sql local "SELECT * FROM raft_nodes;" (on every one of the nodes)

I’d like to see if it reverted to the same value as earlier.

TomvB · July 18, 2019, 7:07am

Yes, same as before.
https://discuss.linuxcontainers.org/t/lxd-3-15-has-been-released/5218/26?u=tomvb

stgraber · July 18, 2019, 7:10am

Ah, right, I see what’s going on. We can’t live-update that table because the leader effectively ignores it and relies on its live version of the raft state instead.

So we’ll need to use startup time DB patches instead to get this sorted once and for all.

On all 3 nodes, create a file at /var/snap/lxd/common/lxd/database/patch.local.sql containing:

UPDATE raft_nodes SET id=4 WHERE id=5;
UPDATE raft_nodes SET id=5 WHERE id=6;

Once the file is ready on all 3 nodes, run systemctl reload snap.lxd.daemon on all 3 nodes, in very quick succession (so one that hasn’t restarted yet won’t have the time to send bad data to the ones that have been updated).

They should then all come back up online with a sane raft_nodes table, go through leader election and then use the proper IDs for nodes moving forward.
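
As a rough sketch of the patch-file step above: the helper name write_raft_patch and its optional directory argument are made up for illustration, while the default path and the two UPDATE statements are taken from the post. It only writes the file; it does not reload LXD.

```shell
# Hypothetical helper (not part of LXD): write the startup patch file.
# The optional first argument overrides the database directory, which
# defaults to the snap path quoted in the post above.
write_raft_patch() {
    db_dir="${1:-/var/snap/lxd/common/lxd/database}"
    # Same statements, in the same order, as given in the post.
    cat > "$db_dir/patch.local.sql" <<'EOF'
UPDATE raft_nodes SET id=4 WHERE id=5;
UPDATE raft_nodes SET id=5 WHERE id=6;
EOF
}
```

Run as root on each node, write_raft_patch with no argument would create the file at the snap path; the systemctl reload snap.lxd.daemon step then still has to be done on all 3 nodes in quick succession.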

stgraber · July 18, 2019, 7:23am

@TomvB how did that go?

stgraber · July 18, 2019, 7:36am

I’ve moved this thread from LXD 3.15 has been released into its own topic to keep things easier to search.

TomvB · July 18, 2019, 7:36am

Nope. I tried it 3 times. It’s reverting the changes.

stgraber · July 18, 2019, 7:41am

Can you confirm that the patch.local.sql file disappeared every time?

Anyway, let’s try to avoid that race entirely:

Run kill $(cat /var/snap/lxd/common/lxd.pid) on all 3 nodes
Confirm that LXD is offline after that with: ps aux | grep lxd.*logfile
Write the patch file again
Start LXD back up by running: lxc info

Note that you’ll need to run lxc info on at least two nodes before it will respond as it needs to get quorum for the database.
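
One possible way to double-check the "LXD is offline" step, as a sketch only: the helper name lxd_pid_alive is hypothetical, and it assumes the pid file from the kill command above still holds the daemon's PID. It uses kill -0, which probes a process without sending it a signal.

```shell
# Hypothetical helper (not part of LXD): succeed if the PID recorded in a
# pid file still belongs to a running process. Run as root, otherwise
# kill -0 can fail on permissions and the process looks dead when it is not.
lxd_pid_alive() {
    pid_file="${1:-/var/snap/lxd/common/lxd.pid}"
    # A missing or unreadable pid file is treated as "not running".
    [ -r "$pid_file" ] || return 1
    pid=$(cat "$pid_file")
    [ -n "$pid" ] || return 1
    # Signal 0 only checks that the process exists.
    kill -0 "$pid" 2>/dev/null
}
```

After the kill, something like "while lxd_pid_alive; do sleep 1; done" would wait for the daemon to exit before the patch file is written again. The ps aux | grep check from the post remains the more direct confirmation.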

stgraber · July 18, 2019, 7:42am

As all LXD daemons will be offline this time around, this should avoid the in-memory list of nodes messing things up for us.

stgraber · July 18, 2019, 7:43am

(I really wonder how the two tables got out of sync in the first place. If we see others hitting this issue, we’ll need to find some automatic way of recovering from this.)

TomvB · July 18, 2019, 7:50am

> Can you confirm that the patch.local.sql file disappeared every time?

Yes

> Run kill $(cat /var/snap/lxd/common/lxd.pid) on all 3 nodes

Done

> Confirm that LXD is offline after that with: ps aux | grep lxd.*logfile

Done

> Write the patch file again

Done

> Start LXD back up by running: lxc info

It hangs, so I started LXD with systemctl start snap.lxd.daemon instead (executed simultaneously on all 3 nodes).

CyrusTheVirusG · July 18, 2019, 8:04am

Same issue.

stgraber · July 18, 2019, 8:08am

Ok, it’s 4am here and my brain is fried so I don’t think I’ll manage to sort this out now.
As far as I can tell, the only negative impact is on the heartbeats.

If you know all your nodes are online (as is the case here so far), try running:

lxc config set cluster.offline_threshold 259200

This will bump the offline threshold from the default of 20s to a rather long 3 days.

Everything should behave just fine with that so long as no node goes offline. If one does, it will not be detected.

This should get you out of the immediate issue until we sort this out.
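
For reference, the 259200 figure is just 3 days expressed in seconds. A small sketch of the arithmetic; the variable names are made up, and the lxc commands are left commented out so nothing is changed by running it:

```shell
# Offline threshold in seconds for a given number of days.
days=3
threshold=$((days * 24 * 60 * 60))
echo "$threshold"   # prints 259200

# Apply it (same command as in the post):
# lxc config set cluster.offline_threshold "$threshold"
# Go back to the 20s default once the underlying issue is fixed:
# lxc config set cluster.offline_threshold 20
```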

TomvB · July 18, 2019, 8:09am

It’s ok Stephane. Thanks for your help. I will leave it like this for now.
I tried to turn LXD-NODE1 and LXD-NODE2 both on with the db queries. LXD-NODE1 is still offline.

I think it won’t help. LXD-NODE1 comes online and after a few seconds goes back to the offline state.

CyrusTheVirusG · July 18, 2019, 8:10am

After trying to patch, I’m getting this error when issuing commands, on all nodes:

Error: cannot fetch node config from database: driver: bad connection

stgraber · July 18, 2019, 8:11am

Hmm, so prior to issuing this lxc config set, your database was properly responding and after it, you’re getting this error?

CyrusTheVirusG · July 18, 2019, 8:12am

Another sudo systemctl reload snap.lxd.daemon fixed it. All good, thanks.

TomvB · July 18, 2019, 8:13am

lxc config set cluster.offline_threshold 259200 works fine here for now.

CyrusTheVirusG · July 18, 2019, 8:14am

Something strange though: there seems to be an issue with this date.
“Excluding offline node from refresh: {ID:5 Address:10.0.-.-:8443 Raft:true LastHeartbeat:0001-01-01 00:00:00 +0000 UTC Online:false updated:false}”