Cluster node appears offline after upgrade to 3.15

It seems to work again! Thanks

+-----------+--------------------------+----------+--------+-------------------+
|   NAME    |           URL            | DATABASE | STATE  |      MESSAGE      |
+-----------+--------------------------+----------+--------+-------------------+
| lxd-node1 | https://192.168.1.1:8443 | YES      | ONLINE | fully operational |
+-----------+--------------------------+----------+--------+-------------------+
| lxd-node2 | https://192.168.1.2:8443 | NO       | ONLINE | fully operational |
+-----------+--------------------------+----------+--------+-------------------+
| lxd-node3 | https://192.168.1.3:8443 | YES      | ONLINE | fully operational |
+-----------+--------------------------+----------+--------+-------------------+

It is still strange. I haven’t touched the cluster for a long time.

Wait.
https://192.168.1.1:8443 | YES | OFFLINE | no heartbeat since 2m20.939295585s |

It’s automatically changing it?

Can you do the usual round of:

  • lxd sql global "SELECT * FROM nodes;" (on any one of the nodes)
  • lxd sql local "SELECT * FROM raft_nodes;" (on every one of the nodes)

I’d like to see if it reverted to the same value as earlier.
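For reference, a rough sketch of collecting those two outputs in one go (hostnames below are placeholders; adjust to your own nodes and access method):

# On any one node: the cluster-wide view
lxd sql global "SELECT * FROM nodes;"

# On every node: the local raft view, e.g. over SSH
for h in lxd-node1 lxd-node2 lxd-node3; do
  echo "== $h =="
  ssh "$h" -- lxd sql local "SELECT * FROM raft_nodes;"
done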

Yes, same as before.
https://discuss.linuxcontainers.org/t/lxd-3-15-has-been-released/5218/26?u=tomvb

Ah, right, I see what’s going on: we can’t live-update that table because the leader effectively ignores it and relies on its live version of the raft state instead.

So we’ll need to use startup time DB patches instead to get this sorted once and for all.

On all 3 nodes, create a file at /var/snap/lxd/common/lxd/database/patch.local.sql containing:

UPDATE raft_nodes SET id=4 WHERE id=5;
UPDATE raft_nodes SET id=5 WHERE id=6;

Once the file is ready on all 3 nodes, run systemctl reload snap.lxd.daemon on all 3 nodes, in very quick succession (so one that hasn’t restarted yet won’t have the time to send bad data to the ones that have been updated).
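A quick sketch of that sequence on one node, assuming the snap paths above and root privileges (repeat on all 3 nodes):

# Write the startup-time DB patch; LXD applies it at startup and then removes the file
cat > /var/snap/lxd/common/lxd/database/patch.local.sql << 'EOF'
UPDATE raft_nodes SET id=4 WHERE id=5;
UPDATE raft_nodes SET id=5 WHERE id=6;
EOF

# Then, in quick succession on all 3 nodes:
systemctl reload snap.lxd.daemon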

They should then all come back up online with a sane raft_nodes table, go through leader election and then use the proper IDs for nodes moving forward.

@TomvB how did that go?

I’ve moved this thread from “LXD 3.15 has been released” into its own topic to keep things easier to search.

Nope. I tried it 3 times. It’s reverting the changes.

Can you confirm that the patch.local.sql file disappeared every time?

Anyway, let’s try to avoid that race entirely:

  • Run kill $(cat /var/snap/lxd/common/lxd.pid) on all 3 nodes
  • Confirm that LXD is offline after that with: ps aux | grep lxd.*logfile
  • Write the patch file again
  • Start LXD back up by running: lxc info

Note that you’ll need to run lxc info on at least two nodes before it will respond as it needs to get quorum for the database.

As all LXD daemons will be offline this time around, this should avoid the in-memory list of nodes messing things up for us.
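Put together, the sequence on each node looks roughly like this (same assumptions as above: snap paths, run as root):

# 1. Stop LXD directly, bypassing systemd
kill $(cat /var/snap/lxd/common/lxd.pid)

# 2. Confirm no LXD daemon is still running (should show nothing but the grep itself)
ps aux | grep lxd.*logfile

# 3. Re-create the patch file
cat > /var/snap/lxd/common/lxd/database/patch.local.sql << 'EOF'
UPDATE raft_nodes SET id=4 WHERE id=5;
UPDATE raft_nodes SET id=5 WHERE id=6;
EOF

# 4. Start LXD back up; this blocks until at least two nodes are up (database quorum)
lxc info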

(I really wonder how the two tables got out of sync in the first place; if we see others hitting this issue, we’ll need to find some automatic way of recovering from this)

Can you confirm that the patch.local.sql file disappeared every time?
Yes

  • Run kill $(cat /var/snap/lxd/common/lxd.pid) on all 3 nodes

Done

  • Confirm that LXD is offline after that with: ps aux | grep lxd.*logfile

Done

  • Write the patch file again

Done

  • Start LXD back up by running: lxc info

It hangs, so I did it with systemctl start snap.lxd.daemon instead (executed simultaneously on all 3 nodes).

Result after a few seconds:
+-----------+--------------------------+----------+---------+------------------------------------+
|   NAME    |           URL            | DATABASE |  STATE  |              MESSAGE               |
+-----------+--------------------------+----------+---------+------------------------------------+
| lxd-node1 | https://192.168.1.1:8443 | YES      | OFFLINE | no heartbeat since 2m19.402855264s |
+-----------+--------------------------+----------+---------+------------------------------------+
| lxd-node2 | https://192.168.1.2:8443 | YES      | ONLINE  | fully operational                  |
+-----------+--------------------------+----------+---------+------------------------------------+
| lxd-node3 | https://192.168.1.3:8443 | YES      | ONLINE  | fully operational                  |
+-----------+--------------------------+----------+---------+------------------------------------+

Same issue.

Ok, it’s 4am here and my brain is fried so I don’t think I’ll manage to sort this out now.
As far as I can tell, the only negative impact is on the heartbeats.

If you know all your nodes are online (as is the case here so far), try running:

lxc config set cluster.offline_threshold 259200

This will bump the offline threshold from the default of 20s to a rather long 3 days.

Everything should behave just fine with that, so long as no node goes offline; if one does, it will not be detected.

This should get you out of the immediate issue until we sort this out.
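To verify the change, and later to go back to the default once the underlying problem is fixed, the standard config commands should do (a sketch):

# Confirm the new threshold (in seconds)
lxc config get cluster.offline_threshold

# Once things are sorted out, revert to the default
lxc config unset cluster.offline_threshold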


It’s ok, Stephane. Thanks for your help. I will leave it like this for now.
I tried bringing both LXD-NODE1 and LXD-NODE2 up with the DB queries applied. LXD-NODE1 is still offline.

I think it won’t help. LXD-NODE1 comes online and then goes back into the offline state after a few seconds.

After trying to patch, I’m getting: Error: cannot fetch node config from database: driver: bad connection

This happens when issuing commands on all nodes.

Hmm, so prior to issuing this lxc config set, your database was properly responding and after it, you’re getting this error?

Another sudo systemctl reload snap.lxd.daemon fixed it. All good, thanks.

lxc config set cluster.offline_threshold 259200 works fine here for now.

Something strange though: there seems to be an issue with this date.
“Excluding offline node from refresh: {ID:5 Address:10.0.-.-:8443 Raft:true LastHeartbeat:0001-01-01 00:00:00 +0000 UTC Online:false updated:false}”