Ah, right, I see what’s going on. We can’t live-update that table because the leader effectively ignores it and relies on its live version of the raft state instead.
So we’ll need to use startup-time DB patches instead to get this sorted once and for all.
On all 3 nodes, create a file at /var/snap/lxd/common/lxd/database/patch.local.sql containing:
UPDATE raft_nodes SET id=4 WHERE id=5;
UPDATE raft_nodes SET id=5 WHERE id=6;
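If it helps, here is a minimal sketch of a shell helper that writes exactly those two statements to patch.local.sql. The function name write_raft_patch is made up for illustration, and the sketch assumes you run it as root on each node.

```shell
# Hypothetical helper: writes the two ID fixes to patch.local.sql
# inside the database directory passed as the first argument.
write_raft_patch() {
    db_dir="$1"
    # Quoted heredoc delimiter so the SQL is written verbatim.
    cat > "${db_dir}/patch.local.sql" <<'EOF'
UPDATE raft_nodes SET id=4 WHERE id=5;
UPDATE raft_nodes SET id=5 WHERE id=6;
EOF
}
```

On each node this would be called as write_raft_patch /var/snap/lxd/common/lxd/database. The order of the two statements matters: ID 5 has to be moved to 4 before ID 6 can take its place.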
Once the file is ready on all 3 nodes, run systemctl reload snap.lxd.daemon on all 3 nodes in very quick succession, so that a node that hasn’t restarted yet doesn’t have time to send bad data to the ones that have already been updated.
They should then all come back online with a sane raft_nodes table, go through leader election and then use the proper IDs for nodes moving forward.
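One way to get the reloads close together is to fire them in parallel from a single machine. This is only a sketch: the hostnames, the root SSH login and the RUN dry-run switch are all assumptions for illustration, not anything LXD provides.

```shell
# Hypothetical hostnames; replace with your own cluster members.
NODES="lxd-node1 lxd-node2 lxd-node3"
# Defaults to a dry run that only prints the commands.
# Set RUN to an empty string to actually execute them.
RUN="${RUN-echo}"

reload_all() {
    for node in $NODES; do
        # Background each call so the reloads start almost simultaneously.
        $RUN ssh "root@${node}" systemctl reload snap.lxd.daemon &
    done
    # Block until every reload command has returned.
    wait
}
```

Calling reload_all as-is just prints the three ssh commands so you can check them first. Running RUN= reload_all executes them for real, assuming key-based SSH access to each node.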
(I really wonder how the two tables got out of sync in the first place. If we see others hitting this issue, we’ll need to find some automatic way of recovering from it.)
Ok, it’s 4am here and my brain is fried so I don’t think I’ll manage to sort this out now.
As far as I can tell, the only negative impact is on the heartbeats.
If you know all your nodes are online (as is the case here so far), try running:
lxc config set cluster.offline_threshold 259200
This will bump the offline threshold from the default of 20s to a rather long 3 days.
Everything should behave just fine with that so long as no node goes offline. If one does, it will not be detected.
This should get you out of the immediate issue until we sort this out.
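The 259200 above is just 3 days expressed in seconds. A small sketch to derive it, in case you would rather pick a different window (the variable names are only illustrative, and the command is printed rather than run):

```shell
# Offline threshold is given in seconds; derive it from a number of days.
days=3
threshold=$((days * 24 * 60 * 60))
# Print the command instead of running it, so it can be reviewed first.
echo "lxc config set cluster.offline_threshold ${threshold}"
```

With days=3 this prints the same command as above, with a value of 259200.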
It’s ok Stephane. Thanks for your help. I will leave it like this for now.
I tried bringing both LXD-NODE1 and LXD-NODE2 up with the DB queries applied, but LXD-NODE1 is still offline.
I don’t think it will help. LXD-NODE1 comes online, and after a few seconds it goes into the offline state.
Something strange though: there seems to be an issue with this date.
“Excluding offline node from refresh: {ID:5 Address:10.0.-.-:8443 Raft:true LastHeartbeat:0001-01-01 00:00:00 +0000 UTC Online:false updated:false}”
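For what it’s worth, 0001-01-01 00:00:00 +0000 UTC is how Go prints a zero-valued time.Time, so that LastHeartbeat most likely means no heartbeat has been recorded for the node at all, rather than being a real date. A rough sketch for pulling the affected node IDs out of log lines like the one above (the function name is made up, and the pattern assumes the exact line format quoted here):

```shell
# Hypothetical helper: reads LXD log lines on stdin and prints the ID of
# every node whose LastHeartbeat is still Go's zero time.
zero_heartbeat_ids() {
    grep 'LastHeartbeat:0001-01-01 00:00:00 +0000 UTC' \
        | sed -n 's/.*{ID:\([0-9][0-9]*\) .*/\1/p'
}
```

It can be fed from wherever your LXD log ends up, for example zero_heartbeat_ids < /var/snap/lxd/common/lxd/logs/lxd.log, assuming the snap’s usual log location.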