Wait.
https://192.168.1.1:8443 | YES | OFFLINE | no heartbeat since 2m20.939295585s |
It’s automatically changing it?
Can you do the usual round of:
I’d like to see if it reverted to the same value as earlier.
Yes, same as before.
https://discuss.linuxcontainers.org/t/lxd-3-15-has-been-released/5218/26?u=tomvb
Ah, right, I see what’s going on: we can’t live-update that table because the leader effectively ignores it and relies on its live version of the raft state instead.
So we’ll need to use startup time DB patches instead to get this sorted once and for all.
On all 3 nodes, create a file at /var/snap/lxd/common/lxd/database/patch.local.sql containing:
UPDATE raft_nodes SET id=4 WHERE id=5;
UPDATE raft_nodes SET id=5 WHERE id=6;
Once the file is ready on all 3 nodes, run systemctl reload snap.lxd.daemon on all 3 nodes, in very quick succession (so that a node that hasn’t restarted yet won’t have time to send bad data to the ones that have been updated).
They should then all come back up online with a sane raft_nodes table, go through leader election and then use the proper IDs for nodes moving forward.
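The patch step above could be scripted per node roughly as follows. This is only a sketch: the helper name write_raft_patch and its directory argument are hypothetical, while the file name and the two SQL statements are the ones given above.

```shell
#!/bin/sh
# Hypothetical helper: write the startup-time DB patch into a database directory.
# On a snap install that directory would be /var/snap/lxd/common/lxd/database.
write_raft_patch() {
    db_dir="$1"
    cat > "$db_dir/patch.local.sql" <<'EOF'
UPDATE raft_nodes SET id=4 WHERE id=5;
UPDATE raft_nodes SET id=5 WHERE id=6;
EOF
}

# Usage on each node (as root), followed right away by the reload:
#   write_raft_patch /var/snap/lxd/common/lxd/database
#   systemctl reload snap.lxd.daemon
```

Taking the directory as an argument makes it possible to try the helper against a scratch directory before touching the real database path.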
I’ve moved this thread from LXD 3.15 has been released into its own topic to keep things easier to search.
Nope. I tried it 3 times. It’s reverting the changes.
Can you confirm that the patch.local.sql file disappeared every time?
Anyway, let’s try to avoid that race entirely:
kill $(cat /var/snap/lxd/common/lxd.pid)
on all 3 nodes
Note that you’ll need to run lxc info on at least two nodes before it will respond, as it needs to get quorum for the database.
As all LXD daemons will be offline this time around, this should avoid the in-memory list of nodes messing things up for us.
(I really wonder how the two tables got out of sync in the first place. If we see others hitting this issue, we’ll need to find some automatic way of recovering from it.)
Can you confirm that the patch.local.sql file disappeared every time?
Yes
kill $(cat /var/snap/lxd/common/lxd.pid)
on all 3 nodes
Done
Done
Done
It hangs, so I did it with systemctl start snap.lxd.daemon (executed simultaneously on all 3 nodes)
Result after a few seconds:
+-----------+--------------------------+----------+---------+------------------------------------+
| NAME      | URL                      | DATABASE | STATE   | MESSAGE                            |
+-----------+--------------------------+----------+---------+------------------------------------+
| lxd-node1 | https://192.168.1.1:8443 | YES      | OFFLINE | no heartbeat since 2m19.402855264s |
+-----------+--------------------------+----------+---------+------------------------------------+
| lxd-node2 | https://192.168.1.2:8443 | YES      | ONLINE  | fully operational                  |
+-----------+--------------------------+----------+---------+------------------------------------+
| lxd-node3 | https://192.168.1.3:8443 | YES      | ONLINE  | fully operational                  |
+-----------+--------------------------+----------+---------+------------------------------------+
Same issue.
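When repeating this check after each attempt, a small filter can pull out just the members reported as OFFLINE. This is a sketch only: offline_members is a hypothetical helper, and it assumes the table layout shown in the output above (NAME in the first column, STATE in the fourth).

```shell
#!/bin/sh
# Hypothetical helper: print the NAME of every member whose STATE column is OFFLINE.
# Reads the table printed by `lxc cluster list` on stdin.
offline_members() {
    awk -F'|' '
        # Rows split on "|" into: "", NAME, URL, DATABASE, STATE, MESSAGE, "".
        # Border lines contain no "|" and are skipped by the NF check.
        NF >= 6 {
            name = $2; state = $5
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim padding around the name
            gsub(/^[ \t]+|[ \t]+$/, "", state)   # trim padding around the state
            if (state == "OFFLINE") print name
        }
    '
}

# Usage:
#   lxc cluster list | offline_members
```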
Ok, it’s 4am here and my brain is fried so I don’t think I’ll manage to sort this out now.
As far as I can tell, the only negative impact is on the heartbeats.
If you know all your nodes are online (as is the case here so far), try running:
lxc config set cluster.offline_threshold 259200
This will bump the offline threshold from the default of 20s to a rather long 3 days.
Everything should behave just fine with that so long as no node goes offline; if one does, it will not be detected.
This should get you out of the immediate issue until we sort this out.
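For reference, the 259200 figure is simply 3 days expressed in seconds, which can be checked with a quick calculation. The helper name days_to_seconds below is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a number of days to seconds,
# the unit the offline threshold value above is expressed in.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

days_to_seconds 3   # prints 259200

# Usage with the command from above:
#   lxc config set cluster.offline_threshold "$(days_to_seconds 3)"
```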
It’s ok Stephane. Thanks for your help. I will leave it like this for now.
I tried to turn LXD-NODE1 and LXD-NODE2 both on with the db queries. LXD-NODE1 is still offline.
I think it won’t help. LXD-NODE1 comes online and after a few seconds it goes into the offline state.
After trying to patch, I’m getting this error when issuing commands, on all nodes:
Error: cannot fetch node config from database: driver: bad connection
Hmm, so prior to issuing this lxc config set, your database was properly responding and after it, you’re getting this error?
Another sudo systemctl reload snap.lxd.daemon fixed it. All good, thanks.
lxc config set cluster.offline_threshold 259200 works fine here for now.
Something strange though: there seems to be an issue with this date.
“Excluding offline node from refresh: {ID:5 Address:10.0.-.-:8443 Raft:true LastHeartbeat:0001-01-01 00:00:00 +0000 UTC Online:false updated:false}”
That’s “expected” given the issue. We’re effectively not finding a match for the node due to the ID mismatch, so you’re getting the 0 date timestamp rather than a valid value.