Aha,
lxd-node1 | https://192.168.1.1:8443 | YES | OFFLINE | no heartbeat since 5h18m27.10067886s
Did the systemctl reload snap.lxd.daemon on that node fix the issue?
If not, you'll want to make sure that the revision in snap list matches that of the other nodes. If that's the case, then look at:
- cat /var/snap/lxd/common/lxd/logs/lxd.log
- journalctl -u snap.lxd.daemon -n 300
Hopefully that will give us some idea of what may be going on there.
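If it helps, here are the same checks as one rough copy/paste block (assuming the standard snap install paths):

# compare the snap revision on this node against the other nodes
snap list lxd
# then look for errors in the LXD log and the daemon journal
cat /var/snap/lxd/common/lxd/logs/lxd.log
journalctl -u snap.lxd.daemon -n 300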
lxd-node1 | https://192.168.1.1:8443 | YES | OFFLINE | no heartbeat since 5h18m27.10067886s
Name Version Rev Tracking Publisher Notes
canonical-livepatch 9.4.1 81 stable canonical✓ -
core 16-2.39.3 7270 stable canonical✓ core
lxd 3.15 11254 stable canonical✓ -
lxd-node2 | https://192.168.1.2:8443 | YES | ONLINE | fully operational
Name Version Rev Tracking Publisher Notes
canonical-livepatch 9.4.1 81 stable canonical✓ -
core 16-2.39.3 7270 stable canonical✓ core
lxd 3.15 11254 stable canonical✓ -
I don't see any details/issues in the logging. A reboot doesn't help either; it stays offline. Only 1 node out of the 3. The machine is clean without any extras (only the firewall).
t=2019-07-18T07:17:18+0200 lvl=warn msg="Excluding offline node from refresh: {ID:5 Address:192.168.1.2:8443 Raft:true LastHeartbeat:2019-07-18 01:38:30.60775629 +0200 CEST Online:false updated:false}"
t=2019-07-18T07:17:18+0200 lvl=warn msg="Excluding offline node from refresh: {ID:6 Address:192.168.1.1:8443 Raft:true LastHeartbeat:0001-01-01 00:00:00 +0000 UTC Online:false updated:false}"
Ok, can you do:
- sudo systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket
- sudo rm /var/snap/lxd/common/lxd/unix.socket
- sudo lxd --debug --group lxd
That should give a few more hints as to what's going on.
In any case we should be able to blow the global database away and let it sync from node2, but it'd be better to know what's going on before taking such drastic measures.
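For reference, the same debug sequence as one rough block (paths assume the snap package):

# stop the daemon and its socket unit so nothing re-creates the unix socket
sudo systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket
# remove the stale socket, then run LXD in the foreground with debug output
sudo rm /var/snap/lxd/common/lxd/unix.socket
sudo lxd --debug --group lxd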
Logging (from lxd-node1, 192.168.1.1):
https://pastebin.com/6iBvvJcP
I think something went wrong in the DB on this machine. Everything works fine when trying lxc list from node1. It's just "offline" for all nodes.
How can I fix the db?
The fact that LXD started properly makes it pretty unlikely that it's just a bad database node, so I'd be pretty careful about just forcing a re-sync of the database at this point.
Can you show for every single node:
- lxd sql local "SELECT * FROM raft_nodes;"
- lxd sql global "SELECT * FROM nodes;"
- lxc cluster list
This should make any database misconfiguration somewhat obvious, then we can take things from there.
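If it's easier, something along these lines on each node should collect all three outputs into one file (just a sketch; the output file name is only an example):

# run on every node; note the quoting around the SQL statements
{
  lxd sql local "SELECT * FROM raft_nodes;"
  lxd sql global "SELECT * FROM nodes;"
  lxc cluster list
} > cluster-debug-$(hostname).txt 2>&1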
For the node you were manually debugging, you'll want to ctrl+c that command and then start LXD back up with:
- systemctl start snap.lxd.daemon
Output from nodes:
https://pastebin.com/Tcdig72A
Ok, so your database actually looks fine. The issue appears to be that 192.168.1.1:8443 isn't responding to the heartbeats from the cluster leader (currently 192.168.1.3:8443).
Can you run lxc monitor --pretty --type=logging on 192.168.1.3 for at least 20s and then give me the output?
Capturing that output for 20s should guarantee we see a full heartbeat cycle in there and can hopefully see why the leader considers 192.168.1.1:8443 to be unresponsive.
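Something like this should capture a bit more than one full cycle (a rough sketch; timeout comes from coreutils and the file name is arbitrary):

# record ~30s of leader-side logging, enough for a full heartbeat round
timeout 30 lxc monitor --pretty --type=logging | tee heartbeat.log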
Output: https://pastebin.com/p75zs9v5
Wow, that's very odd as that shows a successful heartbeat to 192.168.1.1:8443, but then it's still considered to be offline. Let me check past output again, I must have missed something obvious in there…
And indeed I did, the global and local node IDs don't match up for some reason, so it's failing to update the right database record…
Quick workaround to get things lined up again would be:
- lxd sql global "UPDATE nodes SET id=6 WHERE address='192.168.1.1:8443'"
- lxd sql global "UPDATE nodes SET id=5 WHERE address='192.168.1.2:8443'"
Then give it 20s for the heartbeats to do their magic and things should report as being fine again.
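Put together, with a quick check afterwards (a sketch of the same workaround; the global database is shared, so running it on any one node is enough):

# re-number the global records so they match the raft ids
lxd sql global "UPDATE nodes SET id=6 WHERE address='192.168.1.1:8443'"
lxd sql global "UPDATE nodes SET id=5 WHERE address='192.168.1.2:8443'"
# after ~20s the heartbeats should refresh the entries
lxd sql global "SELECT id, name, address, heartbeat FROM nodes;"
lxc cluster list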
| lxd-node1 | https://192.168.1.1:8443 | YES | OFFLINE | no heartbeat since 6h42m41.505302667s |
+-----------+--------------------------+-----+---------+---------------------------------------+
| lxd-node2 | https://192.168.1.2:8443 | YES | ONLINE  | fully operational                     |
+-----------+--------------------------+-----+---------+---------------------------------------+
| lxd-node3 | https://192.168.1.3:8443 | YES | ONLINE  | fully operational                     |
+-----------+--------------------------+-----+---------+---------------------------------------+
On all nodes:
user@lxd-node2:~$ lxd sql global "UPDATE nodes SET id=5 WHERE address='192.168.1.2:8443'"
Rows affected: 0
user@lxd-node2:~$ lxd sql global "UPDATE nodes SET id=6 WHERE address='192.168.1.1:8443'"
Rows affected: 0
Can you show lxd sql global "SELECT * FROM nodes;" from any of the nodes?
lxd-node2:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:28:12.002876571+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:28:12.00301448+02:00  | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
lxd-node3:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:29:34.843975163+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:29:34.844244343+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
lxd-node1:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:30:27.436329805+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:30:27.436646546+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
Ok, this is getting a bit annoying as we can't easily fix the global table because of foreign key constraints.
I'm still not sure why the leader isn't sending the expected ids though, as that should cause all the other nodes to override their local raft_nodes table with the ids matching the nodes table, fixing this issue.
Can you try running systemctl reload snap.lxd.daemon on all 3 nodes?
That will cause a new leader to get elected for sure, causing the raft_nodes table to get updated everywhere and hopefully get the two lined up again.
Once you've run it on all 3 nodes, wait 20s and check the cluster list again.
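Roughly, on each node (just a sketch):

# reload the daemon on every node to force a new leader election
sudo systemctl reload snap.lxd.daemon
# give the cluster one heartbeat round, then check
sleep 20
lxc cluster list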
Nope, still offline.
What is the best thing to do now?
I can try to move the containers to node3 and reinstall node1 (if required to fix this).
Nah, that really shouldn't be required, there's fundamentally nothing wrong with that node other than some numbers being mixed up for some reason. We just need to get those to line up again and things will work fine then.
Can you once again show:
- lxd sql global "SELECT * FROM nodes;" (on any one of the nodes)
- lxd sql local "SELECT * FROM raft_nodes;" (on every one of the nodes)
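To make any mismatch easy to spot, you can print just the ids and addresses from both tables on each node (a rough sketch; the address column is what to match on):

# ids in the global nodes table should line up with the local raft_nodes ids
lxd sql global "SELECT id, address FROM nodes ORDER BY address;"
lxd sql local "SELECT id, address FROM raft_nodes ORDER BY address;"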
lxd-node3:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:48:29.627169572+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:48:29.627000301+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+
lxd-node2:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:49:44.337588653+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:49:44.337891201+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+
lxd-node1:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:51:58.090593895+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:51:58.090794261+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+
On every node, run:
- lxd sql local "UPDATE raft_nodes SET id=4 WHERE id=5;"
- lxd sql local "UPDATE raft_nodes SET id=5 WHERE id=6;"
Then wait another 20s or so and see if things improve. This should make it so that the next round of heartbeats will include the proper IDs for all the database nodes; they will then all update their raft_nodes table to match and things should look happier.
(Doing this fix just on the leader should be sufficient but it's easier to patch all 3 than figure out which is the leader :))
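As a single block per node, with a check at the end (a sketch of the same steps):

# remap the local raft ids so they line up with the global nodes table
lxd sql local "UPDATE raft_nodes SET id=4 WHERE id=5;"
lxd sql local "UPDATE raft_nodes SET id=5 WHERE id=6;"
# give the heartbeats a round to propagate, then confirm
sleep 20
lxc cluster list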