Cluster node appears offline after upgrade to 3.15

Aha,
lxd-node1 | https://192.168.1.1:8443 | YES | OFFLINE | no heartbeat since 5h18m27.10067886s

Did the systemctl reload snap.lxd.daemon on that node fix the issue?
If not, you’ll want to make sure that the revision shown by snap list matches that of the other nodes. If it does, then look at:

  • cat /var/snap/lxd/common/lxd/logs/lxd.log
  • journalctl -u snap.lxd.daemon -n 300

Hopefully that will give us some idea of what may be going on there.
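
If you just want to compare the snap revisions quickly and have SSH access between the machines, a one-liner like this should do (the hostnames are only placeholders for your actual nodes):

  • for h in lxd-node1 lxd-node2 lxd-node3; do ssh "$h" snap list lxd; done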

lxd-node1 | https://192.168.1.1:8443 | YES | OFFLINE | no heartbeat since 5h18m27.10067886s
Name                 Version    Rev    Tracking  Publisher   Notes
canonical-livepatch  9.4.1      81     stable    canonical✓  -
core                 16-2.39.3  7270   stable    canonical✓  core
lxd                  3.15       11254  stable    canonical✓  -

lxd-node2 | https://192.168.1.2:8443 | YES | ONLINE | fully operational
Name                 Version    Rev    Tracking  Publisher   Notes
canonical-livepatch  9.4.1      81     stable    canonical✓  -
core                 16-2.39.3  7270   stable    canonical✓  core
lxd                  3.15       11254  stable    canonical✓  -

I don’t see any details or errors in the logging. A reboot doesn’t help either; the node stays offline. It’s only 1 node out of the 3, and the machine is clean without any extras (only the firewall).

t=2019-07-18T07:17:18+0200 lvl=warn msg="Excluding offline node from refresh: {ID:5 Address:192.168.1.2:8443 Raft:true LastHeartbeat:2019-07-18 01:38:30.60775629 +0200 CEST Online:false updated:false}"
t=2019-07-18T07:17:18+0200 lvl=warn msg="Excluding offline node from refresh: {ID:6 Address:192.168.1.1:8443 Raft:true LastHeartbeat:0001-01-01 00:00:00 +0000 UTC Online:false updated:false}"

Ok, can you do:

  • sudo systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket
  • sudo rm /var/snap/lxd/common/lxd/unix.socket
  • sudo lxd --debug --group lxd

That should give a few more hints as to what’s going on.
In any case, we should be able to blow the global database away and let it sync from node2, but it’d be better to know what’s going on before taking such drastic measures 🙂
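
If you want to keep that debug output around for later, you could pipe it to a file while it runs, e.g. (the path is just an example):

  • sudo lxd --debug --group lxd 2>&1 | tee /tmp/lxd-debug.log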


Logging from lxd-node1 (192.168.1.1):
https://pastebin.com/6iBvvJcP

I think something went wrong in the DB on this machine. Everything works fine when running lxc list from node1; the node just shows up as ‘offline’ in the cluster list on all nodes.

How can I fix the db?

The fact that LXD started properly makes it pretty unlikely that it’s just a bad database node, so I’d be pretty careful about just forcing a re-sync of the database at this point.

Can you show for every single node:

  • lxd sql local "SELECT * FROM raft_nodes;"
  • lxd sql global "SELECT * FROM nodes;"
  • lxc cluster list

This should make any database misconfiguration somewhat obvious, then we can take things from there.
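
If it’s easier, a small loop like this (assuming SSH access and that the hostnames below match your machines) collects the per-node output in one go; lxc cluster list only needs to be run once from any node:

  for h in lxd-node1 lxd-node2 lxd-node3; do
      echo "== $h =="
      ssh "$h" 'lxd sql local "SELECT * FROM raft_nodes;"'
      ssh "$h" 'lxd sql global "SELECT * FROM nodes;"'
  done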

For the node you were manually debugging, you’ll want to ctrl+c that command and then start LXD back up with:

  • systemctl start snap.lxd.daemon

Output from nodes:
https://pastebin.com/Tcdig72A

Ok, so your database actually looks fine, the issue appears to be that 192.168.1.1:8443 isn’t responding to the heartbeats from the cluster leader (currently 192.168.1.3:8443).

Can you run lxc monitor --pretty --type=logging on 192.168.1.3 for at least 20s and then give me the output?

Capturing that output for 20s should guarantee we see a full heartbeat cycle in there and can hopefully see why the leader considers 192.168.1.1:8443 to be unresponsive.
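
One way to capture that without having to stop it by hand (the 30s and the file path are just examples) would be:

  • timeout 30 lxc monitor --pretty --type=logging > /tmp/monitor.log 2>&1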

Output: https://pastebin.com/p75zs9v5

Wow, that’s very odd, as that shows a successful heartbeat to 192.168.1.1:8443 but it’s still considered to be offline. Let me check the past output again; I must have missed something obvious in there.


And indeed I did: the global and local node IDs don’t match up for some reason, so it’s failing to update the right database record.


Quick workaround to get things lined up again would be:

  • lxd sql global "UPDATE nodes SET id=6 WHERE address='192.168.1.1:8443'"
  • lxd sql global "UPDATE nodes SET id=5 WHERE address='192.168.1.2:8443'"

Then give it 20s for the heartbeats to do their magic and things should report as being fine again.
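
To see whether it took effect, you can watch the heartbeat column for 192.168.1.1:8443 start moving forward again, e.g. by re-running:

  • lxd sql global "SELECT * FROM nodes;"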

😦
+-----------+--------------------------+----------+---------+----------------------------------------+
| lxd-node1 | https://192.168.1.1:8443 | YES      | OFFLINE | no heartbeat since 6h42m41.505302667s  |
+-----------+--------------------------+----------+---------+----------------------------------------+
| lxd-node2 | https://192.168.1.2:8443 | YES      | ONLINE  | fully operational                      |
+-----------+--------------------------+----------+---------+----------------------------------------+
| lxd-node3 | https://192.168.1.3:8443 | YES      | ONLINE  | fully operational                      |
+-----------+--------------------------+----------+---------+----------------------------------------+

On all nodes:
user@lxd-node2:~$ lxd sql global "UPDATE nodes SET id=5 WHERE address='192.168.1.2:8443'"
Rows affected: 0
user@lxd-node2:~$ lxd sql global "UPDATE nodes SET id=6 WHERE address='192.168.1.1:8443'"
Rows affected: 0

Can you show lxd sql global "SELECT * FROM nodes;" from any of the nodes?

lxd-node2:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:28:12.002876571+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:28:12.00301448+02:00  | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

lxd-node3:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:29:34.843975163+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:29:34.844244343+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

lxd-node1:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:30:27.436329805+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:30:27.436646546+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

Ok, this is getting a bit annoying as we can’t easily fix the global table because of foreign key constraints.

I’m still not sure why the leader isn’t sending the expected IDs though, as the heartbeats should cause all the other nodes to override their local raft_nodes table with the IDs matching the nodes table, fixing this issue.

Can you try running systemctl reload snap.lxd.daemon on all 3 nodes?
That will cause a new leader to get elected for sure, causing the raft_nodes table to get updated everywhere and hopefully get the two lined up again.

Once you’ve run it on all 3 nodes, wait 20s and check the cluster list again.
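
If you want to keep an eye on it without re-running the command by hand, something like watch works well enough:

  • watch -n 5 lxc cluster list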

Nope, still offline. :frowning:

What is the best thing to do now?
I can try to move the containers to node3 and reinstall node1 (if required to fix this).

Nah, that really shouldn’t be required; there’s fundamentally nothing wrong with that node other than some numbers being mixed up for some reason. We just need to get those to line up again and things will work fine then.

Can you once again show:

  • lxd sql global "SELECT * FROM nodes;" (on any one of the nodes)
  • lxd sql local "SELECT * FROM raft_nodes;" (on every one of the nodes)

lxd-node3:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:48:29.627169572+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:48:29.627000301+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+

lxd-node2:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:49:44.337588653+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:49:44.337891201+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+

lxd-node1:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:51:58.090593895+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:51:58.090794261+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+
On every node, run:

  • lxd sql local "UPDATE raft_nodes SET id=4 WHERE id=5;"
  • lxd sql local "UPDATE raft_nodes SET id=5 WHERE id=6;"

Then wait another 20s or so and see if things improve. This should make it so that the next round of heartbeats includes the proper IDs for all the database nodes; they will then all update their raft_nodes table to match and things should look happier.

(Doing this fix just on the leader should be sufficient but it’s easier to patch all 3 than figure out which is the leader :))
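
To confirm the two tables line up afterwards, the same queries from before can be re-run on any node, and lxd-node1 should flip back to ONLINE in lxc cluster list once the next heartbeat lands:

  • lxd sql local "SELECT * FROM raft_nodes;"
  • lxd sql global "SELECT * FROM nodes;"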