Cluster node appears offline after upgrade to 3.15

Aha,
lxd-node1 | https://192.168.1.1:8443 | YES | OFFLINE | no heartbeat since 5h18m27.10067886s

Did the systemctl reload snap.lxd.daemon on that node fix the issue?
If not, you’ll want to make sure that the revision shown by snap list matches that of the other nodes. If it does, then look at:

  • cat /var/snap/lxd/common/lxd/logs/lxd.log
  • journalctl -u snap.lxd.daemon -n 300

Hopefully that will give us some idea of what may be going on there.
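
If you just want to compare the snap revisions quickly and have SSH access between the machines, a one-liner like this should do (the hostnames are only placeholders for your actual nodes):

  • for h in lxd-node1 lxd-node2 lxd-node3; do ssh "$h" snap list lxd; done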

lxd-node1 | https://192.168.1.1:8443 | YES | OFFLINE | no heartbeat since 5h18m27.10067886s
Name                 Version    Rev    Tracking  Publisher   Notes
canonical-livepatch  9.4.1      81     stable    canonical✓  -
core                 16-2.39.3  7270   stable    canonical✓  core
lxd                  3.15       11254  stable    canonical✓  -

lxd-node2 | https://192.168.1.2:8443 | YES | ONLINE | fully operational
Name                 Version    Rev    Tracking  Publisher   Notes
canonical-livepatch  9.4.1      81     stable    canonical✓  -
core                 16-2.39.3  7270   stable    canonical✓  core
lxd                  3.15       11254  stable    canonical✓  -

I don’t see any details or errors in the logging. A reboot doesn’t help either; the node stays offline. It’s only 1 node out of the 3, and the machine is clean without any extras (only the firewall).

t=2019-07-18T07:17:18+0200 lvl=warn msg="Excluding offline node from refresh: {ID:5 Address:192.168.1.2:8443 Raft:true LastHeartbeat:2019-07-18 01:38:30.60775629 +0200 CEST Online:false updated:false}"
t=2019-07-18T07:17:18+0200 lvl=warn msg="Excluding offline node from refresh: {ID:6 Address:192.168.1.1:8443 Raft:true LastHeartbeat:0001-01-01 00:00:00 +0000 UTC Online:false updated:false}"

Ok, can you do:

  • sudo systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket
  • sudo rm /var/snap/lxd/common/lxd/unix.socket
  • sudo lxd --debug --group lxd

That should give a few more hints as to what’s going on.
In any case, we should be able to blow the global database away and let it sync from node2, but it’d be better to know what’s going on before taking such drastic measures 🙂
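
If you want to keep that debug output around for later, you could pipe it to a file while it runs, e.g. (the path is just an example):

  • sudo lxd --debug --group lxd 2>&1 | tee /tmp/lxd-debug.log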


Logging from lxd-node1 (192.168.1.1):
https://pastebin.com/6iBvvJcP

I think something went wrong in the DB on this machine. Everything works fine when running lxc list from node1; the node just shows up as ‘offline’ in the cluster list on all nodes.

How can I fix the db?

The fact that LXD started properly makes it pretty unlikely that it’s just a bad database node, so I’d be pretty careful about just forcing a re-sync of the database at this point.

Can you show for every single node:

  • lxd sql local "SELECT * FROM raft_nodes;"
  • lxd sql global "SELECT * FROM nodes;"
  • lxc cluster list

This should make any database misconfiguration somewhat obvious, then we can take things from there.
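
If it’s easier, a small loop like this (assuming SSH access and that the hostnames below match your machines) collects the per-node output in one go; lxc cluster list only needs to be run once from any node:

  for h in lxd-node1 lxd-node2 lxd-node3; do
      echo "== $h =="
      ssh "$h" 'lxd sql local "SELECT * FROM raft_nodes;"'
      ssh "$h" 'lxd sql global "SELECT * FROM nodes;"'
  done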

For the node you were manually debugging, you’ll want to ctrl+c that command and then start LXD back up with:

  • systemctl start snap.lxd.daemon

Output from nodes:
https://pastebin.com/Tcdig72A

Ok, so your database actually looks fine, the issue appears to be that 192.168.1.1:8443 isn’t responding to the heartbeats from the cluster leader (currently 192.168.1.3:8443).

Can you run lxc monitor --pretty --type=logging on 192.168.1.3 for at least 20s and then give me the output?

Capturing that output for 20s should guarantee we see a full heartbeat cycle in there and can hopefully see why the leader considers 192.168.1.1:8443 to be unresponsive.
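
One way to capture that without having to stop it by hand (the 30s and the file path are just examples) would be:

  • timeout 30 lxc monitor --pretty --type=logging > /tmp/monitor.log 2>&1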

Output: https://pastebin.com/p75zs9v5

Wow, that’s very odd, as that shows a successful heartbeat to 192.168.1.1:8443 but it’s still considered to be offline. Let me check the past output again; I must have missed something obvious in there.


And indeed I did: the global and local node IDs don’t match up for some reason, so it’s failing to update the right database record.


Quick workaround to get things lined up again would be:

  • lxd sql global "UPDATE nodes SET id=6 WHERE address='192.168.1.1:8443'"
  • lxd sql global "UPDATE nodes SET id=5 WHERE address='192.168.1.2:8443'"

Then give it 20s for the heartbeats to do their magic and things should report as being fine again.
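
To see whether it took effect, you can watch the heartbeat column for 192.168.1.1:8443 start moving forward again, e.g. by re-running:

  • lxd sql global "SELECT * FROM nodes;"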

😦
+-----------+--------------------------+----------+---------+----------------------------------------+
| lxd-node1 | https://192.168.1.1:8443 | YES      | OFFLINE | no heartbeat since 6h42m41.505302667s  |
+-----------+--------------------------+----------+---------+----------------------------------------+
| lxd-node2 | https://192.168.1.2:8443 | YES      | ONLINE  | fully operational                      |
+-----------+--------------------------+----------+---------+----------------------------------------+
| lxd-node3 | https://192.168.1.3:8443 | YES      | ONLINE  | fully operational                      |
+-----------+--------------------------+----------+---------+----------------------------------------+

On all nodes:
user@lxd-node2:~$ lxd sql global "UPDATE nodes SET id=5 WHERE address='192.168.1.2:8443'"
Rows affected: 0
user@lxd-node2:~$ lxd sql global "UPDATE nodes SET id=6 WHERE address='192.168.1.1:8443'"
Rows affected: 0

Can you show lxd sql global "SELECT * FROM nodes;" from any of the nodes?

lxd-node2:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:28:12.002876571+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:28:12.00301448+02:00  | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

lxd-node3:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:29:34.843975163+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:29:34.844244343+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

lxd-node1:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:30:27.436329805+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:30:27.436646546+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

Ok, this is getting a bit annoying as we can’t easily fix the global table because of foreign key constraints.

I’m still not sure why the leader isn’t sending the expected IDs though, as the heartbeats should cause all the other nodes to override their local raft_nodes table with the IDs matching the nodes table, fixing this issue.

Can you try running systemctl reload snap.lxd.daemon on all 3 nodes?
That will cause a new leader to get elected for sure, causing the raft_nodes table to get updated everywhere and hopefully get the two lined up again.

Once you’ve run it on all 3 nodes, wait 20s and check the cluster list again.
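
If you want to keep an eye on it without re-running the command by hand, something like watch works well enough:

  • watch -n 5 lxc cluster list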

Nope, still offline. :frowning:

What is the best thing to do now?
I can try to move the containers to node3 and reinstall node1 (if required to fix this).

Nah, that really shouldn’t be required; there’s fundamentally nothing wrong with that node other than some numbers being mixed up for some reason. We just need to get those to line up again and things will work fine then.

Can you once again show:

  • lxd sql global "SELECT * FROM nodes;" (on any one of the nodes)
  • lxd sql local "SELECT * FROM raft_nodes;" (on every one of the nodes)

lxd-node3:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:48:29.627169572+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:48:29.627000301+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+

lxd-node2:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:49:44.337588653+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:49:44.337891201+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+

lxd-node1:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:51:58.090593895+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:51:58.090794261+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+
On every node, run:

  • lxd sql local "UPDATE raft_nodes SET id=4 WHERE id=5;"
  • lxd sql local "UPDATE raft_nodes SET id=5 WHERE id=6;"

Then wait another 20s or so and see if things improve. This should make it so that the next round of heartbeats includes the proper IDs for all the database nodes; they will then all update their raft_nodes table to match and things should look happier.

(Doing this fix just on the leader should be sufficient but it’s easier to patch all 3 than figure out which is the leader :))
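
To confirm the two tables line up afterwards, the same queries from before can be re-run on any node, and lxd-node1 should flip back to ONLINE in lxc cluster list once the next heartbeat lands:

  • lxd sql local "SELECT * FROM raft_nodes;"
  • lxd sql global "SELECT * FROM nodes;"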