Cluster node appears offline after upgrade to 3.15

Can you show lxd sql global "SELECT * FROM nodes;" from any of the nodes?

lxd-node2:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:28:12.002876571+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:28:12.00301448+02:00  | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

lxd-node3:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:29:34.843975163+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:29:34.844244343+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

lxd-node1:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:30:27.436329805+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:30:27.436646546+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

Ok, this is getting a bit annoying as we can’t easily fix the global table because of foreign key constraints.

I’m still not sure why the leader isn’t sending the expected IDs though, as that should cause all the other nodes to override their local raft_nodes table with IDs matching the nodes table, fixing this issue.

Can you try running systemctl reload snap.lxd.daemon on all 3 nodes?
That will cause a new leader to get elected for sure, causing the raft_nodes table to get updated everywhere and hopefully getting the two tables (nodes and raft_nodes) lined up again.

Once you’ve run it on all 3 nodes, wait 20s and check the cluster list again.
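Roughly, the full sequence on each node would be something like the following (a sketch; the sleep and the explicit lxc cluster list call are just one way to do the waiting and checking described above):

    # Reload the daemon, which forces a new leader election
    systemctl reload snap.lxd.daemon

    # Give the new leader's heartbeats time to go around
    sleep 20

    # Check the cluster state again
    lxc cluster list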

Nope, still offline. :(

What is the best thing to do now?
I can try to move the containers to node3 and reinstall node1 (if required to fix this).

Nah, that really shouldn’t be required; there’s fundamentally nothing wrong with that node other than some IDs being mixed up for some reason. We just need to get those to line up again and things will work fine then.

Can you once again show:

  • lxd sql global "SELECT * FROM nodes;" (on any one of the nodes)
  • lxd sql local "SELECT * FROM raft_nodes;" (on every one of the nodes)

lxd-node3:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:48:29.627169572+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:48:29.627000301+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+

lxd-node2:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:49:44.337588653+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:49:44.337891201+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+

lxd-node1:
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| id | name      | description | address          | schema | api_extensions | heartbeat                           | pending |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+
| 3  | lxd-node3 |             | 192.168.1.3:8443 | 14     | 138            | 2019-07-18T08:51:58.090593895+02:00 | 0       |
| 4  | lxd-node2 |             | 192.168.1.2:8443 | 14     | 138            | 2019-07-18T08:51:58.090794261+02:00 | 0       |
| 5  | lxd-node1 |             | 192.168.1.1:8443 | 14     | 138            | 2019-07-18T01:38:30.60775629+02:00  | 0       |
+----+-----------+-------------+------------------+--------+----------------+-------------------------------------+---------+

+----+------------------+
| id | address          |
+----+------------------+
| 3  | 192.168.1.3:8443 |
| 5  | 192.168.1.2:8443 |
| 6  | 192.168.1.1:8443 |
+----+------------------+

On every node, run:

  • lxd sql local "UPDATE raft_nodes SET id=4 WHERE id=5;"
  • lxd sql local "UPDATE raft_nodes SET id=5 WHERE id=6;"

Then wait another 20s or so and see if things improve. This should make it so that the next round of heartbeats includes the proper IDs for all the database nodes; they will then all update their raft_nodes table to match and things should look happier (a consolidated sketch follows below).

(Doing this fix just on the leader should be sufficient but it’s easier to patch all 3 than figure out which is the leader :))
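For completeness, a full pass on each node would look something like this (a sketch; the two queries at the end are just the usual verification commands from earlier):

    # Re-map the raft IDs so they match the ids in the global nodes table
    lxd sql local "UPDATE raft_nodes SET id=4 WHERE id=5;"
    lxd sql local "UPDATE raft_nodes SET id=5 WHERE id=6;"

    # After ~20s, both tables should agree on ids 3, 4 and 5
    lxd sql local "SELECT * FROM raft_nodes;"
    lxd sql global "SELECT * FROM nodes;"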

It seems to work again! Thanks

+-----------+--------------------------+----------+--------+-------------------+
| NAME      | URL                      | DATABASE | STATE  | MESSAGE           |
+-----------+--------------------------+----------+--------+-------------------+
| lxd-node1 | https://192.168.1.1:8443 | YES      | ONLINE | fully operational |
+-----------+--------------------------+----------+--------+-------------------+
| lxd-node2 | https://192.168.1.2:8443 | NO       | ONLINE | fully operational |
+-----------+--------------------------+----------+--------+-------------------+
| lxd-node3 | https://192.168.1.3:8443 | YES      | ONLINE | fully operational |
+-----------+--------------------------+----------+--------+-------------------+

It is still strange. I haven’t touched the cluster for a long time.

Wait.
https://192.168.1.1:8443 | YES | OFFLINE | no heartbeat since 2m20.939295585s |

It’s automatically changing it?

Can you do the usual round of:

  • lxd sql global "SELECT * FROM nodes;" (on any one of the nodes)
  • lxd sql local "SELECT * FROM raft_nodes;" (on every one of the nodes)

I’d like to see if it reverted to the same value as earlier.

Yes, same as before.
https://discuss.linuxcontainers.org/t/lxd-3-15-has-been-released/5218/26?u=tomvb

Ah, right, I see what’s going on: we can’t live-update that table because the leader effectively ignores it and relies on its live version of the raft state instead.

So we’ll need to use a startup-time DB patch instead to get this sorted once and for all.

On all 3 nodes, create a file at /var/snap/lxd/common/lxd/database/patch.local.sql containing:

UPDATE raft_nodes SET id=4 WHERE id=5;
UPDATE raft_nodes SET id=5 WHERE id=6;

Once the file is ready on all 3 nodes, run systemctl reload snap.lxd.daemon on all 3 nodes, in very quick succession (so one that hasn’t restarted yet won’t have the time to send bad data to the ones that have been updated).

They should then all come back up online with a sane raft_nodes table, go through leader election and then use the proper IDs for nodes moving forward.
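Put together, each node would get something like this (a sketch; the printf is just one indentation-safe way of writing the exact file contents shown above):

    # Write the one-shot patch that LXD applies (and then deletes) at startup
    printf '%s\n' \
      'UPDATE raft_nodes SET id=4 WHERE id=5;' \
      'UPDATE raft_nodes SET id=5 WHERE id=6;' \
      > /var/snap/lxd/common/lxd/database/patch.local.sql

    # Then, on all 3 nodes in very quick succession:
    systemctl reload snap.lxd.daemon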

@TomvB how did that go?

I’ve moved this thread from LXD 3.15 has been released into its own topic to keep things easier to search.

Nope. I tried it 3 times. It’s reverting the changes.

Can you confirm that the patch.local.sql file disappeared every time?

Anyway, let’s try to avoid that race entirely (a consolidated sketch follows the steps below):

  • Run kill $(cat /var/snap/lxd/common/lxd.pid) on all 3 nodes
  • Confirm that LXD is offline after that with: ps aux | grep lxd.*logfile
  • Write the patch file again
  • Start LXD back up by running: lxc info

Note that you’ll need to run lxc info on at least two nodes before it will respond, as it needs to get quorum for the database.

As all LXD daemons will be offline this time around, this should avoid the in-memory list of nodes messing things up for us.
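As a per-node summary of the steps above (a sketch; the commands are the ones listed, with the grep pattern quoted so the shell doesn’t expand it):

    # Stop the LXD daemon
    kill $(cat /var/snap/lxd/common/lxd.pid)

    # Confirm LXD is no longer running
    ps aux | grep "lxd.*logfile"

    # Re-create the startup patch file (same contents as before)
    printf '%s\n' \
      'UPDATE raft_nodes SET id=4 WHERE id=5;' \
      'UPDATE raft_nodes SET id=5 WHERE id=6;' \
      > /var/snap/lxd/common/lxd/database/patch.local.sql

    # Start LXD back up (run on at least two nodes before it will respond)
    lxc info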

(I really wonder how the two tables got out of sync in the first place; if we see others hitting this issue, we’ll need to find some automatic way of recovering from it.)

Can you confirm that the patch.local.sql file disappeared every time?
Yes

  • Run kill $(cat /var/snap/lxd/common/lxd.pid) on all 3 nodes

Done

  • Confirm that LXD is offline after that with: ps aux | grep lxd.*logfile

Done

  • Write the patch file again

Done

  • Start LXD back up by running: lxc info

That hangs, so I did it with systemctl start snap.lxd.daemon instead (executed simultaneously on all 3 nodes).

Result after a few seconds:
+-----------+--------------------------+----------+---------+------------------------------------+
| NAME      | URL                      | DATABASE | STATE   | MESSAGE                            |
+-----------+--------------------------+----------+---------+------------------------------------+
| lxd-node1 | https://192.168.1.1:8443 | YES      | OFFLINE | no heartbeat since 2m19.402855264s |
+-----------+--------------------------+----------+---------+------------------------------------+
| lxd-node2 | https://192.168.1.2:8443 | YES      | ONLINE  | fully operational                  |
+-----------+--------------------------+----------+---------+------------------------------------+
| lxd-node3 | https://192.168.1.3:8443 | YES      | ONLINE  | fully operational                  |
+-----------+--------------------------+----------+---------+------------------------------------+

Same issue.