How to fix only having two database nodes in cluster

mt-caret · December 24, 2019, 6:59am

I’m currently administering an 11-node lxd cluster (Ubuntu 18.04 with LXD 3.18),
and I get the following when I run lxc cluster ls:

$ lxc cluster ls
+---------+--------------------------+----------+--------+-------------------+
|  NAME   |           URL            | DATABASE | STATE  |      MESSAGE      |
+---------+--------------------------+----------+--------+-------------------+
| chino   | https://172.16.0.6:8443  | NO       | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| cocoa   | https://172.16.0.7:8443  | NO       | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| hitagi  | https://172.16.0.13:8443 | YES      | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| mayoi   | https://172.16.0.16:8443 | NO       | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| nadeko  | https://172.16.0.18:8443 | NO       | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| rize    | https://172.16.0.8:8443  | NO       | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| shinobu | https://172.16.0.15:8443 | NO       | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| suruga  | https://172.16.0.17:8443 | NO       | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| tippy   | https://172.16.0.5:8443  | NO       | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| tsubasa | https://172.16.0.14:8443 | YES      | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| tsukihi | https://172.16.0.20:8443 | NO       | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+

From what I gather from https://discuss.linuxcontainers.org/t/when-turning-off-the-first-lxd-cluster-node-no-available-dqlite-leader-server-found/4084,
I’m assuming it’s not intended behavior for only
two database nodes to be present in a cluster.
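
On a cluster this size, checking the DATABASE column by eye gets tedious. Below is a minimal Python sketch that parses the table printed by lxc cluster ls and lists the database nodes. The function names are hypothetical, and the expected count of three is an assumption taken from this thread (the "third database node"), not something read from LXD itself.

```python
# Assumption from this thread: a healthy cluster should show three database nodes.
EXPECTED_DATABASE_NODES = 3


def parse_cluster_table(text):
    """Parse the ASCII table printed by `lxc cluster ls` into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines (+---+), the prompt line, and blanks
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def database_nodes(rows):
    """Return the names of the nodes whose DATABASE column says YES."""
    return [row["NAME"] for row in rows if row.get("DATABASE") == "YES"]


# Shortened excerpt of the output shown above.
SAMPLE = """
+---------+--------------------------+----------+--------+-------------------+
|  NAME   |           URL            | DATABASE | STATE  |      MESSAGE      |
+---------+--------------------------+----------+--------+-------------------+
| chino   | https://172.16.0.6:8443  | NO       | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| hitagi  | https://172.16.0.13:8443 | YES      | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
| tsubasa | https://172.16.0.14:8443 | YES      | ONLINE | fully operational |
+---------+--------------------------+----------+--------+-------------------+
"""

if __name__ == "__main__":
    names = database_nodes(parse_cluster_table(SAMPLE))
    print("database nodes:", names)
    if len(names) < EXPECTED_DATABASE_NODES:
        print("fewer database nodes than expected")
```

To run it against a live cluster, the output of lxc cluster ls could be piped into a file and passed to parse_cluster_table in place of SAMPLE.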

As reported in the linked post, turning off a database node makes lxc commands
unresponsive on all nodes, with similar error messages.
I can’t recall the exact sequence in which I set up the cluster, but
I suspect it is related to the (planned) third node hanging when I tried to turn it into
a cluster node via lxd init --preseed.

The hung node was unresponsive to all the usual attempts to fix it
(stopping lxd, uninstalling it via snap), so we eventually reinstalled Ubuntu from scratch.
I suspect the hang happens because an empty host in core.https_address
(i.e. lxc config set core.https_address :8443 rather than lxc config set core.https_address 172.16.0.14:8443)
is valid for non-cluster nodes but invalid for cluster nodes, and
running lxd init --preseed with a cluster config on a node configured that way hangs lxd.
I haven’t gotten around to reproducing this, so I’ll update this post once I do.

In the meantime, I would love to be able to restore the third database node; can anybody help?

freeekanayaka · December 24, 2019, 10:06pm

If you are lucky, adding a new node to the cluster (or removing and re-adding an existing one) might turn it into a new database node. If not, it means that the hung node you mention got committed in the raft log and would probably need to be removed by hand.

What’s the output of:

lxd sql local "SELECT * FROM raft_nodes"

?

I’m currently on holiday, so I don’t think I’ll be able to provide much help for a while.

mt-caret · December 25, 2019, 1:47am

Thanks for the response!

Ok, so I took a look at the outputs for lxd sql local "SELECT * FROM raft_nodes" on all nodes, and on all nodes but one, I got:

+----+------------------+
| id |     address      |
+----+------------------+
| 1  | 172.16.0.13:8443 |
| 2  | 172.16.0.14:8443 |
+----+------------------+

but on one (incidentally, the node with IP 172.16.0.14), I got:

+----+------------------+
| id |     address      |
+----+------------------+
| 1  | 172.16.0.13:8443 |
| 2  | 172.16.0.14:8443 |
| 3  | :8443            |
+----+------------------+

I’m guessing this is the culprit.
I stopped lxd, ran lxd sql local "DELETE FROM raft_nodes WHERE id=3"
and started lxd to get rid of it.
Is there a way to manually add a node as a database node,
or should it recover on its own given some time?
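
Comparing the raft_nodes output from eleven nodes by hand is error-prone, so here is a minimal Python sketch of the same check. It assumes the rows from each node have already been collected into a dict (the collection step is not shown, and the function name is hypothetical). It flags entries whose address has an empty host, like :8443, and entries that are not present on every node.

```python
def suspicious_raft_entries(tables):
    """Flag odd raft_nodes rows.

    `tables` maps a node name to the list of (id, address) rows that node
    returned for: lxd sql local "SELECT * FROM raft_nodes"
    """
    if not tables:
        return []
    # Rows that every node agrees on.
    common = set.intersection(*(set(rows) for rows in tables.values()))
    findings = []
    for node, rows in sorted(tables.items()):
        for entry in rows:
            raft_id, address = entry
            host = address.rpartition(":")[0]  # text before the last colon
            if not host:
                findings.append((node, raft_id, address, "empty host"))
            elif entry not in common:
                findings.append((node, raft_id, address, "not present on every node"))
    return findings


if __name__ == "__main__":
    # Data from the two outputs above; node names taken from `lxc cluster ls`.
    tables = {
        "hitagi": [(1, "172.16.0.13:8443"), (2, "172.16.0.14:8443")],
        "tsubasa": [(1, "172.16.0.13:8443"), (2, "172.16.0.14:8443"), (3, ":8443")],
    }
    for finding in suspicious_raft_entries(tables):
        print(finding)
```

This only reports candidates; deciding whether to delete a row is still a manual step.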

freeekanayaka · December 25, 2019, 7:30pm

There’s currently no way to force a node to be a database node. We’re going to improve this area in the near future tho.

It’s very weird that one node has 3 entries and the others have 2. You can try deleting that entry on that node, then add a new node to the cluster and see if it becomes a database node.

TomvB · December 28, 2019, 8:51pm

Today I did a quick check of my LXD cluster:
lxc cluster list
+-------+----------------------------+----------+--------+-------------------+
| NAME  |            URL             | DATABASE | STATE  |      MESSAGE      |
+-------+----------------------------+----------+--------+-------------------+
| lxd-1 | https://192.168.100.1:8443 | NO       | ONLINE | fully operational |
+-------+----------------------------+----------+--------+-------------------+
| lxd-2 | https://192.168.100.2:8443 | NO       | ONLINE | fully operational |
+-------+----------------------------+----------+--------+-------------------+
| lxd-3 | https://192.168.100.3:8443 | NO       | ONLINE | fully operational |
+-------+----------------------------+----------+--------+-------------------+

And there is no database server at all. @freeekanayaka

Update:
I’m going to stop using LXD clustering and go back to standalone nodes. I’ve had too many problems with LXD clusters in production: I don’t touch the servers, and the cluster still breaks far too often without any change.

mt-caret · March 18, 2020, 9:35am

Today, I got around to removing a non-database node, reinstalling it from scratch, and re-adding it to the cluster, which fixed the problem. Thanks for the help!