Hangs when joining 3rd lxd node to 2 node cluster

When running "sudo lxd init" with the snap version of LXD 4.0.1, using the settings below:

ubuntu@cn-svr-man-03 ~> sudo lxd init
Would you like to use LXD clustering? (yes/no) [default=no]: yes
What name should be used to identify this node in the cluster? [default=cn-svr-man-03]: 
What IP address or DNS name should be used to reach this node? [default=2a02:8010:6538:112:60af:98ff:fe22:ec95]: 172.28.185.42
Are you joining an existing cluster? (yes/no) [default=no]: yes
IP address or FQDN of an existing cluster node: 172.28.185.40
Cluster fingerprint: 3ba8cdfbe591d79423bd55135316974db6cf5c764071ad9d85abc1a91113306c
You can validate this fingerprint by running "lxc info" locally on an existing node.
Is this the correct fingerprint? (yes/no) [default=no]: yes
Cluster trust password: 
All existing data is lost when joining a cluster, continue? (yes/no) [default=no] yes
Choose "source" property for storage pool "local": rpool/lxd
Choose "zfs.pool_name" property for storage pool "local": 
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]: 

It hangs indefinitely. The output of the log from the joining node is below.

ubuntu@cn-svr-man-03 ~> sudo cat /var/snap/lxd/common/lxd/logs/lxd.log
t=2020-05-02T15:14:42+0100 lvl=info msg="LXD 4.0.1 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-05-02T15:14:42+0100 lvl=info msg="Kernel uid/gid map:"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - u 0 0 4294967295"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - g 0 0 4294967295"
t=2020-05-02T15:14:42+0100 lvl=info msg="Configured LXD uid/gid map:"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - u 0 1000000 1000000000"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - g 0 1000000 1000000000"
t=2020-05-02T15:14:42+0100 lvl=info msg="Kernel features:"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - netnsid-based network retrieval: yes"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - uevent injection: yes"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - seccomp listener: yes"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - seccomp listener continue syscalls: yes"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - unprivileged file capabilities: yes"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - cgroup layout: hybrid"
t=2020-05-02T15:14:42+0100 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored"
t=2020-05-02T15:14:42+0100 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - shiftfs support: disabled"
t=2020-05-02T15:14:42+0100 lvl=info msg="Initializing local database"
t=2020-05-02T15:14:42+0100 lvl=info msg="Starting /dev/lxd handler:"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-05-02T15:14:42+0100 lvl=info msg="REST API daemon:"
t=2020-05-02T15:14:42+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-05-02T15:14:42+0100 lvl=info msg="Initializing global database"
t=2020-05-02T15:14:45+0100 lvl=eror msg="Firewall nftables unable to parse existing ruleset: invalid character '^' looking for beginning of value"
t=2020-05-02T15:14:45+0100 lvl=info msg="Firewall loaded driver \"xtables\""
t=2020-05-02T15:14:45+0100 lvl=info msg="Initializing storage pools"
t=2020-05-02T15:14:45+0100 lvl=info msg="Initializing daemon storage mounts"
t=2020-05-02T15:14:45+0100 lvl=info msg="Initializing networks"
t=2020-05-02T15:14:45+0100 lvl=info msg="Pruning leftover image files"
t=2020-05-02T15:14:45+0100 lvl=info msg="Done pruning leftover image files"
t=2020-05-02T15:14:45+0100 lvl=info msg="Loading daemon configuration"
t=2020-05-02T15:14:45+0100 lvl=info msg="Started seccomp handler" path=/var/snap/lxd/common/lxd/seccomp.socket
t=2020-05-02T15:14:45+0100 lvl=info msg="Pruning expired images"
t=2020-05-02T15:14:45+0100 lvl=info msg="Done pruning expired images"
t=2020-05-02T15:14:45+0100 lvl=info msg="Pruning expired instance backups"
t=2020-05-02T15:14:45+0100 lvl=info msg="Done pruning expired instance backups"
t=2020-05-02T15:14:45+0100 lvl=info msg="Updating images"
t=2020-05-02T15:14:45+0100 lvl=info msg="Done updating images"
t=2020-05-02T15:14:45+0100 lvl=info msg="Expiring log files"
t=2020-05-02T15:14:45+0100 lvl=info msg="Done expiring log files"
t=2020-05-02T15:14:45+0100 lvl=info msg="Updating instance types"
t=2020-05-02T15:14:45+0100 lvl=info msg="Done updating instance types"
t=2020-05-02T15:15:23+0100 lvl=info msg="Update network address"
t=2020-05-02T15:15:23+0100 lvl=info msg=" - binding TCP socket" socket=172.28.185.42:8443
t=2020-05-02T15:15:26+0100 lvl=eror msg="Failed to get leader node address: Node is not clustered"
t=2020-05-02T15:15:26+0100 lvl=info msg="Joining dqlite raft cluster" address=172.28.185.42:8443 id=4 role=voter
t=2020-05-02T15:15:26+0100 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 172.28.185.42:60556->172.28.185.40:8443: use of closed network connection"
t=2020-05-02T15:15:26+0100 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 172.28.185.42:43940->172.28.185.41:8443: use of closed network connection"
t=2020-05-02T15:15:45+0100 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=172.28.185.40:8443 attempt=0"
t=2020-05-02T15:15:45+0100 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=172.28.185.42:8443 attempt=0"
t=2020-05-02T15:15:45+0100 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=172.28.185.40:8443 attempt=1"
t=2020-05-02T15:15:46+0100 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=172.28.185.42:8443 attempt=1"
t=2020-05-02T15:15:46+0100 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=172.28.185.40:8443 attempt=2"
t=2020-05-02T15:15:46+0100 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=172.28.185.42:8443 attempt=2"
t=2020-05-02T15:15:47+0100 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=172.28.185.40:8443 attempt=3"
t=2020-05-02T15:15:47+0100 lvl=warn msg="Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=172.28.185.42:8443 attempt=3"

Both nodes have port 8443 open on their IPv4 addresses and can ping each other, so I'm not sure why they are failing to communicate.

The only explanation I can think of is that neither 172.28.185.40 (one of your existing nodes) nor 172.28.185.42 is the leader, and 172.28.185.41 (which I presume is your other existing node) is not being considered as a possible leader because it should still be a stand-by node.

Assuming that’s the explanation (again, I’m not sure it is), the question is how you ended up there, since it shouldn’t be possible: at the moment you join the 3rd node there should be just one leader, which is the initial cluster node.

In any case, I’ve added some fixes to LXD and dqlite that should handle this situation as well, hopefully that will help in the future (they’ll be available in 4.0.2 and 4.1).

For your situation, I recommend destroying the cluster and starting over from scratch, taking note of all the commands you run from the very start. I’d be surprised if it happened again, but in that case at least we’d have more details.
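If you do start over, the interactive answers from the transcript at the top can be captured in a preseed, which makes the join reproducible and leaves a record of exactly what was run. A hedged sketch (the certificate and password values are placeholders, not real values; everything else is taken from the transcript above):

```shell
# Sketch of a non-interactive cluster join on the new node, feeding a
# preseed to "lxd init". The certificate block and trust password below
# are placeholders and must be replaced with your cluster's real values.
cat <<EOF | sudo lxd init --preseed
cluster:
  enabled: true
  server_name: cn-svr-man-03
  server_address: 172.28.185.42:8443
  cluster_address: 172.28.185.40:8443
  cluster_certificate: |
    -----BEGIN CERTIFICATE-----
    (paste the cluster certificate here)
    -----END CERTIFICATE-----
  cluster_password: (cluster trust password here)
  member_config:
  - entity: storage-pool
    name: local
    key: source
    value: rpool/lxd
EOF
```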

If starting from scratch is not an option, let me know and we’ll see if it’s possible to recover your cluster.

I would like to try and recover the cluster, as I have 18 containers currently configured and would like to avoid importing them into a new cluster. If waiting until 4.0.2 will possibly fix this then I can wait. Do you know when it’s likely to be released?

First we need to check that your cluster didn’t get into some inconsistent state. What does:

lxd sql local "SELECT * FROM raft_nodes"

return on the two nodes that currently belong to the cluster?

172.28.185.40
lxd sql local "SELECT * FROM raft_nodes"
+----+--------------------+------+
| id | address            | role |
+----+--------------------+------+
| 3  | 172.28.185.40:8443 | 0    |
| 4  | 172.28.185.41:8443 | 0    |
+----+--------------------+------+

172.28.185.41
lxd sql local "SELECT * FROM raft_nodes;"
+----+--------------------+------+
| id | address            | role |
+----+--------------------+------+
| 3  | 172.28.185.40:8443 | 0    |
| 4  | 172.28.185.41:8443 | 0    |
+----+--------------------+------+

This seems fine.

What about:

lxd sql global "SELECT * FROM nodes"

?

If it shows a spurious entry for the third node (172.28.185.42), you could try to clean it up with:

lxd sql global "DELETE FROM nodes WHERE address = '172.28.185.42'"

and then just attempt to join the third node (172.28.185.42) again and see how it goes.
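Putting the check and cleanup together, the sequence might look like this (a sketch; note the exact WHERE value is an assumption — the `nodes` table stores the address with the port, so copy the address exactly as the SELECT prints it):

```shell
# On an existing cluster member: inspect the nodes table first, and only
# run the DELETE if a stale row for the third node actually appears.
lxd sql global "SELECT * FROM nodes"

# The address value here is an assumption; match it to what the SELECT
# shows (it may include the :8443 port suffix).
lxd sql global "DELETE FROM nodes WHERE address = '172.28.185.42:8443'"

# Then, on the third node, retry the interactive join:
sudo lxd init
```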

Thanks for your help with this.

lxd sql global "SELECT * FROM nodes"
+----+---------------+-------------+--------------------+--------+----------------+-------------------------------------+---------+------+
| id | name          | description | address            | schema | api_extensions | heartbeat                           | pending | arch |
+----+---------------+-------------+--------------------+--------+----------------+-------------------------------------+---------+------+
| 2  | cn-svr-man-01 |             | 172.28.185.40:8443 | 28     | 185            | 2020-05-06T23:13:02.335024044+01:00 | 0       | 2    |
| 3  | cn-svr-man-02 |             | 172.28.185.41:8443 | 28     | 185            | 2020-05-06T23:13:02.335293124+01:00 | 0       | 2    |
+----+---------------+-------------+--------------------+--------+----------------+-------------------------------------+---------+------+

Ok, this looks fine too, you don’t have spurious entries for 172.28.185.42.

You can try to completely wipe 172.28.185.42 (e.g. with snap remove lxd; snap install lxd) and then attempt to join it to the cluster again.

The only unusual thing is that both nodes in your current cluster have the voter role, which usually isn’t the case for a 2-node cluster. It shouldn’t be a problem, though.

Sorry, I’ve run snap remove lxd; snap install lxd many times while troubleshooting this, and it always hangs during "sudo lxd init".

Okay, that means I might be able to reproduce the issue. I’ll do a few tests and get back to you.

It seems I can reproduce it. I will work on a fix.

Actually, I take it back, at least partially, as I’m not quite able to reproduce your exact situation. Could you please try the following:

  1. Remove the lxd snap from your third node, 172.28.185.42, which does not belong to the cluster

  2. Stop both of your current cluster nodes (172.28.185.40 and 172.28.185.41) using systemctl stop snap.lxd.daemon (and then make sure the daemon is no longer running, e.g. with ps aux | grep lxd)

  3. Make sure that debug logging is enabled on both of your cluster nodes: snap set lxd daemon.debug=true

  4. Start LXD again on those two nodes, using systemctl start snap.lxd.daemon

  5. Install the snap again on the third node, set debug mode on it (snap set lxd daemon.debug=true) and try to join again.

Assuming that joining fails, could you please send me the logs from all three nodes? You’ll find them under /var/snap/lxd/common/lxd/logs/lxd.log.
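The steps above can be sketched as a command sequence (a sketch, assuming the standard snap service name and paths from this thread):

```shell
# Step 1 -- on the third node (172.28.185.42): remove LXD entirely.
sudo snap remove lxd

# Steps 2-4 -- on each current cluster member (.40 and .41):
sudo systemctl stop snap.lxd.daemon
ps aux | grep lxd                     # verify the daemon really stopped
sudo snap set lxd daemon.debug=true   # enable debug logging
sudo systemctl start snap.lxd.daemon

# Step 5 -- back on the third node: reinstall, enable debug, retry the join.
sudo snap install lxd
sudo snap set lxd daemon.debug=true
sudo lxd init
```

If the join still hangs, the debug-level logs in /var/snap/lxd/common/lxd/logs/lxd.log on all three nodes are what to collect.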

The cluster join worked this time. The existing node 172.28.185.41 did hang on "sudo systemctl stop snap.lxd.daemon", and after leaving it for 10 minutes I ended up having to kill the PID before I could start it again.

How long does a cluster have to get quorum during startup? I originally had a 2-node cluster, which I now know is bad practice and is why I was trying to add a 3rd. Is it possible that if a 2-node cluster is not started at the same time then it’ll fail to get quorum? What happens when a single node is rebooted in a 2-node cluster?

Thanks
Mat

Normally in a 2-node cluster you get only one voting database node: you start a new cluster with a single node which will be your voting database node, then you add a second node which will not be a voting database node immediately. The second node is promoted to be a voting database node only when you add a third node, at that point you have 3 voting database nodes.
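Once the third node has joined, the promotion described above can be checked directly by inspecting the raft roles again (per the output earlier in the thread, role 0 corresponds to a voter):

```shell
# Run on any cluster member; with three members you'd expect three
# rows with role 0 (voter) once the promotion has happened.
lxd sql local "SELECT * FROM raft_nodes"
```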

I’m not sure what happened in your situation, probably something went wrong when you added the third node the first time.

Thanks for your help getting me up and running. I think I was rebooting each of the 2 nodes independently and then losing the cluster leader. The strange thing is I was still able to manage the containers with the 2-node cluster, so it looks like I was losing the cluster leader but not quorum.