I'm trying out the new clustering technology in LXD 3.01, per Stephane’s YouTube video (https://www.youtube.com/watch?v=RnBu7t2wD4U).
I installed Ubuntu 18.04 on 4 nodes - one master node and three compute nodes - all identical installs. On all nodes, I purged all the old LXD software (apt remove --purge lxd lxd-client liblxc1 lxcfs), then installed the snap version (snap install lxd).
On the master node, I ran the “lxd init” setup script and took most of the default values (local disk, standard networking, etc). I then added the three compute nodes in a similar way. In the end, all 4 nodes appear in the output of the “lxc cluster list” command.
However, on two of the four nodes (specifically compute-2 and compute-3), all the cluster members show as OFFLINE when I run the “lxc cluster list” command. The nodes are definitely online and running some containers.
Example output on Master node:
root@LXD-3-Manager:~# lxc cluster list
+---------------+--------------------------+----------+--------+-------------------+
|     NAME      |           URL            | DATABASE | STATE  |      MESSAGE      |
+---------------+--------------------------+----------+--------+-------------------+
| LXD-3-Manager | https://10.30.50.60:8443 | YES      | ONLINE | fully operational |
+---------------+--------------------------+----------+--------+-------------------+
| LXD-Server1   | https://10.30.50.61:8443 | YES      | ONLINE | fully operational |
+---------------+--------------------------+----------+--------+-------------------+
| LXD-Server2   | https://10.30.50.62:8443 | YES      | ONLINE | fully operational |
+---------------+--------------------------+----------+--------+-------------------+
| LXD-Server3   | https://10.30.50.63:8443 | NO       | ONLINE | fully operational |
+---------------+--------------------------+----------+--------+-------------------+
Example output on LXD-Server2:
root@LXD-Server2:~# lxc cluster list
+---------------+--------------------------+----------+---------+----------------------------------+
|     NAME      |           URL            | DATABASE |  STATE  |             MESSAGE              |
+---------------+--------------------------+----------+---------+----------------------------------+
| LXD-3-Manager | https://10.30.50.60:8443 | YES      | OFFLINE | no heartbeat since 38.837393948s |
+---------------+--------------------------+----------+---------+----------------------------------+
| LXD-Server1   | https://10.30.50.61:8443 | YES      | OFFLINE | no heartbeat since 38.837393948s |
+---------------+--------------------------+----------+---------+----------------------------------+
| LXD-Server2   | https://10.30.50.62:8443 | YES      | OFFLINE | no heartbeat since 38.837393948s |
+---------------+--------------------------+----------+---------+----------------------------------+
| LXD-Server3   | https://10.30.50.63:8443 | NO       | OFFLINE | no heartbeat since 38.837393948s |
+---------------+--------------------------+----------+---------+----------------------------------+
I'm just trying to understand why these servers show every cluster member as offline.
An interesting data point: running the “lxc cluster list” command repeatedly on Server2 shows a different “no heartbeat since” value each time, and those values go up and down rather than climbing steadily. For example, the first output gives “no heartbeat since 38.837393948s”, the second time it gives “no heartbeat since 37.059828666s”, the third time it gives “no heartbeat since 36.299778595s”, etc. If the node were truly down, I would expect the time since the last heartbeat to continue to rise - not rise and fall.
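In case it helps anyone reproduce the rise-and-fall pattern, here is a minimal sketch of how the “no heartbeat since” values could be sampled over several runs. The function names heartbeat_ages and sample_heartbeats are hypothetical helpers I made up for illustration, not part of LXD. The sketch assumes the message format shown in the output above and only handles values reported in plain seconds.

```shell
#!/bin/sh
# Hypothetical helper: read `lxc cluster list` output on stdin and print
# each "no heartbeat since <N>s" value (in seconds), one per line.
# Assumption: the value is plain seconds, as in the output above;
# lines without that message (e.g. ONLINE members) print nothing.
heartbeat_ages() {
    sed -n 's/.*no heartbeat since \([0-9.]*\)s.*/\1/p'
}

# Hypothetical helper: run `lxc cluster list` a number of times
# (default 5), one second apart, printing the first reported value
# from each run so the trend is visible.
sample_heartbeats() {
    count="${1:-5}"
    i=0
    while [ "$i" -lt "$count" ]; do
        lxc cluster list | heartbeat_ages | head -n 1
        sleep 1
        i=$((i + 1))
    done
}
```

After sourcing these definitions on the affected node, running “sample_heartbeats 10” should print ten values; a steadily rising sequence would suggest a member that really is unreachable, while values that bounce around match what I am seeing.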