LXD 3.0.1 - Cluster - Offline

Trying out the new clustering technology in LXD 3.0.1 per Stephane’s YouTube video (https://www.youtube.com/watch?v=RnBu7t2wD4U)

I installed Ubuntu 18.04 on 4 nodes - one master node and three compute nodes - all with identical installs. On every node, I purged the old LXD packages (apt remove --purge lxd lxd-client liblxc1 lxcfs), then installed the snap version (snap install lxd).
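
Spelled out, the per-node cleanup and reinstall was roughly this (the version check at the end is just to confirm the snap binary is the one now on the PATH):

apt remove --purge lxd lxd-client liblxc1 lxcfs
snap install lxd
lxd --version    # should now report the snap build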

On the master node, I ran the “lxd init” setup script and took most of the default values (local disk, standard networking, etc.). I then added the three compute nodes in a similar way. In the end, all 4 nodes appear in the output of “lxc cluster list”.
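
From memory, the clustering portion of the “lxd init” dialog looked roughly like this on the master (the exact prompt wording can vary between LXD releases, and the defaults shown are just my hostname and IP):

Would you like to use LXD clustering? (yes/no) [default=no]: yes
What name should be used to identify this node in the cluster? [default=LXD-3-Manager]:
What IP address or DNS name should be used to reach this node? [default=10.30.50.60]:
Are you joining an existing cluster? (yes/no) [default=no]: no

On the compute nodes it is the same dialog, but answering “yes” to the join question and supplying the master’s address plus the cluster trust password.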

However, on two of the four nodes (specifically LXD-Server2 and LXD-Server3), all of the cluster members show as OFFLINE when I run the “lxc cluster list” command. Those two nodes are definitely online and running containers.

Example output on Master node:

root@LXD-3-Manager:~# lxc cluster list
+---------------+--------------------------+----------+--------+-------------------+
|     NAME      |           URL            | DATABASE | STATE  |      MESSAGE      |
+---------------+--------------------------+----------+--------+-------------------+
| LXD-3-Manager | https://10.30.50.60:8443 | YES      | ONLINE | fully operational |
+---------------+--------------------------+----------+--------+-------------------+
| LXD-Server1   | https://10.30.50.61:8443 | YES      | ONLINE | fully operational |
+---------------+--------------------------+----------+--------+-------------------+
| LXD-Server2   | https://10.30.50.62:8443 | YES      | ONLINE | fully operational |
+---------------+--------------------------+----------+--------+-------------------+
| LXD-Server3   | https://10.30.50.63:8443 | NO       | ONLINE | fully operational |
+---------------+--------------------------+----------+--------+-------------------+

Example output on LXD-Server2:

root@LXD-Server2:~# lxc cluster list
+---------------+--------------------------+----------+---------+----------------------------------+
|     NAME      |           URL            | DATABASE |  STATE  |             MESSAGE              |
+---------------+--------------------------+----------+---------+----------------------------------+
| LXD-3-Manager | https://10.30.50.60:8443 | YES      | OFFLINE | no heartbeat since 38.837393948s |
+---------------+--------------------------+----------+---------+----------------------------------+
| LXD-Server1   | https://10.30.50.61:8443 | YES      | OFFLINE | no heartbeat since 38.837393948s |
+---------------+--------------------------+----------+---------+----------------------------------+
| LXD-Server2   | https://10.30.50.62:8443 | YES      | OFFLINE | no heartbeat since 38.837393948s |
+---------------+--------------------------+----------+---------+----------------------------------+
| LXD-Server3   | https://10.30.50.63:8443 | NO       | OFFLINE | no heartbeat since 38.837393948s |
+---------------+--------------------------+----------+---------+----------------------------------+

Just trying to understand why those two servers show the whole cluster as offline.

An interesting data point - running the “lxc cluster list” command repeatedly on LXD-Server2 shows different “no heartbeat since” values - and those values go up and down. For example, the first run gives “no heartbeat since 38.837393948s”, the second gives “no heartbeat since 37.059828666s”, the third gives “no heartbeat since 36.299778595s”, and so on. If the nodes were truly down, I would expect the time since the last heartbeat to keep rising - not rise and fall.
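
For what it’s worth, the rise-and-fall is easy to watch in real time by polling the command, e.g.:

watch -n 5 lxc cluster list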

That’s a bit odd - can you post the output of “lxc info” from those machines?

Also, you may want to make sure that the clocks of all 4 systems are in sync, just in case that messes with the heartbeats somehow.
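
On 18.04, something like this on each node will show whether the clock is actually being kept in sync (timedatectl ships with systemd, so it should already be installed):

timedatectl status    # check the “System clock synchronized” and “systemd-timesyncd.service active” lines
date -u               # quick way to eyeball the UTC time across all 4 nodes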

Thanks Stephane. The clocks were indeed the issue. The “Online” servers were about 5 seconds off, while the “Offline” servers were about 45 seconds off.
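
In case anyone else hits this: on 18.04 the simplest fix is probably to turn systemd’s NTP sync back on on the drifted nodes (any other NTP client should work just as well):

timedatectl set-ntp true    # enables systemd-timesyncd
timedatectl status          # “System clock synchronized” should flip to yes shortly after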

It seems we might need an enhancement to the clustering software to sync the clocks during cluster init (or, at least, to warn the user when the clocks are not in sync).