LXD 3.21: desynced cluster

The troublesome LXD cluster saga continues:

All other cluster members see this:

ubuntu@aa1-cptef101-n3:~$ lxc cluster ls
+-----------------+--------------------------+----------+--------+-------------------+--------------+
|      NAME       |           URL            | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef101-n1 | https://10.224.1.11:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef101-n2 | https://10.224.1.12:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef101-n3 | https://10.224.1.13:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef101-n4 | https://10.224.1.14:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef102-n1 | https://10.224.1.21:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef102-n2 | https://10.224.1.22:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef102-n3 | https://10.224.1.23:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef102-n4 | https://10.224.1.24:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+

aa1-cptef101-n2 sees this:

root@aa1-cptef101-n2:~# lxc cluster ls
+-----------------+--------------------------+----------+---------+----------------------------------+--------------+
|      NAME       |           URL            | DATABASE |  STATE  |             MESSAGE              | ARCHITECTURE |
+-----------------+--------------------------+----------+---------+----------------------------------+--------------+
| aa1-cptef101-n1 | https://10.224.1.11:8443 | NO       | OFFLINE | no heartbeat since 27.552492624s | x86_64       |
+-----------------+--------------------------+----------+---------+----------------------------------+--------------+
| aa1-cptef101-n2 | https://10.224.1.12:8443 | NO       | OFFLINE | no heartbeat since 27.553664584s | x86_64       |
+-----------------+--------------------------+----------+---------+----------------------------------+--------------+
| aa1-cptef101-n3 | https://10.224.1.13:8443 | YES      | OFFLINE | no heartbeat since 27.553363184s | x86_64       |
+-----------------+--------------------------+----------+---------+----------------------------------+--------------+
| aa1-cptef101-n4 | https://10.224.1.14:8443 | YES      | OFFLINE | no heartbeat since 27.553172404s | x86_64       |
+-----------------+--------------------------+----------+---------+----------------------------------+--------------+
| aa1-cptef102-n1 | https://10.224.1.21:8443 | YES      | OFFLINE | no heartbeat since 27.552999854s | x86_64       |
+-----------------+--------------------------+----------+---------+----------------------------------+--------------+
| aa1-cptef102-n2 | https://10.224.1.22:8443 | YES      | OFFLINE | no heartbeat since 27.552827674s | x86_64       |
+-----------------+--------------------------+----------+---------+----------------------------------+--------------+
| aa1-cptef102-n3 | https://10.224.1.23:8443 | NO       | OFFLINE | no heartbeat since 27.552713144s | x86_64       |
+-----------------+--------------------------+----------+---------+----------------------------------+--------------+
| aa1-cptef102-n4 | https://10.224.1.24:8443 | NO       | OFFLINE | no heartbeat since 27.552588354s | x86_64       |
+-----------------+--------------------------+----------+---------+----------------------------------+--------------+

aa1-cptef101-n2 will intermittently see the rest of the cluster, but then immediately falls back to thinking every member (itself included) is offline. While this happens, the other cluster members still show it as online.

The weird part is that we can still exec and edit configs (and the changes take effect), so it’s not a complete desync.

I will happily provide more info; I am just not sure what else would be useful.

@freeekanayaka

Can you confirm that the system time is identical on all machines?
The above could be explained by having that last server have a 30s or so clock drift.

LXD clusters should always run with a functional NTP daemon to guarantee all servers have the same time (or close enough to it).
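
If it helps anyone checking for this, here is a minimal Python sketch of the comparison. It assumes you have already collected one epoch timestamp per member at roughly the same moment (for example by running `date +%s.%N` on each machine). The readings below are made up for illustration, `find_drifted` is a hypothetical helper rather than anything LXD ships, and the 20-second threshold is an assumption based on what I believe the default `cluster.offline_threshold` to be, so check your own cluster config.

```python
from statistics import median


def find_drifted(readings, threshold=20.0):
    """Return {member: offset_seconds} for every member whose clock
    differs from the median reading by more than `threshold` seconds.

    `readings` maps member name -> epoch seconds sampled at about the
    same moment on each machine. The default threshold is an assumed
    value, not read from LXD.
    """
    reference = median(readings.values())
    return {
        name: round(ts - reference, 3)
        for name, ts in readings.items()
        if abs(ts - reference) > threshold
    }


# Hypothetical readings: one member is roughly 27 seconds ahead.
readings = {
    "aa1-cptef101-n1": 1000.0,
    "aa1-cptef101-n2": 1027.5,
    "aa1-cptef101-n3": 1000.1,
    "aa1-cptef101-n4": 999.9,
}

for member, offset in find_drifted(readings).items():
    print(f"{member}: {offset:+.3f}s from the cluster median")
```

Using the median as the reference means a single drifted machine does not pull the baseline towards itself, which matches the situation here where only one member disagrees with the rest.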

Turns out it WAS clock drift: ntpd had been failing on that machine.

Good, got to like it when things make sense :)