We have an LXD cluster set up with 4 member nodes.
lxc cluster ls
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| NAME | URL | ROLES | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE | MESSAGE |
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| glf-science-0 | https://glf-science-0.ent.dfo-mpo.ca:8443 | database-standby | x86_64 | default | | OFFLINE | No heartbeat for 7h40m7.464691558s (2023-06-23 04:17:02.595944552 +0000 UTC) |
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| glf-science-1 | https://glf-science-1.ENT.dfo-mpo.ca:8443 | database | x86_64 | default | | ONLINE | Fully operational |
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| glf-science-2 | https://glf-science-2.ent.dfo-mpo.ca:8443 | database | x86_64 | default | | ONLINE | Fully operational |
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| glf-science-3 | https://glf-science-3.ENT.dfo-mpo.ca:8443 | database-leader | x86_64 | default | | ONLINE | Fully operational |
| | | database | | | | | |
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
One of the nodes is giving us difficulty, and we are looking for some help troubleshooting. All four members are running Ubuntu 22.04 and the same LXD version (5.14).
Here is a description of the problematic behaviour:
- No problems were encountered when joining the first three members. When attempting to bring in glf-science-0, the process stalled several times on the lxd init step. After each join attempt, the cluster's database leader would have a raft node corresponding to the glf-science-0 address; however, the node would not show up in lxc cluster list from anywhere within the cluster.
- After removing the bad raft node and rerunning lxd init on a clean install, glf-science-0 successfully made it into the cluster (the raft cleanup is sketched after the listings below). When checking lxc cluster list, all four nodes were online, but only for a few seconds. Since then, glf-science-0 has been offline.
- Communication between the four members is asymmetrical. For example, running lxc list from within glf-science-0 results in the following:
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
|        NAME        |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS |   LOCATION    |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| andes-preprod-east | RUNNING | 240.6.142.86 (eth0)  |      | CONTAINER | 2         | glf-science-1 |
|                    |         | 142.130.6.217 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| dmapps             | RUNNING | 240.4.21.42 (eth0)   |      | CONTAINER | 1         | glf-science-0 |
|                    |         | 142.130.4.27 (eth1)  |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| gulf-fisheriescape | RUNNING | 240.6.35.204 (eth0)  |      | CONTAINER | 0         | glf-science-3 |
|                    |         | 142.130.6.173 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| r-shiny-science    | RUNNING | 240.6.142.141 (eth0) |      | CONTAINER | 0         | glf-science-1 |
|                    |         | 142.130.6.230 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
While running it from any of the other members results in this:
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
|        NAME        |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS |   LOCATION    |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| andes-preprod-east | RUNNING | 240.6.142.86 (eth0)  |      | CONTAINER | 2         | glf-science-1 |
|                    |         | 142.130.6.217 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| dmapps             | ERROR   |                      |      | CONTAINER | 0         | glf-science-0 |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| gulf-fisheriescape | RUNNING | 240.6.35.204 (eth0)  |      | CONTAINER | 0         | glf-science-3 |
|                    |         | 142.130.6.173 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| r-shiny-science    | RUNNING | 240.6.142.141 (eth0) |      | CONTAINER | 0         | glf-science-1 |
|                    |         | 142.130.6.230 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
The container named dmapps was created on the glf-science-0 node, and while the other nodes seem to know of its existence, they cannot reach it via the LXD API. The container's address (i.e., 240.4.21.42) is pingable from all four cluster nodes.
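For reference, the raft cleanup mentioned in the second bullet above was roughly the following, reconstructed from memory (run on the database leader; 142.130.6.94 is the address glf-science-0 had registered):

# Inspect the local raft configuration to find the stale entry
sudo lxd sql local "SELECT * FROM raft_nodes"

# Remove the raft node that pointed at glf-science-0
sudo lxd cluster remove-raft-node 142.130.6.94:8443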
We suspect this has to do with the accessibility of the glf-science-0 machine on port 8443; however, we do not see anything obvious that would be preventing communication over that port (firewalls are off). When running curl glf-science-0:8443 from the other nodes, we receive this message:
curl: (7) Failed to connect to glf-science-0 port 8443 after 6 ms: Connection refused
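(For completeness, the REST API itself can also be probed from the other members; the URL below is just the member's FQDN from the cluster list plus the standard /1.0 endpoint, with -k because the cluster certificate is self-signed:

curl -k https://glf-science-0.ent.dfo-mpo.ca:8443/1.0

We would expect the same connection-refused result there, since the refusal happens before TLS comes into play.)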
Here is the output from running ss -lnatup | grep 8443 on glf-science-0:
tcp LISTEN 0 4096 127.0.1.1:8443 0.0.0.0:*
tcp ESTAB 0 0 142.130.6.94:35542 142.130.6.35:8443
tcp ESTAB 0 0 142.130.6.94:49964 142.130.6.142:8443
tcp ESTAB 0 0 142.130.6.94:36456 142.130.6.5:8443
tcp ESTAB 0 0 142.130.6.94:45314 142.130.6.35:8443
And here is the same command run on one of the other members:
tcp LISTEN 0 4096 142.130.6.142:8443 0.0.0.0:*
tcp ESTAB 0 0 142.130.6.142:8443 142.130.6.5:47084
tcp ESTAB 0 0 142.130.6.142:35408 142.130.6.142:8443
tcp ESTAB 0 0 142.130.6.142:59474 142.130.6.35:8443
tcp ESTAB 0 0 142.130.6.142:59460 142.130.6.35:8443
tcp ESTAB 0 0 142.130.6.142:8443 142.130.6.142:35408
tcp ESTAB 0 0 142.130.6.142:59482 142.130.6.35:8443
tcp ESTAB 0 0 142.130.6.142:48014 142.130.6.5:8443
tcp ESTAB 0 0 142.130.6.142:8443 142.130.6.94:49964
tcp ESTAB 0 0 142.130.6.142:8443 142.130.6.35:36998
tcp ESTAB 0 0 142.130.6.142:8443 142.130.6.35:37004
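One difference that stands out to us: on glf-science-0 the daemon is listening on 127.0.1.1:8443, while the healthy members listen on their 142.130.x.x addresses. A minimal set of checks we can run on glf-science-0, assuming Ubuntu's usual convention of mapping the machine's hostname to 127.0.1.1 in /etc/hosts (core.https_address and cluster.https_address are standard LXD server keys; the rest is plain name resolution):

# Address the LXD daemon is configured to bind (server-wide key)
lxc config get core.https_address

# Member-specific cluster address, if set
lxc config get cluster.https_address

# Check whether the hostname resolves to the 127.0.1.1 loopback alias
getent hosts glf-science-0
grep glf-science-0 /etc/hosts

If either address key was set to the hostname rather than an IP, the daemon would resolve it through /etc/hosts and bind the loopback alias, which would be consistent with the refused connections above.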
Any leads would be appreciated. Thanks.