Asymmetrical communication and "no heartbeat" for LXD cluster node member

davjfish · June 23, 2023, 12:29pm

We have an LXD cluster set up with 4 member nodes.

lxc cluster ls
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
|     NAME      |                    URL                    |      ROLES       | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION |  STATE  |                                   MESSAGE                                    |
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| glf-science-0 | https://glf-science-0.ent.dfo-mpo.ca:8443 | database-standby | x86_64       | default        |             | OFFLINE | No heartbeat for 7h40m7.464691558s (2023-06-23 04:17:02.595944552 +0000 UTC) |
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| glf-science-1 | https://glf-science-1.ENT.dfo-mpo.ca:8443 | database         | x86_64       | default        |             | ONLINE  | Fully operational                                                            |
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| glf-science-2 | https://glf-science-2.ent.dfo-mpo.ca:8443 | database         | x86_64       | default        |             | ONLINE  | Fully operational                                                            |
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+
| glf-science-3 | https://glf-science-3.ENT.dfo-mpo.ca:8443 | database-leader  | x86_64       | default        |             | ONLINE  | Fully operational                                                            |
|               |                                           | database         |              |                |             |         |                                                                              |
+---------------+-------------------------------------------+------------------+--------------+----------------+-------------+---------+------------------------------------------------------------------------------+

One of the nodes is giving us difficulty, and we are looking for some help troubleshooting. All four members are running Ubuntu 22.04 and the same LXD version (5.14).

Here is a description of the problematic behaviour:

We encountered no problems when joining the first three members. When attempting to bring in glf-science-0, the process stalled several times on lxd init. After each attempt to join, the cluster’s database leader would have a raft node corresponding to the glf-science-0 address; however, that node would not show up in lxc cluster list from anywhere within the cluster.
After removing the bad raft node and rerunning lxd init on a clean install, glf-science-0 successfully made it into the cluster. When checking lxc cluster list, all four nodes were online, but only for a few seconds. Since then, glf-science-0 has been offline.

Communication between the four members is asymmetrical. For example, running lxc list from within glf-science-0 results in the following:

+--------------------+---------+----------------------+------+-----------+-----------+---------------+
|        NAME        |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS |   LOCATION    |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| andes-preprod-east | RUNNING | 240.6.142.86 (eth0)  |      | CONTAINER | 2         | glf-science-1 |
|                    |         | 142.130.6.217 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| dmapps             | RUNNING | 240.4.21.42 (eth0)   |      | CONTAINER | 1         | glf-science-0 |
|                    |         | 142.130.4.27 (eth1)  |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| gulf-fisheriescape | RUNNING | 240.6.35.204 (eth0)  |      | CONTAINER | 0         | glf-science-3 |
|                    |         | 142.130.6.173 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| r-shiny-science    | RUNNING | 240.6.142.141 (eth0) |      | CONTAINER | 0         | glf-science-1 |
|                    |         | 142.130.6.230 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+

While running it from any of the other members results in this:

+--------------------+---------+----------------------+------+-----------+-----------+---------------+
|        NAME        |  STATE  |         IPV4         | IPV6 |   TYPE    | SNAPSHOTS |   LOCATION    |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| andes-preprod-east | RUNNING | 240.6.142.86 (eth0)  |      | CONTAINER | 2         | glf-science-1 |
|                    |         | 142.130.6.217 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| dmapps             | ERROR   |                      |      | CONTAINER | 0         | glf-science-0 |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| gulf-fisheriescape | RUNNING | 240.6.35.204 (eth0)  |      | CONTAINER | 0         | glf-science-3 |
|                    |         | 142.130.6.173 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+
| r-shiny-science    | RUNNING | 240.6.142.141 (eth0) |      | CONTAINER | 0         | glf-science-1 |
|                    |         | 142.130.6.230 (eth1) |      |           |           |               |
+--------------------+---------+----------------------+------+-----------+-----------+---------------+

The container named dmapps was created on the glf-science-0 node, and while the other nodes seem to know of its existence, they cannot reach it via the LXD API. The container address (i.e., 240.4.21.42) is pingable from all four cluster nodes.

We suspect this has to do with the accessibility of the glf-science-0 machine on port 8443; however, we do not see anything obvious that would prevent communication over that port (firewalls are off). When running curl glf-science-0:8443 from the other nodes, we receive this message:

curl: (7) Failed to connect to glf-science-0 port 8443 after 6 ms: Connection refused

Here is the output from running ss -lnatup | grep 8443 on glf-science-0:

tcp   LISTEN    0      4096                127.0.1.1:8443         0.0.0.0:*
tcp   ESTAB     0      0                142.130.6.94:35542   142.130.6.35:8443
tcp   ESTAB     0      0                142.130.6.94:49964  142.130.6.142:8443
tcp   ESTAB     0      0                142.130.6.94:36456    142.130.6.5:8443
tcp   ESTAB     0      0                142.130.6.94:45314   142.130.6.35:8443

And here is the same output from one of the other cluster members:

tcp   LISTEN    0      4096        142.130.6.142:8443         0.0.0.0:*
tcp   ESTAB     0      0           142.130.6.142:8443     142.130.6.5:47084
tcp   ESTAB     0      0           142.130.6.142:35408  142.130.6.142:8443
tcp   ESTAB     0      0           142.130.6.142:59474   142.130.6.35:8443
tcp   ESTAB     0      0           142.130.6.142:59460   142.130.6.35:8443
tcp   ESTAB     0      0           142.130.6.142:8443   142.130.6.142:35408
tcp   ESTAB     0      0           142.130.6.142:59482   142.130.6.35:8443
tcp   ESTAB     0      0           142.130.6.142:48014    142.130.6.5:8443
tcp   ESTAB     0      0           142.130.6.142:8443    142.130.6.94:49964
tcp   ESTAB     0      0           142.130.6.142:8443    142.130.6.35:36998
tcp   ESTAB     0      0           142.130.6.142:8443    142.130.6.35:37004
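One thing worth comparing in the two outputs above is the local address on the LISTEN line. As a minimal sketch, assuming ss output in the column layout shown above, the following flags any listening socket bound to a loopback address. The function name and the sample text are illustrative, not part of LXD or ss.

```python
import ipaddress


def loopback_listeners(ss_output: str) -> list[str]:
    """Return the local address:port of LISTEN sockets bound to loopback."""
    flagged = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: netid, state, recv-q, send-q, local, peer
        if len(fields) < 6 or fields[1] != "LISTEN":
            continue
        local = fields[4]
        host, _, _port = local.rpartition(":")
        host = host.strip("[]")  # IPv6 addresses are printed in brackets
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            continue  # wildcard such as "*", not a literal address
        if addr.is_loopback:
            flagged.append(local)
    return flagged


# Sample lines copied from the two outputs in this post.
sample = """\
tcp   LISTEN    0      4096                127.0.1.1:8443         0.0.0.0:*
tcp   ESTAB     0      0                142.130.6.94:35542   142.130.6.35:8443
tcp   LISTEN    0      4096        142.130.6.142:8443         0.0.0.0:*
"""

print(loopback_listeners(sample))
```

A listener that shows up here only accepts connections originating on the same machine, which would be consistent with a "Connection refused" seen from other hosts.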

Any leads would be appreciated. Thanks.

tomp · June 23, 2023, 12:59pm

You’ve cleared the firewall on each host and checked network-level filtering policies?

davjfish · June 23, 2023, 1:05pm

Yes, but actually we just figured out what was going on.

When working out which IP to listen on, LXD was resolving the host’s own name to Ubuntu’s local loopback address (i.e., 127.0.1.1) because the domain name was, for some unknown reason, listed in /etc/hosts.

Removing this entry from the /etc/hosts file allowed glf-science-0 to come online.
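
For anyone checking for the same cause, here is a minimal sketch that scans /etc/hosts-style text for lines mapping a given hostname to a loopback address. The function name and the example file contents are hypothetical; they are not the actual file from this thread.

```python
import ipaddress


def loopback_entries(hosts_text: str, hostname: str) -> list[str]:
    """Return loopback addresses that hosts-file text maps `hostname` to."""
    wanted = hostname.lower()
    hits = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if wanted not in (name.lower() for name in names):
            continue
        try:
            if ipaddress.ip_address(address).is_loopback:
                hits.append(address)
        except ValueError:
            continue  # skip lines whose first field is not an IP address
    return hits


# Hypothetical contents, made up for illustration only.
example_hosts = """\
127.0.0.1 localhost
127.0.1.1 glf-science-0.ent.dfo-mpo.ca glf-science-0
"""

print(loopback_entries(example_hosts, "glf-science-0.ent.dfo-mpo.ca"))
```

To run it against a real machine, one option is to pass in the contents of /etc/hosts and the name returned by socket.getfqdn(). A non-empty result suggests the host name resolves locally to loopback, which is the situation described above.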