LXD is confused about clustering state

Hi,

I’ve set up LXD clustering on an existing LXD node today with lxc cluster enable <name>, and then added a fresh node to it. However, after rebooting all nodes, LXD seems to have ended up in an inconsistent state:

tobias@artemis:~$ lxc cluster list
Error: Server is not clustered
tobias@artemis:~$ lxc cluster enable artemis
Error: This LXD server is already clustered

I’m not entirely sure what is going on here or how to get the cluster back up and running. Hopefully someone here does.

Environment

  • LXD 4.22
  • uname -a: Linux artemis 5.4.0-97-generic #110-Ubuntu SMP Thu Jan 13 18:22:13 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Thanks!

Can you show lxd sql global "SELECT * FROM nodes" and lxd sql local "SELECT * FROM raft_nodes"?

Of course. Here’s the first command:

+----+-----------+-------------+----------------------------+--------+----------------+--------------------------------+-------+------+-------------------+
| id |   name    | description |          address           | schema | api_extensions |           heartbeat            | state | arch | failure_domain_id |
+----+-----------+-------------+----------------------------+--------+----------------+--------------------------------+-------+------+-------------------+
| 1  | artemis   |             | artemis.array21.dev:8443   | 59     | 285            | 2022-02-07T18:59:25.831071209Z | 0     | 2    | <nil>             |
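| 2  | aphrodite |             | aphrodite.array21.dev:8443 | 59     | 285            | 2022-02-07T18:59:26.595791422Z | 0     | 2    | <nil>             |
+----+-----------+-------------+----------------------------+--------+----------------+--------------------------------+-------+------+-------------------+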

and the second:

+----+----------------------------+------+-----------+
| id |          address           | role |   name    |
+----+----------------------------+------+-----------+
| 1  | artemis.array21.dev:8443   | 0    | artemis   |
| 2  | aphrodite.array21.dev:8443 | 1    | aphrodite |
+----+----------------------------+------+-----------+

Hmm, can you try running systemctl reload snap.lxd.daemon to have LXD restart and see if that helps? If not, look at /var/snap/lxd/common/lxd/logs/lxd.log for relevant errors.

Looking in lxd.log did turn up something that might be causing the issue:

t=2022-02-07T19:01:14+0000 lvl=eror msg="Invalid configuration key: Couldn't resolve \"artemis.array21.dev\"" key=cluster.https_address
t=2022-02-07T19:01:14+0000 lvl=eror msg="Invalid configuration key: Couldn't resolve \"artemis.array21.dev\"" key=core.https_address
t=2022-02-07T19:01:14+0000 lvl=info msg="Starting database node" id=1 local=1 role=voter
t=2022-02-07T19:01:28+0000 lvl=eror msg="Invalid configuration key: Couldn't resolve \"artemis.array21.dev\"" key=core.https_address

Something that seems odd to me is the extra " appended at the end of the FQDN, though that might just be the logging. Running dig on the address returns the correct IP address too. I’m unsure how to change it to the IP itself (10.10.2.4) to see if that helps, as LXD tells me that changing the address is not supported.
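On the extra quote: the msg value in these log lines is itself wrapped in double quotes, so the quotes around the hostname are escaped as \". The trailing \"" is therefore the escaped quote closing the hostname followed by the quote closing the message, and the hostname carries no stray character. As a rough sketch, the names LXD failed to resolve can be pulled out of the log like this. The function name failed_hosts is just an illustration, and the path is the snap log location mentioned above:

```shell
# Print each distinct hostname found in a "Couldn't resolve" log message.
# The pattern matches the escaped quote (\") that precedes the hostname
# and stops at the backslash of the escaped quote that follows it.
failed_hosts() {
    grep -o 'resolve \\"[^\\]*' "$1" | sed 's/.*"//' | sort -u
}

# Snap log location from this thread; adjust for non-snap installs.
LOG=/var/snap/lxd/common/lxd/logs/lxd.log
if [ -r "$LOG" ]; then
    failed_hosts "$LOG"
fi
```

Each printed name can then be checked on its own with getent or dig.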

What does getent hosts artemis.array21.dev get you?
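To run the same check against every cluster member in one go, something like the following sketch could be used. It takes the addresses as they appear in the raft_nodes output above, strips the port, and asks getent for each name. The helper names host_of and check_host are made up for this example:

```shell
# Strip the ":port" suffix from a cluster address such as "host:8443".
host_of() {
    printf '%s\n' "${1%:*}"
}

# Report whether a name resolves through NSS, the lookup path getent uses.
check_host() {
    if getent hosts "$1" >/dev/null 2>&1; then
        echo "resolves: $1"
    else
        echo "does not resolve: $1"
    fi
}

# Addresses copied from the raft_nodes table in this thread.
for addr in artemis.array21.dev:8443 aphrodite.array21.dev:8443; do
    check_host "$(host_of "$addr")"
done
```

Unlike dig, which queries DNS directly, getent goes through /etc/nsswitch.conf and /etc/hosts, so the two can disagree.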

Changing to the IP is a bit tricky but could be done by shutting down LXD on both servers and using lxd cluster edit on both of them.
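Roughly, and assuming the snap package, the sequence on each server would look like the sketch below. It defaults to a dry run that only prints the commands. The DRY_RUN variable and the run and edit_cluster_member helpers are illustrative wrappers and not part of LXD. The real commands need root, and every member should be stopped before any of them is edited:

```shell
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

edit_cluster_member() {
    run snap stop lxd      # stop LXD so the local raft configuration is not in use
    run lxd cluster edit   # opens the member list in an editor; change the address fields
    run snap start lxd     # bring LXD back up with the new addresses
}

edit_cluster_member
```

Running it with DRY_RUN=0 as root executes the steps for real. The edited addresses need to match on every member.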

An alternative would be to pin those names to static values in /etc/hosts.
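A minimal sketch of that, assuming a helper called pin_host (a name made up here) that appends an entry only if the name is not already listed. It takes the hosts file as an optional third argument so it can be tried on a scratch file first:

```shell
# usage: pin_host <ip> <name> [hosts-file]
# Appends "<ip><TAB><name>" unless a non-comment entry already lists <name>.
pin_host() {
    ip="$1"; name="$2"; file="${3:-/etc/hosts}"
    if awk -v n="$name" '
        !/^[[:space:]]*#/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break        # ignore trailing comments
                if ($i == n) found = 1
            }
        }
        END { exit !found }' "$file" 2>/dev/null; then
        echo "already present: $name"
    else
        printf '%s\t%s\n' "$ip" "$name" >> "$file"
        echo "added: $name"
    fi
}

# Example (needs root when writing to /etc/hosts):
#   pin_host 10.10.2.4 artemis.array21.dev
```

The same would be needed for the other member's name on each server, using its actual address.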

getent returns:

10.10.2.4       artemis.array21.dev

I’ll give the shutdown and edit a try in a couple of hours when there’s less load, and I’ll post a follow-up describing my findings.

After fiddling with systemd for far too long trying to get the LXD daemon to stay quiet, I’ve been able to get it working. Changing artemis.array21.dev to 10.10.2.2 made LXD realize it is in a cluster again.

Do you want me to open a bug report for this?

Maybe, but I’m not yet sure where the bug is. LXD was trying to resolve artemis.array21.dev, which seems correct, but that was failing for some reason.

It may be snap-related. In that case, can you try the following and see if it resolves properly there too?
nsenter --mount=/run/snapd/ns/lxd.mnt getent hosts artemis.array21.dev