When turning off the first LXD cluster node: no available dqlite leader server found

Hi,

When I reboot or turn off LXD host-1, the cluster reports:
lxc cluster list
Error: failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found

How can I set a leader (LXD host-3)? I also noticed that host-2 has no database. Is this normal?

+--------+----------------------------+----------+--------+-------------------+
|  NAME  |            URL             | DATABASE | STATE  |      MESSAGE      |
+--------+----------------------------+----------+--------+-------------------+
| host-1 | https://192.168.100.1:8443 | YES      | ONLINE | fully operational |
+--------+----------------------------+----------+--------+-------------------+
| host-2 | https://192.168.100.2:8443 | NO       | ONLINE | fully operational |
+--------+----------------------------+----------+--------+-------------------+
| host-3 | https://192.168.100.3:8443 | YES      | ONLINE | fully operational |
+--------+----------------------------+----------+--------+-------------------+

lxc cluster show for each node:

user@host-2: lxc cluster show host-2
server_name: host-2
url: https://192.168.100.2:8443
database: false
status: Online

user@host-3: lxc cluster show host-3
server_name: host-3
url: https://192.168.100.3:8443
database: true
status: Online
message: fully operational

user@host-1: lxc cluster show host-1
server_name: host-1
url: https://192.168.100.1:8443
database: true
status: Online
message: fully operational

Host config for each node:
user@host-3:~$ lxc info
config:
cluster.https_address: 192.168.100.3:8443
core.https_address: 192.168.100.3:8443
core.trust_password: true

user@host-1:~$ lxc info
config:
cluster.https_address: 192.168.100.1:8443
core.https_address: 192.168.100.1:8443
core.trust_password: true

user@host-2:~$ lxc info
config:
cluster.https_address: 192.168.100.2:8443
core.https_address: 192.168.100.2:8443
core.trust_password: true

What version of LXD are you using?

The fact that host-2 is not a database node is not normal, and might be a bug.

Usually with 3 nodes you have 3 database nodes, so if you reboot or shut down one of the three nodes, the other two remain fully operational. In your case you have only 2 database nodes, so turning off either of them makes the cluster lose quorum and become unavailable.
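For reference, one way to check which cluster members currently hold a copy of the database (a rough sketch, assuming the snap package of LXD 3.x; lxd sql may need sudo):

lxc cluster list                                 # DATABASE column shows YES for database members
sudo lxd sql local "SELECT * FROM raft_nodes"    # raft/dqlite members as recorded in the local store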

On all nodes (snap package):
lxc --version
3.10
lxd --version
3.10

Log from host-3 when host-1 is offline:
DBUG[02-18|09:48:50] Start database node id=3 address=192.168.100.3:8443
EROR[02-18|09:48:55] Failed to start the daemon: Failed to create raft factory: failed to create bolt store for raft logs: timeout
INFO[02-18|09:48:55] Starting shutdown sequence
DBUG[02-18|09:48:55] Not unmounting temporary filesystems (containers are still running)
INFO[02-18|09:48:55] Saving simplestreams cache
INFO[02-18|09:48:55] Saved simplestreams cache
Error: Failed to create raft factory: failed to create bolt store for raft logs: timeout
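(For anyone hitting the same error: on a snap install, daemon logs like the above can usually be retrieved with the standard snap tooling.)

snap logs lxd                     # recent output from the LXD daemon
journalctl -u snap.lxd.daemon     # full log of the LXD snap service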

So it looks like a bug.

Do you remember how you built this cluster? I would expect that you added and removed some nodes at some point. It would be useful to know the detailed lifecycle of the cluster, i.e. which nodes were added and removed, and when.

Yes,

lxd init on host-1 as the first node of the cluster, following the default instructions (LXD 3.1).
Reinstalled host-2 yesterday (after removing it with --force).
host-3 untouched since it joined the cluster.

apt-get update -y
apt-get upgrade -y
adduser user
apt remove --purge lxd lxd-client   # remove the deb packages before switching to the snap
groupadd --system lxd
usermod -G lxd -a user              # add "user" to the lxd group so it can reach the daemon socket
snap install lxd
apt install zfsutils-linux          # ZFS userspace tools

lxd init (run as the sudo user)
lxc remote add host-1 192.168.100.1
lxc remote add host-2 192.168.100.2
lxc remote add host-3 192.168.100.3

Nothing special.

I’m confused by the fact that you mention both apt and snap. Are you using apt or snap? Also, was this cluster always at version 3.10 or did you upgrade from 3.0.x?

Oh, sorry, Snap only.
snap install lxd

I have always used snap; the cluster is not that old yet (started around 3.1), with auto refresh/updates enabled (stable channel).
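(Side note: the channel being tracked and the installed revision can be confirmed with the snap tooling; the exact output varies slightly between snapd versions.)

snap list lxd     # installed version, revision and the channel being tracked
snap info lxd     # available channels, e.g. stable/candidate/edge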

I can’t reproduce this problem. What I did:

  1. Configure node1 as first node of the cluster
  2. Add node2
  3. Add node3
  4. Run lxc cluster remove node2 --force
  5. Wipe node2 data
  6. Join node2 again

After this, lxc cluster list shows that all nodes are database nodes.

If you don’t have containers that you care about on host-2, the simplest solution would be to remove host-2, wipe it and join it again. Not sure what went wrong the first time.
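For reference, a rough sketch of that remove/wipe/rejoin flow, assuming the snap package and nothing worth keeping on host-2 (exact prompts may differ):

lxc cluster remove host-2 --force   # run from host-1 or host-3
sudo snap remove lxd                # on host-2: removes the snap and its data under /var/snap/lxd
sudo snap install lxd               # on host-2: fresh install
sudo lxd init                       # on host-2: answer "yes" when asked to join an existing cluster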

Otherwise, if you have containers on host-2 that you want to preserve, we’ll need to figure out some manual repair, but that might be tricky.

I’ll try reinstalling host-2 today and let you know if it works.
Strange… I will move the containers from host-2 to host-1.
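(In case it is useful to others, relocating a container between cluster members looks roughly like this; "c1" is just a placeholder name and the container must be stopped first.)

lxc stop c1
lxc move c1 --target host-1    # move the stopped container to the host-1 member
lxc start c1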

Ok, it works now… Still strange.

This time I ran lxd init as root instead of as my user with ‘sudo lxd init’. Could that make a difference?