When turning off the first LXD cluster node: no available dqlite leader server found

Hi,

When I reboot or turn off LXD host-1, the cluster reports:
lxc cluster list
Error: failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found

How can I set a leader (LXD host-3)? I also noticed that host-2 has no database. Is this normal?

+--------+----------------------------+----------+--------+-------------------+
|  NAME  |            URL             | DATABASE | STATE  |      MESSAGE      |
+--------+----------------------------+----------+--------+-------------------+
| host-1 | https://192.168.100.1:8443 | YES      | ONLINE | fully operational |
+--------+----------------------------+----------+--------+-------------------+
| host-2 | https://192.168.100.2:8443 | NO       | ONLINE | fully operational |
+--------+----------------------------+----------+--------+-------------------+
| host-3 | https://192.168.100.3:8443 | YES      | ONLINE | fully operational |
+--------+----------------------------+----------+--------+-------------------+

lxc cluster show for each node:

user@host-2: lxc cluster show host-2
server_name: host-2
url: https://192.168.100.2:8443
database: false
status: Online

user@host-3: lxc cluster show host-3
server_name: host-3
url: https://192.168.100.3:8443
database: true
status: Online
message: fully operational

user@host-1: lxc cluster show host-1
server_name: host-1
url: https://192.168.100.1:8443
database: true
status: Online
message: fully operational

Host config for each node:
user@host-3:~$ lxc info
config:
cluster.https_address: 192.168.100.3:8443
core.https_address: 192.168.100.3:8443
core.trust_password: true

user@host-1:~$ lxc info
config:
cluster.https_address: 192.168.100.1:8443
core.https_address: 192.168.100.1:8443
core.trust_password: true

user@host-2:~$ lxc info
config:
cluster.https_address: 192.168.100.2:8443
core.https_address: 192.168.100.2:8443
core.trust_password: true

What version of LXD are you using?

The fact that host-2 is not a database node is not normal, and might be a bug.

Usually with 3 nodes you have 3 database nodes, so if you reboot or shut down one of the three nodes, the other two remain fully operational. In your case you have only 2 database nodes, so turning off either of them makes the cluster lose quorum and become unavailable.
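For reference, one way to check which cluster members currently hold a copy of the database (a rough sketch, assuming the snap package of LXD 3.x; lxd sql may need sudo):

lxc cluster list                                 # DATABASE column shows YES for database members
sudo lxd sql local "SELECT * FROM raft_nodes"    # raft/dqlite members as recorded in the local store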

On all nodes (snap package):
lxc --version
3.10
lxd --version
3.10

Log from host-3 when host-1 is offline:
DBUG[02-18|09:48:50] Start database node id=3 address=192.168.100.3:8443
EROR[02-18|09:48:55] Failed to start the daemon: Failed to create raft factory: failed to create bolt store for raft logs: timeout
INFO[02-18|09:48:55] Starting shutdown sequence
DBUG[02-18|09:48:55] Not unmounting temporary filesystems (containers are still running)
INFO[02-18|09:48:55] Saving simplestreams cache
INFO[02-18|09:48:55] Saved simplestreams cache
Error: Failed to create raft factory: failed to create bolt store for raft logs: timeout
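(For anyone hitting the same error: on a snap install, daemon logs like the above can usually be retrieved with the standard snap tooling.)

snap logs lxd                     # recent output from the LXD daemon
journalctl -u snap.lxd.daemon     # full log of the LXD snap service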

So it looks like a bug.

Do you remember how you built this cluster? I would expect that you added and removed some nodes at some point. It would be useful to know the detailed lifecycle of the cluster, i.e. which nodes were added and removed, and when.

Yes,

lxd init on host-1 as the first node of the cluster, following the default instructions (LXD 3.1).
Reinstalled host-2 yesterday (after removing it with --force).
host-3 untouched since it joined the cluster.

apt-get update -y
apt-get upgrade -y
adduser user
apt remove --purge lxd lxd-client   # remove the deb packages before switching to the snap
groupadd --system lxd
usermod -G lxd -a user              # add "user" to the lxd group so it can reach the daemon socket
snap install lxd
apt install zfsutils-linux          # ZFS userspace tools

lxd init (run as the sudo user)
lxc remote add host-1 192.168.100.1
lxc remote add host-2 192.168.100.2
lxc remote add host-3 192.168.100.3

Nothing special.

I’m confused by the fact that you mention both apt and snap. Are you using apt or snap? Also, was this cluster always at version 3.10 or did you upgrade from 3.0.x?

Oh, sorry, Snap only.
snap install lxd

I have always used snap; the cluster is not that old yet (started around 3.1), with auto refresh/updates enabled (stable channel).
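(Side note: the channel being tracked and the installed revision can be confirmed with the snap tooling; the exact output varies slightly between snapd versions.)

snap list lxd     # installed version, revision and the channel being tracked
snap info lxd     # available channels, e.g. stable/candidate/edge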

I can’t reproduce this problem. What I did:

  1. Configure node1 as first node of the cluster
  2. Add node2
  3. Add node3
  4. Run lxc cluster remove node2 --force
  5. Wipe node2 data
  6. Join node2 again

After this, lxc cluster list shows that all nodes are database nodes.

If you don’t have containers that you care about on host-2, the simplest solution would be to remove host-2, wipe it and join it again. Not sure what went wrong the first time.
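For reference, a rough sketch of that remove/wipe/rejoin flow, assuming the snap package and nothing worth keeping on host-2 (exact prompts may differ):

lxc cluster remove host-2 --force   # run from host-1 or host-3
sudo snap remove lxd                # on host-2: removes the snap and its data under /var/snap/lxd
sudo snap install lxd               # on host-2: fresh install
sudo lxd init                       # on host-2: answer "yes" when asked to join an existing cluster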

Otherwise, if you have containers on host-2 that you want to preserve, we’ll need to figure out some manual repair, but that might be tricky.

I’ll try reinstalling host-2 today and let you know if it works.
Strange… I will move the containers from host-2 to host-1.
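(In case it is useful to others, relocating a container between cluster members looks roughly like this; "c1" is just a placeholder name and the container must be stopped first.)

lxc stop c1
lxc move c1 --target host-1    # move the stopped container to the host-1 member
lxc start c1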

Ok, it works now… Still strange.

This time I ran lxd init as root instead of as my user with ‘sudo lxd init’. Could that make a difference?