LXD fails to connect to global database

I’ve been playing with MAAS and LXD clustering. I’m able to provision new hosts which add themselves to an existing LXD cluster. With a 2-host cluster, after releasing 1 host in MAAS, lxc cluster list would show 1 node missing; lxc cluster remove --force would clear this up and I could reprovision. With a 3-host cluster, after releasing 2 hosts, lxc cluster list hangs indefinitely.
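For context, each new host runs lxd init --preseed from curtin with a generated file shaped roughly like this (sketch only: the hostname and file path are made up, certificate and password are elided, and 10.41.0.3 is the existing cluster member):

# Generated by my curtin/sed step on the joining host (not verbatim):
cat > /root/lxd-preseed.yaml <<EOF
cluster:
  enabled: true
  server_name: node2
  server_address: 10.41.0.191:8443
  cluster_address: 10.41.0.3:8443
  cluster_certificate: |
    ...
  cluster_password: ...
EOF
cat /root/lxd-preseed.yaml | lxd init --preseed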
After restarting LXD, lxd.log shows:

t=2020-02-12T21:02:18+0000 lvl=info msg="LXD 3.20 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-02-12T21:02:18+0000 lvl=info msg="Kernel uid/gid map:"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - u 0 0 4294967295"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - g 0 0 4294967295"
t=2020-02-12T21:02:18+0000 lvl=info msg="Configured LXD uid/gid map:"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - u 0 1000000 1000000000"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - g 0 1000000 1000000000"
t=2020-02-12T21:02:18+0000 lvl=info msg="Kernel features:"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - netnsid-based network retrieval: no"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - uevent injection: no"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - seccomp listener: no"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - seccomp listener continue syscalls: no"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - unprivileged file capabilities: yes"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - cgroup layout: hybrid"
t=2020-02-12T21:02:18+0000 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - shiftfs support: disabled"
t=2020-02-12T21:02:18+0000 lvl=info msg="Initializing local database"
t=2020-02-12T21:02:18+0000 lvl=info msg="Starting /dev/lxd handler:"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-02-12T21:02:18+0000 lvl=info msg="REST API daemon:"
t=2020-02-12T21:02:18+0000 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-02-12T21:02:18+0000 lvl=info msg=" - binding TCP socket" socket=10.41.0.3:8443
t=2020-02-12T21:02:18+0000 lvl=info msg="Initializing global database"
t=2020-02-12T21:03:45+0000 lvl=warn msg="Failed connecting to global database (attempt 6): failed to create dqlite connection: no available dqlite leader server found"
t=2020-02-12T21:03:58+0000 lvl=warn msg="Failed connecting to global database (attempt 7): failed to create dqlite connection: no available dqlite leader server found"

lxd sql local "select * from raft_nodes" shows:

+----+------------------+------+
| id |     address      | role |
+----+------------------+------+
| 1  | 10.41.0.3:8443   | 0    |
| 3  | 10.41.0.191:8443 | 0    |
| 4  | :8443            | 2    |
+----+------------------+------+
Id 1 is the main cluster node and id 3 is the released MAAS node. I’m not sure what id 4 is, and I’m not quite sure how to go about cleaning up these tables, so I could use some help or other suggestions.

:8443 as a raft address is definitely going to cause problems; I’m not sure how that managed to make it in there without triggering an error much earlier! @freeekanayaka

In a 3-node cluster, you need two nodes to be online for the database to be functional.
If you just released two of the three nodes, then your cluster is now broken and cannot really be fixed other than by using disaster recovery commands like lxd cluster recover-from-quorum-loss.

You should ALWAYS cleanly remove nodes through LXD with lxc cluster remove BEFORE taking the system offline. The lxc cluster remove --force command is only meant to handle hardware failure cases.
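For example, with a member named node2 (name made up), the clean sequence before releasing the host in MAAS would be roughly:

# Run from any cluster member while node2 is still online:
lxc cluster remove node2
# Only once that succeeds, release/power off the host in MAAS.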

An LXD cluster with fewer than 3 configured nodes has a single database server and a quorum of one. If that server goes away, your cluster is done.

An LXD cluster with 3 or more configured nodes has three database servers and a quorum of two. If you have more nodes, they will get promoted as database nodes go offline, giving additional stability. But if at any point your cluster drops below 2 active database nodes, your database will be offline until you have two active database nodes again.
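If you want to check which members are currently acting as database servers, something like this should do it:

# The DATABASE column in the member list shows who holds a database role:
lxc cluster list
# And the raft view from the local node, same query you ran above:
lxd sql local "select * from raft_nodes"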

Thanks for the quick reply. I’ll check my notes and see if I can reproduce the steps that got :8443 into the raft address.

Tried running lxd cluster recover-from-quorum-loss, but it appears to just hang (it’s been running for about 2-3 min).

‘Always remove nodes cleanly…’ noted. I’m not sure of the correct way to reattach a node to a cluster after it was released (without removing it cleanly). I would get an error stating the node already existed when it was redeployed.

Currently, LXD is not totally up since I’m unable to do anything with the global database. lxd sql global .schema returns:

Error: failed to request dump: Get http://unix.socket/internal/sql?database=global&schema=1: EOF

and lxc list just hangs.

Is the way forward to fix the database or something else? I’d like to get LXD up so I can get to a container on this host.

https://linuxcontainers.org/lxd/docs/master/clustering#disaster-recovery may be of help.

It specifically requests that the LXD daemon be completely turned off on the machine this is run on, so that it can edit the database offline, after which it can be brought back up. I wonder if that was the issue for you.
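Roughly, the sequence on the surviving node should look like this (assuming the snap package; adjust the service commands for other installs):

# Stop the LXD daemon so the database can be edited offline:
sudo snap stop lxd
# Turn this node into a standalone database node:
sudo lxd cluster recover-from-quorum-loss
# Bring LXD back up:
sudo snap start lxd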

It seems that we do allow :8443 as a value for the node address when running lxd init. I will add additional validation, since it’s clearly going to break as soon as you have more than one node.

We could have lxd init ask for both cluster.listen_address and core.listen_address (defaulting to the same value). It’s fine to use :8443 for core.listen_address, but cluster.listen_address must be a specific IP (and cannot be changed later on).
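As a rough sketch of that split with the keys that exist today (core.https_address is a real config key; the cluster-facing address currently comes from server_address in the join preseed):

# A bare port is fine for the plain API listener; it binds all addresses on 8443:
lxc config set core.https_address :8443
# But the address other cluster members dial has to be concrete, e.g. in the preseed:
#   server_address: 10.41.0.3:8443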

Thanks Stephane. I need to read the docs more closely. The instructions worked as stated. lxd sql local "select * from raft_nodes" now shows:

+----+----------------+------+
| id |    address     | role |
+----+----------------+------+
| 1  | 10.41.0.3:8443 | 0    |
+----+----------------+------+

I found the cause of getting into this wonky state. I was provisioning 2 nodes with MAAS, using curtin to instruct them to attach to the LXD cluster. I do some sed magic to alter a preseed file, which is then passed to lxd init. Basically, sed needs to add the node’s hostname and IP to the preseed file, and the IP turned out to be empty due to a bug on my side. Thus the server_address in the preseed file for one of the nodes turned out to be:

server_address: :8443

and as you stated, LXD is not happy with that.
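In case it helps anyone else generating preseeds this way, a small guard in the provisioning script would have caught it (variable and placeholder names below are made up):

# NODE_IP is what my sed step substitutes into the preseed template:
if [ -z "${NODE_IP}" ]; then
    echo "refusing to write preseed: node IP is empty" >&2
    exit 1
fi
sed -i "s/__SERVER_ADDRESS__/${NODE_IP}:8443/" lxd-preseed.yaml
# The resulting line should always be a concrete IP, e.g.:
#   server_address: 10.41.0.191:8443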