Unable to do a recover-from-quorum-loss

Hi,

I am pretty new to this forum (I managed to find my way with lxc and lxd without becoming a member before), but I’m afraid I’ve got myself into some real trouble now, for which I definitely need some help…

I have an LXD cluster consisting of two physical nodes (I know, that is never enough). To get quorum I run one virtual machine on each of them, just to have more than two members. I know this set-up is wrong, but I figured that as long as I have the database on the two physical machines and on the virtual machines, I can always get quorum. I have now found out that I was awfully wrong!

I had a power outage and both physical machines went down. I found out that two of the three database copies are on the virtual machines and only one is on one of the two physical machines. Obviously the cluster did not come up, and I am also unable to do anything on either of the physical nodes (both lxc and lxd commands won’t react, except for “lxd cluster list-database”). So obviously I can also not start one of the VMs containing the other database instances to regain quorum, leaving me with only one database copy and a cluster that won’t start. I really start to wonder what I was thinking when I made this setup, but it’s for personal use and I lacked the physical hardware for three physical nodes.

I found out about the “lxd cluster recover-from-quorum-loss” command and tried to run it on the one physical machine that has an instance of the cluster database. That, however, does not appear to be working. The command simply hangs. I read somewhere that I had to make sure to shut down the LXD daemon first, so I ran “snap stop lxd.daemon --no-wait --disable” and thereafter “lxd cluster recover-from-quorum-loss”. Nothing happened. I then ran the command with the --debug option and now get the following output:

DBUG[09-02][05:28:30] Connecting to a local LXD over a Unix socket
DBUG[09-02][05:28:30] Sending request to LXD method = GET url=http://unix.socket/1.0 etag =

Thereafter, no reaction whatsoever. I think trying to recover from quorum loss is the way to go for me, because if I get this one physical node up and running, I can start the VM on it and regain quorum. But what should I do to get it online? It would even be okay for me to remove it from the cluster entirely, as long as I can get the containers and VMs that are running on it back online.

I hope some of you can help me out. And please, if you can: I have fair knowledge, but sometimes lack some of it, so please be precise about the steps to take and how to take them :wink:

Thanks so much in advance!

Kind regards,
Martijn

Hi @mmaanen

So that I understand the situation fully, can you describe the LXD cluster layout in more detail?

  • How many cluster members did you have in total?
  • How many were running on VMs and physical machines?
  • Can you not start some/all of the VMs or physical machines, or are they all running now but won’t initialise the LXD cluster?

Hello Tomp,

First of all, thank you so much for your swift reply!

I have two physical nodes running lxd, let’s say nodes 1 and 2. On node 1 I had a VM, let’s call this one node A, and on node 2 I had a VM, let’s call that one node B.

I have access to both 1 and 2. However, the database is distributed among 1, B and C. So the cluster will not come online, because 1 and 2 don’t have sufficient databases without either B or C. But that leaves me in a sort of “deadlock”, because as long as the cluster is not up, I can’t start VM B or C to get at least two databases.

Does this better describe my situation?

So basically, what I need is to get at least one of nodes 1 or 2 to start VM A or B, and then I think I can get it running from there.

It would even help if there is some sort of procedure to remove either A or B from the cluster and get it to work as a standalone server, so that at least I can start using my containers and VMs, and afterwards consider whether I want to rebuild a cluster setup.

I think you missed describing what node C was?

Are these VMs LXD VMs? So I think you’re saying that because you are using LXD VMs, you can’t start them because they are part of the hosting cluster?

Can you show the output of lxd cluster list-database please?

Also to stop LXD on a member try this:

systemctl stop snap.lxd.daemon.service snap.lxd.daemon.unix.socket

and make sure ps aux | grep lxd doesn’t show any LXD processes running.
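Putting it together, a rough sketch of the full sequence on the member that still holds a database copy (this assumes the snap packaging; the snap also ships a socket unit that can re-activate the daemon, which is likely why your earlier attempt appeared to hang):

systemctl stop snap.lxd.daemon.service snap.lxd.daemon.unix.socket
ps aux | grep lxd                      # should show no LXD processes
lxd cluster recover-from-quorum-loss   # answer the yes/no prompt
snap start lxd.daemon                  # bring LXD back up afterwards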

Hi tomp,

Thanks for your answers! You see, I even messed up my post describing my situation. I talked about B and C, where I should have been talking about A and B. So the description should be:

I have two physical nodes running lxd, let’s say nodes 1 and 2. On node 1 I had a VM, let’s call this one node A, and on node 2 I had a VM, let’s call that one node B.

I have access to both 1 and 2. However, the database is distributed among 1, A and B. So the cluster will not come online, because 1 and 2 don’t have sufficient databases without either A or B. But that leaves me in a sort of “deadlock”, because as long as the cluster is not up, I can’t start VM A or B to get at least two databases.

Your conclusion is correct, though: because I am using LXD VMs, which I can’t start because they are part of the hosting cluster, I can’t start VM A or B to get at least two databases.

Having said that, the suggestion to stop the LXD daemon worked, and I am now able to run “lxd cluster recover-from-quorum-loss”, which I think is what I need to do in this situation. But before I confirm this by typing yes at the question “Do you want to proceed? (yes/no)”, I will think over the consequences.

And an additional question for you: can I also run this on node 2, the one physical node which does NOT have a database, to regain control of that node as well?

Thanks again for your kind answers!

I’m assuming you’ve seen this, but if not it may be useful:

Once you’ve run it on one member, you should then be able to remove the VM members using lxc cluster remove <name> --force, at which point the cluster will become a 2-member cluster (which is allowed but doesn’t provide HA). At that point I think you should be able to start up the other physical cluster member and it should rejoin.
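As a rough sketch, using the member names from this thread (substitute whatever lxc cluster list shows on your side):

lxc cluster remove A --force
lxc cluster remove B --force
lxc cluster list   # should be left with the two physical members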

Any thoughts @mbordere ?

@tomp If I understand the code, then recover-from-quorum-loss will force a configuration into raft containing only the single node on which the recovery has been run. It is then up to LXD to add the other members back to the new cluster, which at that point only contains that single member. I’m not sure / I don’t know whether LXD does that.


Dear tomp,
Thank you so much for your explanation and your input. I got it up and running again. Basically, what I had was a non-HA cluster anyway, but it depended on the VMs. That was not the brightest setup, to say the least…

Having two nodes which are not HA suits my use case. So this worked out fine for me!

Kind regards,
Martijn
