I am pretty new to this forum (I managed to find my way with lxc and lxd without becoming a member before), but I am afraid I have now gotten myself into real trouble, for which I definitely need some help.
I have an LXD cluster consisting of two physical nodes (I know, that is never enough). To get quorum, I run one virtual machine on each of them, just to have more than two nodes. I know this setup is wrong, but I figured that as long as the database was on the two physical machines and on the virtual machines, I could always get quorum. I have now found out that I was awfully wrong!
I had a power outage and both physical machines went down. I then found out that two of the three database instances are on the virtual machines and only one is on one of the two physical machines. Obviously the cluster did not come up, and I am unable to do anything on either of the physical nodes (both lxc and lxd commands hang, except for “lxd cluster list-database”). So I also cannot start either of the VMs holding the other database instances to regain quorum. That leaves me with only one database instance and a cluster that won’t start. I really wonder what I was thinking when I made this setup, but it’s for personal use and I lacked the hardware for three physical nodes.
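To spell out why I think the cluster is stuck, here is a minimal sketch of the majority rule as I understand it. The function name `has_quorum` is my own and not part of LXD, and I am assuming that a strict majority of the database members has to be reachable:

```python
def has_quorum(voters: int, reachable: int) -> bool:
    """Assumed majority rule: more than half of the database members must be reachable."""
    return reachable >= voters // 2 + 1


# My situation: 3 database members, of which only the one on a physical node is up.
# The other two live on VMs that I cannot start while the cluster is down.
print(has_quorum(voters=3, reachable=1))  # False

# What I would have if I could start one of the VMs.
print(has_quorum(voters=3, reachable=2))  # True
```

If that is right, one surviving database member out of three can never reach a majority by itself, which matches what I am seeing.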
I found out about the “lxd cluster recover-from-quorum-loss” command and tried to run it on the one physical machine that holds an instance of the cluster database. That does not appear to work, however: the command simply hangs. I read somewhere that I had to make sure the LXD daemon was shut down first, so I ran “snap stop lxd.daemon --no-wait --disable” and then “lxd cluster recover-from-quorum-loss”. Nothing happened. I then ran the command with the --debug option and now get the following output:
DBUG[09-02][05:28:30] Connecting to a local LXD over a Unix socket
DBUG[09-02][05:28:30] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
After that there is no reaction whatsoever. I think recovering from quorum loss is the way to go for me, because if I get this one physical node up and running, I can start the VM on it and regain quorum. But what should I do to get it online? It would even be okay for me to remove it from the cluster entirely, as long as I can get the containers and VMs that run on it back online.
I hope one of you can help me out. And please, if you can: I have fair knowledge, but I also lack some of it at times, so please be precise about the steps to take and how to take them.
Thanks so much in advance!