I am giving lots of details, and I do appreciate your help. Unfortunately this is not the first time I am here with this problem; it seems to be a problem with this version of LXD, which I had on 5 servers. I have been working on this problem since Friday. We are losing thousands of dollars a day, along with credibility and customers, and I may lose my job. So pardon my let's-get-this-fixed attitude. But I do appreciate your help, programmer to programmer.
Because this has happened 3 times before, I know what happens. The server OS gets upgraded via apt upgrade, and then when it gets rebooted, LXD 3.0.3 fails to start. It seems to be a problem with the database losing sync or getting corrupted, and then LXD hangs. Perhaps the OS upgrade also upgraded sqlite.
I am using this in a production environment. I have been using LXC since version 1. These are live containers. As happened last time, the containers on the other machines are still running, though I cannot access them.
Originally, the server MOE was the first to go down on Friday. I spent all weekend trying to get it going. Finally I gave up on support from you guys (sorry, I love you, but you weren't there). So I erased LXD and restarted it. The server is running, but there are some issues, and I am not sure of the best way to get the containers going short of copying the data and reinstalling all programs on 10 containers. Thankfully most of these are low priority; I got the high-priority ones running from a backup container on another server.
It seems that my cluster of 5 is no more redundant or fail-safe than a cluster of 4 or 3. It is worse, because one machine going down causes all of them to go down to a certain extent.
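For comparison, here is what I would have expected on paper. This is only a minimal sketch of majority-quorum arithmetic, assuming the cluster database (dqlite is Raft-based) needs a majority of its voting members to elect a leader. The function names are my own, and how many members LXD 3.0.3 actually treats as voters is an assumption I have not verified.

```python
def quorum(voters: int) -> int:
    """Smallest majority of `voters` voting database members (hypothetical helper)."""
    if voters < 1:
        raise ValueError("need at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority can still be reached."""
    return voters - quorum(voters)


if __name__ == "__main__":
    # Assumption: every listed member votes; the real voter count may differ.
    for n in (3, 4, 5):
        print(f"{n} voters: quorum {quorum(n)}, survives {tolerated_failures(n)} down")
```

By this arithmetic, losing a single machine should not leave the rest of the cluster without a leader, which is why what I am seeing surprises me.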
Presently, I have two clusters: one with a single machine, MOE, and the original with 4. The cluster of 4 is the one we are talking about now. I am still doing recovery on the first one. I am having to rebuild my containers even though I have full backups.
Now for the problem at hand.
I have a cluster of 4:
Larry, Curlyjoe, Joe, Chemp
Larry had a hardware problem, and when it rebooted, it got the same problem that Moe had on Friday.
LXD is stuck, and the other machines cannot even do lxc list. Larry was the head of the cluster.
Every one of the servers in this cluster gives this on lxc list:
Error: failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found
I would think the first thing to do is to make another machine the leader, and then erase the db on Larry and have it resync. I don't know why this doesn't happen already. I am running version 3.0.3, installed via apt, on these servers, which run Ubuntu 18.04.
Do you need any other info? Let me know…
And your help is greatly appreciated, because otherwise reinstalling everything from scratch is my only solution.