So now another server went down. And of course it was the Head of the cluster

Tony_Anytime · May 15, 2019, 2:33am

So now my cluster of 4 won’t work properly because LXD is not running on one server. It is hanging like the other server was.

lxc list
Error: failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found

lxd sql local "SELECT * FROM raft_nodes;"
All servers give this error for the above command:
Error: failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found

I wish there was a way to uncluster the servers, or to get the containers running without caring that the LXD cluster is not working. Or a safe mode where the containers run locally, so that you can do an export at the very least.

HELP! I need a solution. I can’t spend my life worrying about a server rebooting and its LXD going to crap.

I think what is happening is that when the server reboots after an apt upgrade, it loses its database:
sqlite3 /var/snap/lxd/common/lxd/database/local.db .dump
Error: unable to open database "/var/snap/lxd/common/lxd/database/local.db": unable to open database file

freeekanayaka · May 15, 2019, 7:09am

@Tony_Anytime with all the goodwill in the world (as clearly shown by @stgraber), we can’t help you if you keep posting confusing messages with little detail and no clear explanation of what the problem is and how you got there.

Note that you are not paying for any kind of support, so anything you get is on best effort basis. It’d be nice if you could make our job easier instead of harder. Thanks.

Tony_Anytime · May 15, 2019, 12:14pm

I am giving lots of details. I do appreciate your help. Unfortunately this is not the first time I have been here with this problem; it seems to be a problem with this version of LXD, which I had on 5 servers. I have been working on this problem since Friday. We are losing thousands of dollars a day, credibility and customers, and I may lose my job. So pardon my let’s-get-this-fixed attitude. But I do appreciate your help, programmer to programmer.

Because this has happened 3 times before, I know what happens. The server OS gets upgraded via apt upgrade, and then when it gets rebooted, LXD 3.0.3 fails to start. It seems to be a problem with the database losing sync or getting corrupted, and then LXD hangs. Perhaps the OS upgrade also upgraded sqlite.

I am using this in a production environment. I have been using LXC since version 1. These are live containers. As happened last time, the containers on the other machines are still running, though I cannot access them.

Originally, the server MOE was the first to go down on Friday. I spent all weekend trying to get it going. Finally I gave up on support from you guys; sorry, I love you, but you went there. So I erased LXD and restarted it. The server is running, but there are some issues, and I am not sure of the best way to get the containers going short of copying data and reinstalling all programs on 10 containers. Thankfully most of these are low priority; I got the high-priority ones running from a backup container on another server.

It seems that my cluster of 5 is no more redundant or fail-safe than a cluster of 4 or 3. It is worse, because one server going down causes all of them to go down to a certain extent.

Presently, I have two clusters: one with a single machine, MOE, and the original with 4. The original is the one we are talking about now. I am still doing recovery on the first one. I am having to rebuild my containers even though I have full backups.

Now for the problem at hand,
I have a cluster of 4
Larry, Curlyjoe, Joe, Chemp
Larry had a hardware problem and when it rebooted, it gets the same problem that Moe had on Friday.
LXD is stuck, and the other machines can not even do lxc list. Larry was the head of the cluster.

Every one of the servers in this cluster gives this on lxc list:
Error: failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found

I would think the first thing to do is to make another machine the leader, and then erase the db on Larry and have it resync. I don’t know why this doesn’t happen already. I am running version 3.0.3, installed via apt, on these servers (Ubuntu 18.04).

Do you need any other info? Let me know…

And your help is greatly appreciated, because otherwise reinstalling everything from scratch is my only solution.

freeekanayaka · May 15, 2019, 2:01pm

The first observation is that if you are deploying LXD with apt (instead of snap), then you must ensure that you upgrade all servers more or less at the same time, because all nodes in the cluster need to run the same LXD version. If you upgrade only one server, reboot it, and leave the others with a different LXD version, that can be a problem.
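As a rough sketch of that version check: `same_version` below is a hypothetical helper (not an LXD command), and it assumes you have already collected the output of `lxd --version` from each node by hand.

```shell
# Return success only if every version string passed in is identical.
# same_version is a hypothetical helper, not part of LXD.
same_version() {
    first=$1
    for v in "$@"; do
        [ "$v" = "$first" ] || return 1
    done
    return 0
}

# Example with versions collected from each node via `lxd --version`:
if same_version 3.0.3 3.0.3 3.0.3 3.0.3; then
    echo "all nodes match"
else
    echo "version mismatch"
fi
```

If the check reports a mismatch, bring the lagging nodes up to the same package version before restarting anything.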

All that being said, a cluster can survive the loss of one database node, not two. So if you lost 2 database nodes you’re stuck. It’s in our roadmap to improve this behavior and promote spare servers to database servers as needed, but we’re not there yet. The simplest option in your case is to try to bring back online at least one of the two servers you lost (either Larry or Moe).
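The arithmetic behind "one, not two" can be sketched as follows, assuming the usual Raft majority rule and three voting database nodes. The function names are illustrative only and not part of LXD.

```shell
# Majority (quorum) needed among N voting database nodes: floor(N/2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of voting nodes that can be lost while a leader can still be elected.
tolerated_failures() {
    echo $(( $1 - $(quorum "$1") ))
}

quorum 3              # prints 2
tolerated_failures 3  # prints 1
```

With three database nodes, two must be reachable to elect a leader, which is why losing a second one leaves the remaining nodes reporting that no leader is available.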

Tony_Anytime · May 15, 2019, 2:17pm

MOE is gone; it is in another cluster now, on snap. So Larry is it. I believe that they are all on the same version, 3.0.3, since that has not changed in a while.
Curlyjoe should still have the database. See below, from when this first started happening.
Any way to uncluster a server?

Tony_Anytime · May 15, 2019, 4:08pm

This is time sensitive. Any way to get a faster response on this?

freeekanayaka · May 15, 2019, 4:15pm

On larry, please try:

systemctl stop lxd
mv /var/lib/lxd/database/global /var/lib/lxd/database/global.bak
systemctl start lxd

and restart curlyjoe as well:

systemctl restart lxd

that should hopefully recover larry by syncing from curlyjoe.

Tony_Anytime · May 15, 2019, 4:28pm

They both seem to hang.

lxd.service - LXD - main daemon
Loaded: loaded (/lib/systemd/system/lxd.service; indirect; vendor preset: enabled)
Active: activating (start-post) since Wed 2019-05-15 12:35:20 EDT; 43s ago
Docs: man:lxd(1)
Process: 12648 ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load (code=exited, status=0/SUCCESS)
Main PID: 13231 (lxd); Control PID: 13241 (lxd)
Tasks: 45
CGroup: /system.slice/lxd.service
├─13231 /usr/lib/lxd/lxd --group lxd --logfile=/var/log/lxd/lxd.log
└─13241 /usr/lib/lxd/lxd waitready --timeout=600

May 15 12:35:20 LARRY systemd[1]: Starting LXD - main daemon...
May 15 12:35:20 LARRY lxd[13231]: t=2019-05-15T12:35:20-0400 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will
May 15 12:35:23 LARRY lxd[13231]: t=2019-05-15T12:35:23-0400 lvl=warn msg="Raft: no known peers, aborting election"
~
~

freeekanayaka · May 15, 2019, 6:12pm

And the log on curlyjoe?

Tony_Anytime · May 15, 2019, 6:22pm

lxd log seems fine on Curlyjoe

t=2019-05-15T13:57:16-0400 lvl=info msg="Kernel uid/gid map:"
t=2019-05-15T13:57:16-0400 lvl=info msg=" - u 0 0 4294967295"
t=2019-05-15T13:57:16-0400 lvl=info msg=" - g 0 0 4294967295"
t=2019-05-15T13:57:16-0400 lvl=info msg="Configured LXD uid/gid map:"
t=2019-05-15T13:57:16-0400 lvl=info msg=" - u 0 100000 65536"
t=2019-05-15T13:57:16-0400 lvl=info msg=" - g 0 100000 65536"
t=2019-05-15T13:57:16-0400 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
t=2019-05-15T13:57:16-0400 lvl=info msg="Kernel features:"
t=2019-05-15T13:57:16-0400 lvl=info msg=" - netnsid-based network retrieval: no"
t=2019-05-15T13:57:16-0400 lvl=info msg=" - unprivileged file capabilities: yes"
t=2019-05-15T13:57:16-0400 lvl=info msg="Initializing local database"
t=2019-05-15T14:07:16-0400 lvl=info msg="LXD 3.0.3 is starting in normal mode" path=/var/lib/lxd
t=2019-05-15T14:07:16-0400 lvl=info msg="Kernel uid/gid map:"
t=2019-05-15T14:07:16-0400 lvl=info msg=" - u 0 0 4294967295"
t=2019-05-15T14:07:16-0400 lvl=info msg=" - g 0 0 4294967295"
t=2019-05-15T14:07:16-0400 lvl=info msg="Configured LXD uid/gid map:"
t=2019-05-15T14:07:16-0400 lvl=info msg=" - u 0 100000 65536"
t=2019-05-15T14:07:16-0400 lvl=info msg=" - g 0 100000 65536"
t=2019-05-15T14:07:16-0400 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
t=2019-05-15T14:07:16-0400 lvl=info msg="Kernel features:"
t=2019-05-15T14:07:16-0400 lvl=info msg=" - netnsid-based network retrieval: no"
t=2019-05-15T14:07:16-0400 lvl=info msg=" - unprivileged file capabilities: yes"
t=2019-05-15T14:07:16-0400 lvl=info msg="Initializing local database"
t=2019-05-15T14:17:16-0400 lvl=info msg="LXD 3.0.3 is starting in normal mode" path=/var/lib/lxd
t=2019-05-15T14:17:16-0400 lvl=info msg="Kernel uid/gid map:"
t=2019-05-15T14:17:16-0400 lvl=info msg=" - u 0 0 4294967295"
t=2019-05-15T14:17:16-0400 lvl=info msg=" - g 0 0 4294967295"
t=2019-05-15T14:17:16-0400 lvl=info msg="Configured LXD uid/gid map:"
t=2019-05-15T14:17:16-0400 lvl=info msg=" - u 0 100000 65536"
t=2019-05-15T14:17:16-0400 lvl=info msg=" - g 0 100000 65536"
t=2019-05-15T14:17:16-0400 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
t=2019-05-15T14:17:16-0400 lvl=info msg="Kernel features:"
t=2019-05-15T14:17:16-0400 lvl=info msg=" - netnsid-based network retrieval: no"
t=2019-05-15T14:17:16-0400 lvl=info msg=" - unprivileged file capabilities: yes"
t=2019-05-15T14:17:16-0400 lvl=info msg="Initializing local database"

freeekanayaka · May 15, 2019, 6:24pm

What’s the output of:

echo "select * from raft_nodes" | sqlite3 /var/lib/lxd/database/local.db

on both larry and curlyjoe? (to be run as root).

Tony_Anytime · May 15, 2019, 6:25pm

root@CURLYJOE:/home/ic2000# echo "select * from raft_nodes" | sqlite3 /var/lib/lxd/database/local.db
1|64.71.77.29:8443
4|64.71.77.80:8443
5|64.71.77.13:8443

root@LARRY:/home/ic2000# echo "select * from raft_nodes" | sqlite3 /var/lib/lxd/database/local.db
1|64.71.77.29:8443
4|64.71.77.80:8443
5|64.71.77.13:8443

freeekanayaka · May 15, 2019, 6:25pm

That looks correct. Mmmh…

freeekanayaka · May 15, 2019, 6:28pm

Wait, actually the log of curlyjoe looks weird. It seems to be respawning…

Can you try to launch lxd by hand on both nodes?

systemctl stop lxd
lxd --verbose --debug

as root.

Tony_Anytime · May 15, 2019, 6:28pm

root@CHEMP:/home/ic2000# echo "select * from raft_nodes" | sqlite3 /var/lib/lxd/database/local.db
1|64.71.77.29:8443
4|64.71.77.80:8443
5|64.71.77.13:8443

root@JOE:/home/ic2000# echo "select * from raft_nodes" | sqlite3 /var/lib/lxd/database/local.db
1|64.71.77.29:8443
4|64.71.77.80:8443
5|64.71.77.13:8443

Tony_Anytime · May 15, 2019, 6:30pm

root@LARRY:/home/ic2000# systemctl stop lxd
Warning: Stopping lxd.service, but it can still be activated by:
lxd.socket
root@LARRY:/home/ic2000# lxd --verbose --debug
DBUG[05-15|14:29:24] Connecting to a local LXD over a Unix socket
DBUG[05-15|14:29:25] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=

root@CURLYJOE:/home/ic2000# lxd --verbose --debug
DBUG[05-15|14:29:27] Connecting to a local LXD over a Unix socket
DBUG[05-15|14:29:27] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=

Tony_Anytime · May 15, 2019, 6:31pm

And there they hang.

freeekanayaka · May 15, 2019, 6:31pm

Please make sure that there is not another lxd process running, and try again.

Also, please double check that:

which lxd

points to /usr/bin/lxd.

freeekanayaka · May 15, 2019, 6:38pm

And ps aux | grep lxd?

freeekanayaka · May 15, 2019, 6:40pm

The output that you pasted seems to indicate that you ran lxc --verbose --debug and not lxd --verbose --debug.