Resolved: LXD cluster dying - "Failed to fetch http://unix.socket/1.0: 500 Internal Server Error" and my files are missing

I have three servers in a cluster, moe, larry, and curly, in production. They had been working fine, but after a reboot, lxc on curly won't work: lxc list never finishes. Running lxc list on the other two gives "Failed to fetch http://unix.socket/1.0: 500 Internal Server Error". This basically means I have three servers with hundreds of websites and applications at risk of being lost.

I don’t understand why, if one server goes down, everything stops working. What is the point of a cluster then? Do I need a fourth server?

Normally I would just blow away the bad server and reinstall from backup. The weird part is that the backup files, both the ones on a separate drive and the copies I make from it, are also not showing. It is as if the zpool is corrupt, but why would that affect files outside of it? I know this doesn’t make sense. I copied /var/lib/lxd to a separate drive and those files are there, but the files inside the containers are missing.

Anyway, I need to either restart the LXD cluster and get curly back online, or somehow get the data out of it and set it all up again.

Any and all help is welcome. This has basically ruined my Sunday, and perhaps the rest of the week. I may have to abandon the cluster idea and go back to individual LXD hosts if I can’t figure this out.


From the logs it looks like your two other nodes (larry and moe) can’t connect to each other anymore, or something like that. You’ll probably need to provide more info to debug this. I’m technically off these days, but I might be able to help you more tomorrow.

Unless your ZFS pool(s) is/are corrupted (which would be surprising), I don’t think there’s any risk of losing your containers.

As a quick checklist:

  1. make sure connectivity works ok between the cluster nodes
  2. make sure all nodes run the exact same LXD version (snap refresh or apt install, if not)
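The two checks above can be sketched as shell commands. The addresses below are the ones from your logs (substitute your own), and 8443 is the default LXD cluster port:

```shell
# 1. Check connectivity from each node to every other node
#    on the LXD cluster port.
for addr in 64.71.77.29 64.71.77.32 64.71.77.80; do
    nc -zvw3 "$addr" 8443 && echo "$addr: port 8443 reachable" \
                          || echo "$addr: port 8443 NOT reachable"
done

# 2. Check the LXD version on each node -- they must all match.
lxd --version       # for an apt install
snap list lxd       # for a snap install
```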

Your help is appreciated. I am technically off too, but trying to fix this before “work” tomorrow.
This is larry, lxd --group lxd -d
DBUG[09-23|12:29:33] Connecting to a local LXD over a Unix socket
DBUG[09-23|12:29:33] Sending request to LXD etag= method=GET url=http://unix.socket/1.0
INFO[09-23|12:29:33] LXD 3.0.1 is starting in normal mode path=/var/lib/lxd
INFO[09-23|12:29:33] Kernel uid/gid map:
INFO[09-23|12:29:33] - u 0 0 4294967295
INFO[09-23|12:29:33] - g 0 0 4294967295
INFO[09-23|12:29:33] Configured LXD uid/gid map:
INFO[09-23|12:29:33] - u 0 100000 65536
INFO[09-23|12:29:33] - g 0 100000 65536
WARN[09-23|12:29:33] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[09-23|12:29:33] Initializing local database
INFO[09-23|12:29:33] Initializing database gateway
INFO[09-23|12:29:33] Start database node address=64.71.77.80:8443 id=4
EROR[09-23|12:29:38] Failed to start the daemon: failed to create raft factory: failed to create bolt store for raft logs: timeout
INFO[09-23|12:29:38] Starting shutdown sequence
DBUG[09-23|12:29:38] Not unmounting temporary filesystems (containers are still running)
INFO[09-23|12:29:38] Saving simplestreams cache
INFO[09-23|12:29:38] Saved simplestreams cache
Error: failed to create raft factory: failed to create bolt store for raft logs: timeout

This is moe, lxd --group lxd -d
DBUG[09-23|12:29:22] Connecting to a local LXD over a Unix socket
DBUG[09-23|12:29:22] Sending request to LXD etag= method=GET url=http://unix.socket/1.0
INFO[09-23|12:29:22] LXD 3.0.1 is starting in normal mode path=/var/lib/lxd
INFO[09-23|12:29:22] Kernel uid/gid map:
INFO[09-23|12:29:22] - u 0 0 4294967295
INFO[09-23|12:29:22] - g 0 0 4294967295
INFO[09-23|12:29:22] Configured LXD uid/gid map:
INFO[09-23|12:29:22] - u 0 100000 65536
INFO[09-23|12:29:22] - g 0 100000 65536
WARN[09-23|12:29:22] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[09-23|12:29:22] Initializing local database
INFO[09-23|12:29:22] Initializing database gateway
INFO[09-23|12:29:22] Start database node address=64.71.77.32:8443 id=2
EROR[09-23|12:29:27] Failed to start the daemon: failed to create raft factory: failed to create bolt store for raft logs: timeout
INFO[09-23|12:29:27] Starting shutdown sequence
DBUG[09-23|12:29:27] Not unmounting temporary filesystems (containers are still running)
INFO[09-23|12:29:27] Saving simplestreams cache
INFO[09-23|12:29:27] Saved simplestreams cache
Error: failed to create raft factory: failed to create bolt store for raft logs: timeout
I turned off the firewall.
This is Curlyjoe, the failed node, lxd --group lxd -d
INFO[09-23|12:34:59] LXD 3.0.1 is starting in normal mode path=/var/lib/lxd
INFO[09-23|12:34:59] Kernel uid/gid map:
INFO[09-23|12:34:59] - u 0 0 4294967295
INFO[09-23|12:34:59] - g 0 0 4294967295
INFO[09-23|12:34:59] Configured LXD uid/gid map:
INFO[09-23|12:34:59] - u 0 100000 65536
INFO[09-23|12:34:59] - g 0 100000 65536
WARN[09-23|12:34:59] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[09-23|12:34:59] Initializing local database
INFO[09-23|12:34:59] Initializing database gateway
INFO[09-23|12:34:59] Start database node address=64.71.77.29:8443 id=1
INFO[09-23|12:34:59] Raft: Restored from snapshot 389-2134735-1537710424317
INFO[09-23|12:34:59] Raft: Initial configuration (index=1811028): [{Suffrage:Voter ID:1 Address:0} {Suffrage:Voter ID:2 Address:64.71.77.32:8443} {Suffrage:Voter ID:4 Address:64.71.77.80:8443}]
INFO[09-23|12:34:59] Raft: Node at 64.71.77.29:8443 [Follower] entering Follower state (Leader: “”)
INFO[09-23|12:34:59] LXD isn’t socket activated
INFO[09-23|12:34:59] Starting /dev/lxd handler:
INFO[09-23|12:34:59] - binding devlxd socket socket=/var/lib/lxd/devlxd/sock
INFO[09-23|12:34:59] REST API daemon:
INFO[09-23|12:34:59] - binding Unix socket socket=/var/lib/lxd/unix.socket
INFO[09-23|12:34:59] - binding TCP socket socket=64.71.77.29:8443
INFO[09-23|12:34:59] Initializing global database
DBUG[09-23|12:34:59] Found cert k=0
DBUG[09-23|12:34:59] Failed to establish gRPC connection with 64.71.77.29:8443: 503 Service Unavailable
DBUG[09-23|12:34:59] Failed to establish gRPC connection with 64.71.77.32:8443: 503 Service Unavailable
DBUG[09-23|12:34:59] Database error: failed to begin transaction: cannot start a transaction within a transaction
EROR[09-23|12:34:59] Failed to start the daemon: failed to open cluster database: failed to ensure schema: failed to begin transaction: cannot start a transaction within a transaction
INFO[09-23|12:34:59] Starting shutdown sequence
INFO[09-23|12:34:59] Stopping REST API handler:
INFO[09-23|12:34:59] - closing socket socket=64.71.77.29:8443
INFO[09-23|12:34:59] - closing socket socket=/var/lib/lxd/unix.socket
INFO[09-23|12:34:59] Stopping /dev/lxd handler
INFO[09-23|12:34:59] - closing socket socket=/var/lib/lxd/devlxd/sock
INFO[09-23|12:34:59] Stop database gateway
INFO[09-23|12:34:59] Stop raft instance
INFO[09-23|12:34:59] Stopping REST API handler:
INFO[09-23|12:34:59] Stopping /dev/lxd handler
INFO[09-23|12:34:59] Stopping REST API handler:
INFO[09-23|12:34:59] Stopping /dev/lxd handler
DBUG[09-23|12:34:59] Not unmounting temporary filesystems (containers are still running)
INFO[09-23|12:34:59] Saving simplestreams cache
INFO[09-23|12:34:59] Saved simplestreams cache
Error: failed to open cluster database: failed to ensure schema: failed to begin transaction: cannot start a transaction within a transaction

They seem to have the same version. I am afraid of rebooting the working nodes.

This is what fixed it:
sudo chown root:lxd /var/lib/lxd/unix.socket
then
sudo systemctl stop lxd.socket
sudo systemctl start lxd.socket

Something must have happened during an apt upgrade and then the reboot.
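For anyone checking whether they are in the same situation, a quick way to verify the socket ownership before and after the fix (the path below assumes an apt install under /var/lib/lxd):

```shell
# The socket should be owned by root:lxd; if the group is wrong,
# clients in the lxd group get the 500 error when connecting.
ls -l /var/lib/lxd/unix.socket

# After fixing ownership and restarting lxd.socket, confirm the
# daemon answers again:
lxc list
```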

For anybody stumbling on this: the “Error: failed to create raft factory: failed to create bolt store for raft logs: timeout” message means that another lxd process is holding a lock on the db file. So make sure to kill any dangling lxd process and retry.
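A rough way to check for such a dangling process (the database path below assumes an apt install; under the snap it lives elsewhere):

```shell
# List any running lxd daemons.
ps -C lxd -o pid,ppid,cmd

# See which process, if any, still has the raft database files open.
sudo lsof +D /var/lib/lxd/database

# If a stale daemon is holding the lock, stop it and retry.
sudo kill <pid>        # kill -9 only as a last resort
lxd --group lxd -d
```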

I’m having the same problem. Did you find a fix?

I lost a connected server with no way to recover it, and now the main server’s LXD service won’t start, as it can’t find the second server.

I’ve had the same problem in the past and a reboot of the second server fixed it, but now that server is dead and I’m stuck.

I just don’t know how to break the link to the second server if I can’t get into the lxc command-line tool.
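I can’t test your setup, but the usual escape hatches look roughly like this. Note these subcommands were added to LXD over time, so check lxc cluster --help and the LXD clustering documentation for your version:

```shell
# If the surviving daemon still starts: force-remove the dead member.
lxc cluster list
lxc cluster remove <dead-member-name> --force

# If the daemon itself won't start because it lost database quorum,
# newer LXD versions ship a recovery subcommand; run it on the
# surviving node while the LXD daemon is stopped.
sudo lxd cluster recover-from-quorum-loss
```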

This might help you get to your files: if you are using ZFS, it is possible to mount a container’s folder structure and get at the data:

sudo zfs list

sudo zfs mount zpool/containers/containername

cd to /var/lib/lxd/…/…/…/ (the mount point listed in zfs list)

cd rootfs, and you’re into the system’s files.

However, as you said, a proper fix is better; starting again is not preferable. Any help on a solution would be greatly appreciated :slight_smile:
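Put together, the rescue steps above look roughly like this. The pool, dataset, and destination names are examples; take the real ones from your own zfs list output:

```shell
# Find the container's dataset and its mount point.
sudo zfs list

# Mount the container's dataset (example name, adjust to yours).
sudo zfs mount zpool/containers/containername

# Ask ZFS for the actual mount point rather than guessing the path.
MP=$(sudo zfs get -H -o value mountpoint zpool/containers/containername)

# Copy the container's root filesystem out to a rescue location,
# preserving permissions, hard links, ACLs, and xattrs.
sudo rsync -aHAX "$MP/rootfs/" /mnt/rescue/containername/
```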

I was wondering: if I were to reinstall my apt LXD installation, would I lose all the containers? Did you have to resort to a fresh start? Was that the case for you?

Ben