I have a cluster of 4 servers. After running apt upgrade on one of them and rebooting, all my servers show a blank lxc list, and lxc cluster show says there is no cluster.
The interesting thing, thankfully, is that the actual running containers are still working, and I can even see inside them. But of course no LXC commands work on them. lxd --debug --group lxd seems fine on 3 out of 4 servers. The containers that are not running show up blank.
So at this point I am afraid of rebooting the production servers and making the problem worse. I tried rebooting the fourth server, and that does not seem to make anything better.
It seems the LXD cluster database is bad or stuck. Is there a way to rebuild it, or is this thing bound to crash and burn? Any and all help, ideas, etc. are welcome.
More info:
lxc --debug list
DBUG[02-08|00:02:16] Connecting to a local LXD over a Unix socket
DBUG[02-08|00:02:16] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
Error: Get http://unix.socket/1.0: EOF
More info:
journalctl -u lxd
Feb 08 00:13:29 JOE systemd[1]: Starting LXD - main daemon...
Feb 08 00:13:29 JOE lxd[2935]: t=2019-02-08T00:13:29-0500 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
Feb 08 00:14:55 JOE lxd[2935]: t=2019-02-08T00:14:55-0500 lvl=warn msg="Failed connecting to global database (attempt 6): failed to create dqlite connection
Feb 08 00:15:07 JOE lxd[2935]: t=2019-02-08T00:15:07-0500 lvl=warn msg="Failed connecting to global database (attempt 7): failed to create dqlite connection
Feb 08 00:15:15 JOE lxd[2935]: t=2019-02-08T00:15:15-0500 lvl=eror msg="Failed to start the daemon: no \"source\" property found for the storage pool"
Feb 08 00:15:15 JOE lxd[2935]: Error: no "source" property found for the storage pool
Feb 08 00:15:15 JOE systemd[1]: lxd.service: Main process exited, code=exited, status=1/FAILURE
Feb 08 00:23:29 JOE lxd[2936]: Error: LXD still not running after 600s timeout (Get http://unix.socket/1.0: EOF)
Feb 08 00:23:29 JOE systemd[1]: lxd.service: Start-post operation timed out. Stopping.
Feb 08 00:23:29 JOE systemd[1]: lxd.service: Control process exited, code=exited status=1
Feb 08 00:23:29 JOE systemd[1]: lxd.service: Failed with result 'exit-code'.
Feb 08 00:23:29 JOE systemd[1]: Failed to start LXD - main daemon.
Feb 08 00:23:29 JOE systemd[1]: lxd.service: Service hold-off time over, scheduling restart.
Feb 08 00:23:29 JOE systemd[1]: lxd.service: Scheduled restart job, restart counter is at 3.
Feb 08 00:23:29 JOE systemd[1]: Stopped LXD - main daemon.
Feb 08 00:23:29 JOE systemd[1]: Starting LXD - main daemon...
Feb 08 00:23:29 JOE lxd[3066]: t=2019-02-08T00:23:29-0500 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
Feb 08 00:24:55 JOE lxd[3066]: t=2019-02-08T00:24:55-0500 lvl=warn msg="Failed connecting to global database (attempt 6): failed to create dqlite connection
Feb 08 00:25:08 JOE lxd[3066]: t=2019-02-08T00:25:08-0500 lvl=warn msg="Failed connecting to global database (attempt 7): failed to create dqlite connection
Feb 08 00:25:21 JOE lxd[3066]: t=2019-02-08T00:25:21-0500 lvl=warn msg="Failed connecting to global database (attempt 8): failed to create dqlite connection
Feb 08 00:25:33 JOE lxd[3066]: t=2019-02-08T00:25:33-0500 lvl=warn msg="Failed connecting to global database (attempt 9): failed to create dqlite connection
Feb 08 00:25:46 JOE lxd[3066]: t=2019-02-08T00:25:46-0500 lvl=warn msg="Failed connecting to global database (attempt 10): failed to create dqlite connectio
Feb 08 00:25:59 JOE lxd[3066]: t=2019-02-08T00:25:59-0500 lvl=warn msg="Failed connecting to global database (attempt 11): failed to create dqlite connectio
Feb 08 00:26:01 JOE lxd[3066]: t=2019-02-08T00:26:01-0500 lvl=eror msg="Failed to start the daemon: no \"source\" property found for the storage pool"
Feb 08 00:26:01 JOE lxd[3066]: Error: no "source" property found for the storage pool
Feb 08 00:26:01 JOE systemd[1]: lxd.service: Main process exited, code=exited, status=1/FAILURE
I am pretty sure it is a socket problem caused by upgrading the 4th server, but if one server goes down, why do they all have issues? Shouldn't the others keep working? Otherwise, what is the point of the cluster?
On the 4th cluster member, the one I did the upgrade on:
lxc cluster list
Error: Get http://unix.socket/1.0: dial unix /var/lib/lxd/unix.socket: connect: connection refused
The others just time out with nothing:
lxc cluster list
Error: Get http://unix.socket/1.0: EOF
Okay, from what I can see all four nodes have the same problem: they fail to start because they don't find the "source" configuration key for the ZFS storage pool, which should be in the database but isn't.
I guess something went wrong during the upgrade, so we now have to figure out the state of the database.
Can you email me a tarball of the /var/lib/lxd/database directory of one of the nodes? Any node is fine. Also a tarball of /var/lib/lxd/database.bak, which should contain a copy of the database as it was before the upgrade.
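For reference, creating those two tarballs could look something like this. This is just a sketch: the pack_lxd_db helper name is mine, and the /var/lib/lxd default assumes a deb-based install like the one in this thread.

```shell
# Hypothetical helper: bundle both database directories for inspection.
# The /var/lib/lxd default path is an assumption, not detected.
pack_lxd_db() {
    lxd_dir=${1:-/var/lib/lxd}   # LXD data directory
    out_dir=${2:-.}              # where to write the tarballs
    tar czf "$out_dir/lxd-database.tar.gz" -C "$lxd_dir" database
    tar czf "$out_dir/lxd-database.bak.tar.gz" -C "$lxd_dir" database.bak
}
```

Using -C keeps the archive paths relative, so unpacking them elsewhere doesn't recreate the full /var/lib/lxd hierarchy.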
One thing you can safely try, even without me looking at your database, is to restore the database backup on all nodes and try to restart them all.
You would need to take the following steps:
Make sure all LXD daemons are stopped.
Make a backup of your current /var/lib/lxd/database directory (for example, name it /var/lib/lxd/database.after-upgrade). Do that on all nodes.
Run something like rm -r /var/lib/lxd/database; cp -r /var/lib/lxd/database.bak /var/lib/lxd/database on all nodes, in order to restore the database as it was before the upgrade.
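Putting those steps together, the per-node part could be sketched roughly as below. The restore_lxd_db function name is made up for illustration, the /var/lib/lxd default again assumes a deb install, and the function assumes the LXD daemon has already been stopped on the node.

```shell
# Sketch of the restore steps above. Run on every node, with the LXD
# daemon already stopped (e.g. systemctl stop lxd beforehand).
restore_lxd_db() {
    lxd_dir=${1:-/var/lib/lxd}
    # Keep the current (post-upgrade) database around for later inspection
    cp -r "$lxd_dir/database" "$lxd_dir/database.after-upgrade"
    # Replace it with the pre-upgrade backup
    rm -r "$lxd_dir/database"
    cp -r "$lxd_dir/database.bak" "$lxd_dir/database"
}
```

Copying the broken database aside before deleting it means nothing is lost if the backup turns out to be bad too.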
I just had this happen; one of my servers was on the candidate branch. Switch all your servers to candidate and refresh, probably after verifying that one of them reports 3.10 as the version in the logs.
@CyrusTheVirusG the log for this user clearly shows an issue with storage configuration, which is different from the error you get when you're out of sync.
The fact that you can't find the backup is very weird. Something odd went on during the upgrade. @stgraber any clue what could happen that would prevent the backup from being taken? I mean, on all nodes...