No Solution Yet - Problem after upgrade of Ubuntu 18.04 (emergency)

Tony_Anytime · February 8, 2019, 4:46am

I have a cluster of 4 servers. After running apt upgrade on one of them and rebooting it, all my servers now show a blank lxc list, and lxc cluster show says there is no cluster.
The interesting thing is that, thankfully, the containers that were already running are still working, and I can even see inside them. But of course no LXC commands work on them. lxd --debug --group lxd seems to be fine on 3 out of 4 servers. The containers that are not running are blank.
So at this point I am afraid of rebooting production servers and making the problem worse. I tried rebooting the fourth server and that does not seem to make it better.
It seems the LXD cluster database is bad or stuck. Is there a way to rebuild it? Or is this thing bound to crash and burn? Any and all help, ideas, etc. are welcome.

More info:
lxc --debug list
DBUG[02-08|00:02:16] Connecting to a local LXD over a Unix socket
DBUG[02-08|00:02:16] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
Error: Get http://unix.socket/1.0: EOF

Maybe it is a socket error.

Tony_Anytime · February 8, 2019, 5:34am

More info:
journalctl -u lxd
Feb 08 00:13:29 JOE systemd[1]: Starting LXD - main daemon...
Feb 08 00:13:29 JOE lxd[2935]: t=2019-02-08T00:13:29-0500 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
Feb 08 00:14:55 JOE lxd[2935]: t=2019-02-08T00:14:55-0500 lvl=warn msg="Failed connecting to global database (attempt 6): failed to create dqlite connection
Feb 08 00:15:07 JOE lxd[2935]: t=2019-02-08T00:15:07-0500 lvl=warn msg="Failed connecting to global database (attempt 7): failed to create dqlite connection
Feb 08 00:15:15 JOE lxd[2935]: t=2019-02-08T00:15:15-0500 lvl=eror msg="Failed to start the daemon: no "source" property found for the storage pool"
Feb 08 00:15:15 JOE lxd[2935]: Error: no "source" property found for the storage pool
Feb 08 00:15:15 JOE systemd[1]: lxd.service: Main process exited, code=exited, status=1/FAILURE
Feb 08 00:23:29 JOE lxd[2936]: Error: LXD still not running after 600s timeout (Get http://unix.socket/1.0: EOF)
Feb 08 00:23:29 JOE systemd[1]: lxd.service: Start-post operation timed out. Stopping.
Feb 08 00:23:29 JOE systemd[1]: lxd.service: Control process exited, code=exited status=1
Feb 08 00:23:29 JOE systemd[1]: lxd.service: Failed with result 'exit-code'.
Feb 08 00:23:29 JOE systemd[1]: Failed to start LXD - main daemon.
Feb 08 00:23:29 JOE systemd[1]: lxd.service: Service hold-off time over, scheduling restart.
Feb 08 00:23:29 JOE systemd[1]: lxd.service: Scheduled restart job, restart counter is at 3.
Feb 08 00:23:29 JOE systemd[1]: Stopped LXD - main daemon.
Feb 08 00:23:29 JOE systemd[1]: Starting LXD - main daemon...
Feb 08 00:23:29 JOE lxd[3066]: t=2019-02-08T00:23:29-0500 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
Feb 08 00:24:55 JOE lxd[3066]: t=2019-02-08T00:24:55-0500 lvl=warn msg="Failed connecting to global database (attempt 6): failed to create dqlite connection
Feb 08 00:25:08 JOE lxd[3066]: t=2019-02-08T00:25:08-0500 lvl=warn msg="Failed connecting to global database (attempt 7): failed to create dqlite connection
Feb 08 00:25:21 JOE lxd[3066]: t=2019-02-08T00:25:21-0500 lvl=warn msg="Failed connecting to global database (attempt 8): failed to create dqlite connection
Feb 08 00:25:33 JOE lxd[3066]: t=2019-02-08T00:25:33-0500 lvl=warn msg="Failed connecting to global database (attempt 9): failed to create dqlite connection
Feb 08 00:25:46 JOE lxd[3066]: t=2019-02-08T00:25:46-0500 lvl=warn msg="Failed connecting to global database (attempt 10): failed to create dqlite connectio
Feb 08 00:25:59 JOE lxd[3066]: t=2019-02-08T00:25:59-0500 lvl=warn msg="Failed connecting to global database (attempt 11): failed to create dqlite connectio
Feb 08 00:26:01 JOE lxd[3066]: t=2019-02-08T00:26:01-0500 lvl=eror msg="Failed to start the daemon: no "source" property found for the storage pool"
Feb 08 00:26:01 JOE lxd[3066]: Error: no "source" property found for the storage pool
Feb 08 00:26:01 JOE systemd[1]: lxd.service: Main process exited, code=exited, status=1/FAILURE
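
The journal above is dominated by repeated connection warnings, while the line that actually stops the daemon is the error-level one. A minimal sketch for pulling out only those lines, assuming the same log format as the excerpt (the `lvl=eror` tag and the `Error:` prefix); the function name `lxd_error_lines` is hypothetical:

```shell
# Print only the error-level LXD lines from journal text read on stdin.
# Matches the "lvl=eror" level tag and the plain "Error:" lines that the
# lxd process prints, as seen in the excerpt above.
lxd_error_lines() {
    grep -E 'lvl=eror|lxd\[[0-9]+\]: Error:'
}

# Usage on a node:
#   journalctl -u lxd --no-pager | lxd_error_lines
```

Running this on each node makes it easier to compare whether all of them fail for the same reason.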

Tony_Anytime · February 8, 2019, 5:40am

systemctl stop lxd lxd.socket
lxd --debug --group lxd

gives:
DBUG[02-08|00:38:35] Dqlite: server connection failed err=failed to establish network connection: Head https://64.71.77.80:8443/internal/database: dial tcp 64.71.77.80:8443: connect: connection refused address=64.71.77.80:8443 attempt=11
DBUG[02-08|00:38:35] Dqlite: connection failed err=no available dqlite leader server found attempt=11

Tony_Anytime · February 8, 2019, 5:47am

I am pretty sure it is a socket problem caused by upgrading the 4th server, but if one server goes down, why are they all having issues? Shouldn’t the others keep working? Otherwise, what is the point of the cluster?

freeekanayaka · February 8, 2019, 8:48am

Please can you run:

lxc cluster list

on all 4 nodes and paste the output?

Tony_Anytime · February 8, 2019, 12:47pm

On the 4th node of the cluster, the one I upgraded, lxc cluster list gives:
Error: Get http://unix.socket/1.0: dial unix /var/lib/lxd/unix.socket: connect: connection refused

The others just time out with nothing:
lxc cluster list
Error: Get http://unix.socket/1.0: EOF

freeekanayaka · February 8, 2019, 1:28pm

Can you please make a tarball of the logs of all 4 nodes?

Tony_Anytime · February 8, 2019, 1:38pm

All logs, as in the /var/log directory?

freeekanayaka · February 8, 2019, 1:46pm

No, just the LXD logs of all nodes.

Tony_Anytime · February 8, 2019, 1:50pm

What is the best way to get them to you?

freeekanayaka · February 8, 2019, 1:50pm

Please send me a mail, free dot ekanayaka at canonical dot com.

Tony_Anytime · February 8, 2019, 2:20pm

On its way, thanks

freeekanayaka · February 8, 2019, 3:29pm

Okay, from what I can see all four nodes have the same problem: they fail to start because they don’t find the “source” configuration key for the ZFS storage pool, which should be in the database but isn’t.

I guess something went wrong during the upgrade, so we now have to figure out what state the database is in.

Can you send me by email a tarball of the /var/lib/lxd/database directory of one of the nodes? Any node is fine. And also a tarball of /var/lib/lxd/database.bak, which should have a copy of the database as it was before the upgrade.
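
A minimal sketch of how such a tarball could be produced, assuming GNU tar and the /var/lib/lxd path mentioned above; the function name `lxd_db_tarball` and the output file name are hypothetical:

```shell
# Create a gzip-compressed tarball of a directory that lives under the
# LXD state directory.
#   $1: LXD state directory (/var/lib/lxd in this thread)
#   $2: directory name under it (database or database.bak)
#   $3: output tarball path (hypothetical, chosen by the caller)
lxd_db_tarball() {
    lxd_dir=$1
    name=$2
    out=$3
    # -C makes the paths inside the archive relative, e.g. "database/..."
    tar -czf "$out" -C "$lxd_dir" "$name"
}

# Usage, as root on one node:
#   lxd_db_tarball /var/lib/lxd database     /tmp/lxd-database.tar.gz
#   lxd_db_tarball /var/lib/lxd database.bak /tmp/lxd-database-bak.tar.gz
```

The second call fails with a tar error if database.bak does not exist on that node.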

freeekanayaka · February 8, 2019, 3:30pm

By the way, don’t reboot, because LXD won’t recover from here, so your running containers would not be restarted.

freeekanayaka · February 8, 2019, 3:38pm

One thing you can safely try, even without me looking at your database, is to restore the database backup on all nodes and try to restart them all.

You would need to take the following steps:

1. Make sure all LXD daemons are stopped.
2. Make a backup of your current /var/lib/lxd/database directory on all nodes (for example, name it /var/lib/lxd/database.after-upgrade).
3. Run something like rm -r /var/lib/lxd/database; cp -r /var/lib/lxd/database.bak /var/lib/lxd/database on all nodes, in order to restore the database as it was before the upgrade.
4. Restart the LXD daemons on all nodes.
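
The steps above can be sketched as a small script to run on each node. This is only an illustration under stated assumptions: the function name `restore_lxd_database` is hypothetical, `cp -a` is used instead of `cp -r` to preserve ownership and permissions, and the function refuses to do anything if database.bak is missing or if an earlier safety copy already exists.

```shell
# Restore the pre-upgrade LXD database on ONE node.
#   $1: LXD state directory (/var/lib/lxd in this thread)
restore_lxd_database() {
    lxd_dir=$1
    # Abort if there is no pre-upgrade copy to restore from.
    if [ ! -d "$lxd_dir/database.bak" ]; then
        echo "no database.bak in $lxd_dir, aborting" >&2
        return 1
    fi
    # Abort rather than overwrite an earlier safety copy.
    if [ -e "$lxd_dir/database.after-upgrade" ]; then
        echo "database.after-upgrade already exists, aborting" >&2
        return 1
    fi
    # Keep a copy of the current (post-upgrade) database.
    cp -a "$lxd_dir/database" "$lxd_dir/database.after-upgrade" || return 1
    # Replace the current database with the pre-upgrade backup.
    rm -r "$lxd_dir/database" || return 1
    cp -a "$lxd_dir/database.bak" "$lxd_dir/database"
}

# On each node, as root:
#   systemctl stop lxd lxd.socket       # stop the daemon first
#   restore_lxd_database /var/lib/lxd   # safety copy, then restore
#   systemctl start lxd                 # only once every node is restored
```

Because the function stops early when database.bak is absent, it leaves the current database untouched on a node where no backup was taken.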

Tony_Anytime · February 8, 2019, 4:03pm

I sent you the databases, but I can’t find the backup files anywhere.

CyrusTheVirusG · February 8, 2019, 4:03pm

I just had this happen: one of my servers was on the candidate branch. Switch all your servers to candidate and refresh, probably after verifying that one of them reports 3.10 as the version in its logs.

stgraber · February 8, 2019, 4:06pm

@CyrusTheVirusG the log for this user clearly shows an issue with storage configuration which is different than the error you get when you’re out of sync.

freeekanayaka · February 8, 2019, 4:07pm

The fact that you can’t find the backup is very weird. Something odd went on during the upgrade. @stgraber any clue what could have prevented the backup from being taken? I mean, on all nodes…

CyrusTheVirusG · February 8, 2019, 4:09pm

Are you sure? He said he upgraded one server.