Need help: Recovering instances from a failed system

Hello!

I have a server that has completely gone off the rails and LXC is not usable at all - when I give a command to LXC, the system just sits there. The logs have only yielded hundreds of instances of the error below.

t=2020-07-01T22:26:15-0700 lvl=warn msg="Dqlite: attempt 5: server 0: dial: Failed to connect to HTTP endpoint: dial tcp: address 0: missing port in address"

Considering there are other things wrong with this system and I am more inclined to re-image it, how would you recommend I export a few containers so they can be imported later?

Sounds like that system had clustering enabled?
Is it actually part of a cluster?

Could you show lxd cluster list-database?

Here you go!

I remember enabling clustering, but it is a single node. The plan was to add additional nodes soon.

wyatt@cmp4rpp-h1:~$ sudo lxd cluster list-database
+---------+
| ADDRESS |
+---------+
| 0       |
+---------+

Wow, that looks quite badly wrong.
Can you try lxd cluster recover-from-quorum-loss and see if that fixes stuff for you?
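
For reference, the sequence I'd expect to work is roughly the below - a sketch assuming the snap packaging, so the systemd unit is snap.lxd.daemon:

# Stop LXD so the database is not in use while recovering
sudo systemctl stop snap.lxd.daemon

# Rewrite the raft configuration so this node is the only database member
# (it should prompt for confirmation before doing anything)
sudo lxd cluster recover-from-quorum-loss

# Start LXD again and see whether the cluster database comes back
sudo systemctl start snap.lxd.daemon
sudo lxc cluster list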

@freeekanayaka any idea how this could have happened?

I’ve tried that, but it never completes.

I’m thinking of copying the /var/snap/lxd/common folder, snap remove lxd, snap install lxd, copy the common files back. What do you think?

That will get you into the exact same position, so no, not really something you should be doing.

Try:

  • sqlite3 /var/snap/lxd/common/lxd/database/local.db "UPDATE config SET value='127.0.0.1:8443' WHERE key IN ('core.https_address', 'cluster.https_address');"
  • sqlite3 /var/snap/lxd/common/lxd/database/local.db "UPDATE raft_nodes SET address='127.0.0.1:8443';"

Then hopefully LXD will feel like starting up again and be able to connect to itself for the cluster database.
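
A sketch of the full sequence, assuming the standard snap paths and that sqlite3 is installed (stopping the daemon first so nothing else touches local.db):

# Stop LXD so nothing else is writing to local.db
sudo systemctl stop snap.lxd.daemon

# Point the HTTPS listener and the raft node at a real address:port
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db \
  "UPDATE config SET value='127.0.0.1:8443' WHERE key IN ('core.https_address', 'cluster.https_address');"
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db \
  "UPDATE raft_nodes SET address='127.0.0.1:8443';"

# Sanity-check the result before starting LXD again
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db \
  "SELECT * FROM config; SELECT * FROM raft_nodes;"

sudo systemctl start snap.lxd.daemon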

After making those changes and restarting the server, here is what I get now.

wyatt@cmp4rpp-h1:~$ sudo lxc list
Error: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused

Try sudo systemctl start snap.lxd.daemon and then do the sudo lxc list again. systemd likely has long given up on LXD starting :slight_smile:
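
In full, something like this (the journalctl line is just where I'd look if it still refuses connections):

# Ask systemd to start the snap's LXD daemon again
sudo systemctl start snap.lxd.daemon

# Then see whether the unix socket answers
sudo lxc list

# If it still fails, the daemon output usually says why
sudo journalctl -u snap.lxd.daemon -n 50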

Yeah, that 0 address is being a bit problematic…

Ok, so different approach, what do you have on that system?
Is it just a bunch of containers or do you also have images and custom storage volumes that you care about?

And what storage backend are you using?

It's a small offsite server that has a few containers on it. I have two containers that I need to back up a few files from; after that I can scrub the whole system and be OK with that.

We are using the default setup, I believe lxcfs. We have a single “local.img” in “/var/snap/lxd/common/lxd/disks”.

Ok, can you get zfs list -t all? You may need to install zfsutils-linux if you don’t have it already installed.
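
On Ubuntu that would be roughly:

# zfsutils-linux provides the zfs/zpool userspace tools
sudo apt install zfsutils-linux

# List every pool, dataset and snapshot currently known to the kernel
sudo zfs list -t all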

wyatt@cmp4rpp-h1:/var/snap/lxd/common$ zfs list -t all
no datasets available

Ok, you’ll need to run sudo zpool import -d /var/snap/lxd/common/lxd/disks -a which should then have sudo zfs list -t all show the datasets.
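
Once the pool imports, pulling files out of the two containers should look roughly like the below; the pool name “default” and the container name are assumptions, use whatever zfs list actually shows:

# Scan the disks directory for pool labels and import anything found there
sudo zpool import -d /var/snap/lxd/common/lxd/disks -a

# Container root filesystems normally live under <pool>/containers/<name>
sudo zfs list -t all

# Mount one container's dataset somewhere convenient and copy the files out
# ("default" and "mycontainer" are placeholders; the container's filesystem
# should be under rootfs/ inside the mounted dataset)
sudo zfs set mountpoint=/mnt/recover default/containers/mycontainer
sudo zfs mount default/containers/mycontainer
sudo cp -a /mnt/recover/rootfs/home /some/backup/location/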

This is weird, no luck.

root@cmp4rpp-h1:/var/snap/lxd/common# zpool import -d /var/snap/lxd/common/lxd/disks -a
no pools available to import

root@cmp4rpp-h1:/var/snap/lxd/common/lxd/disks# ls -lsa
total 32180508
       4 drwx------  2 root root           4096 Feb 29 17:56 .
       4 drwx--x--x 18 lxd  nogroup        4096 Jul  2 10:14 ..
32180500 -rw-------  1 root root    80000000000 Jul  1 20:14 local.img

A bit weird indeed. Can you show sudo zdb -l /var/snap/lxd/common/lxd/disks/local.img?

root@cmp4rpp-h1:/var/snap/lxd/common/lxd/disks# sudo zdb -l /var/snap/lxd/common/lxd/disks/local.img
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

That doesn’t look like zfs… Can you show sudo file /var/snap/lxd/common/lxd/disks/local.img?
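
For completeness, blkid is another quick way to see whether the image carries a recognisable filesystem or partition signature:

# What does the image actually contain?
sudo file /var/snap/lxd/common/lxd/disks/local.img

# Check for any filesystem/partition signature blkid recognises
sudo blkid /var/snap/lxd/common/lxd/disks/local.img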