Need help: Recovering instances from a failed system

Hello!

I have a server that has completely gone off the rails and LXC is not usable at all - when I give a command to LXC, the system just sits there. The logs have only yielded hundreds of instances of the error below.

t=2020-07-01T22:26:15-0700 lvl=warn msg="Dqlite: attempt 5: server 0: dial: Failed to connect to HTTP endpoint: dial tcp: address 0: missing port in address"

Considering there are other things wrong with this system and I am more inclined to re-image it, how would you recommend I export a few containers so they can be imported later?

Sounds like that system had clustering enabled?
Is it actually part of a cluster?

Could you show lxd cluster list-database?

Here you go!

I remember enabling clustering, but it is a single node. The plan was to add additional nodes soon.

wyatt@cmp4rpp-h1:~$ sudo lxd cluster list-database
+---------+
| ADDRESS |
+---------+
| 0       |
+---------+

Wow, that looks quite badly wrong.
Can you try lxd cluster recover-from-quorum-loss and see if that fixes stuff for you?
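
For reference, the sequence I'd expect to work is roughly the below - a sketch assuming the snap packaging, so the systemd unit is snap.lxd.daemon:

# Stop LXD so the database is not in use while recovering
sudo systemctl stop snap.lxd.daemon

# Rewrite the raft configuration so this node is the only database member
# (it should prompt for confirmation before doing anything)
sudo lxd cluster recover-from-quorum-loss

# Start LXD again and see whether the cluster database comes back
sudo systemctl start snap.lxd.daemon
sudo lxc cluster list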

@freeekanayaka any idea how this could have happened?

I’ve tried that, but it never completes.

I’m thinking of copying the /var/snap/lxd/common folder, snap remove lxd, snap install lxd, copy the common files back. What do you think?

That will get you into the exact same position, so no, not really something you should be doing.

Try:

  • sqlite3 /var/snap/lxd/common/lxd/database/local.db "UPDATE config SET value='127.0.0.1:8443' WHERE key IN ('core.https_address', 'cluster.https_address');"
  • sqlite3 /var/snap/lxd/common/lxd/database/local.db "UPDATE raft_nodes SET address='127.0.0.1:8443';"

Then hopefully LXD will feel like starting up again and be able to connect to itself for the cluster database.
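
A sketch of the full sequence, assuming the standard snap paths and that sqlite3 is installed (stopping the daemon first so nothing else touches local.db):

# Stop LXD so nothing else is writing to local.db
sudo systemctl stop snap.lxd.daemon

# Point the HTTPS listener and the raft node at a real address:port
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db \
  "UPDATE config SET value='127.0.0.1:8443' WHERE key IN ('core.https_address', 'cluster.https_address');"
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db \
  "UPDATE raft_nodes SET address='127.0.0.1:8443';"

# Sanity-check the result before starting LXD again
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db \
  "SELECT * FROM config; SELECT * FROM raft_nodes;"

sudo systemctl start snap.lxd.daemon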

After making those changes and restarting the server, here is what I get now.

wyatt@cmp4rpp-h1:~$ sudo lxc list
Error: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused

Try sudo systemctl start snap.lxd.daemon and then do the sudo lxc list again. systemd likely has long given up on LXD starting :slight_smile:
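
In full, something like this (the journalctl line is just where I'd look if it still refuses connections):

# Ask systemd to start the snap's LXD daemon again
sudo systemctl start snap.lxd.daemon

# Then see whether the unix socket answers
sudo lxc list

# If it still fails, the daemon output usually says why
sudo journalctl -u snap.lxd.daemon -n 50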

Yeah, that 0 address is being a bit problematic…

Ok, so different approach, what do you have on that system?
Is it just a bunch of containers or do you also have images and custom storage volumes that you care about?

And what storage backend are you using?

It's a small offsite server that has a few containers on it. I have two containers that I need to back up a few files from; after that I can scrub the whole system and be OK with that.

We are using the default setup, I believe lxcfs. We have a single “local.img” in “/var/snap/lxd/common/lxd/disks”.

Ok, can you get zfs list -t all? You may need to install zfsutils-linux if you don’t have it already installed.
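
On Ubuntu that would be roughly:

# zfsutils-linux provides the zfs/zpool userspace tools
sudo apt install zfsutils-linux

# List every pool, dataset and snapshot currently known to the kernel
sudo zfs list -t all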

wyatt@cmp4rpp-h1:/var/snap/lxd/common$ zfs list -t all
no datasets available

Ok, you’ll need to run sudo zpool import -d /var/snap/lxd/common/lxd/disks -a which should then have sudo zfs list -t all show the datasets.
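
Once the pool imports, pulling files out of the two containers should look roughly like the below; the pool name “default” and the container name are assumptions, use whatever zfs list actually shows:

# Scan the disks directory for pool labels and import anything found there
sudo zpool import -d /var/snap/lxd/common/lxd/disks -a

# Container root filesystems normally live under <pool>/containers/<name>
sudo zfs list -t all

# Mount one container's dataset somewhere convenient and copy the files out
# ("default" and "mycontainer" are placeholders; the container's filesystem
# should be under rootfs/ inside the mounted dataset)
sudo zfs set mountpoint=/mnt/recover default/containers/mycontainer
sudo zfs mount default/containers/mycontainer
sudo cp -a /mnt/recover/rootfs/home /some/backup/location/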

This is weird, no luck.

root@cmp4rpp-h1:/var/snap/lxd/common# zpool import -d /var/snap/lxd/common/lxd/disks -a
no pools available to import

root@cmp4rpp-h1:/var/snap/lxd/common/lxd/disks# ls -lsa
total 32180508
       4 drwx------  2 root root           4096 Feb 29 17:56 .
       4 drwx--x--x 18 lxd  nogroup        4096 Jul  2 10:14 ..
32180500 -rw-------  1 root root    80000000000 Jul  1 20:14 local.img

A bit weird indeed. Can you show sudo zdb -l /var/snap/lxd/common/lxd/disks/local.img?

root@cmp4rpp-h1:/var/snap/lxd/common/lxd/disks# sudo zdb -l /var/snap/lxd/common/lxd/disks/local.img
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

That doesn’t look like zfs… Can you show sudo file /var/snap/lxd/common/lxd/disks/local.img?
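
For completeness, blkid is another quick way to see whether the image carries a recognisable filesystem or partition signature:

# What does the image actually contain?
sudo file /var/snap/lxd/common/lxd/disks/local.img

# Check for any filesystem/partition signature blkid recognises
sudo blkid /var/snap/lxd/common/lxd/disks/local.img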