Need help: Recovering instances from a failed system

Ok, you’ll need to run sudo zpool import -d /var/snap/lxd/common/lxd/disks -a, after which sudo zfs list -t all should show the datasets.

This is weird, no luck.

root@cmp4rpp-h1:/var/snap/lxd/common# zpool import -d /var/snap/lxd/common/lxd/disks -a
no pools available to import

root@cmp4rpp-h1:/var/snap/lxd/common/lxd/disks# ls -lsa
total 32180508
       4 drwx------  2 root root           4096 Feb 29 17:56 .
       4 drwx--x--x 18 lxd  nogroup        4096 Jul  2 10:14 ..
32180500 -rw-------  1 root root    80000000000 Jul  1 20:14 local.img

A bit weird indeed. Can you show sudo zdb -l /var/snap/lxd/common/lxd/disks/local.img?

root@cmp4rpp-h1:/var/snap/lxd/common/lxd/disks# sudo zdb -l /var/snap/lxd/common/lxd/disks/local.img
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

That doesn’t look like zfs… Can you show sudo file /var/snap/lxd/common/lxd/disks/local.img?

Sorry for the delay. Looks like BTRFS

root@cmp4rpp-h1:/var/snap/lxd/common/lxd/disks# sudo file /var/snap/lxd/common/lxd/disks/local.img
/var/snap/lxd/common/lxd/disks/local.img: BTRFS Filesystem label "local", sectorsize 4096, nodesize 16384, leafsize 16384, UUID=53b02392-73bd-467d-b9db-5d97fcbed85f, 25118007296/80000000000 bytes used, 1 devices
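For what it’s worth, file recognises the image by its superblock magic, and the same check can be scripted by hand. A minimal sketch, under the assumption that btrfs stores the 8-byte magic "_BHRfS_M" at byte offset 65600 (superblock at 64 KiB, magic field 0x40 bytes into it):

```shell
# check_btrfs IMG: print "btrfs" if IMG carries the btrfs superblock magic.
# Assumption: the magic "_BHRfS_M" sits at byte offset 65600
# (superblock at 64 KiB, magic field 0x40 bytes into it).
check_btrfs() {
    magic=$(dd if="$1" bs=1 skip=65600 count=8 2>/dev/null)
    if [ "$magic" = "_BHRfS_M" ]; then
        echo "btrfs"
    else
        echo "not btrfs"
    fi
}

# Usage: check_btrfs /var/snap/lxd/common/lxd/disks/local.img
```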

Ok, so you can try:

  • sudo systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket
  • sudo mv /var/snap/lxd/common/lxd/database /var/snap/lxd/common/lxd/database.broken
  • sudo nsenter --mount=/run/snapd/ns/lxc.mnt mount -o loop /var/snap/lxd/common/lxd/disks/local.img /var/snap/lxd/common/lxd/storage-pools/local/
  • sudo lxc list
  • sudo lxd import NAME-OF-CONTAINER

No dice on the nsenter, any ideas?

root@cmp4rpp-h1:/var/snap/lxd/common/lxd/disks# sudo nsenter --mount=/run/snapd/ns/lxc.mnt mount -o loop /var/snap/lxd/common/lxd/disks/local.img /var/snap/lxd/common/lxd/storage-pools/local/
nsenter: cannot open /run/snapd/ns/lxc.mnt: No such file or directory

There is no lxc.mnt, but there is an lxd.mnt. Trying that.

--mount=/run/snapd/ns/lxd.mnt

That didn’t work. :frowning:

Oh yeah, that was a typo, what happened with sudo nsenter --mount=/run/snapd/ns/lxd.mnt mount -o loop /var/snap/lxd/common/lxd/disks/local.img /var/snap/lxd/common/lxd/storage-pools/local/ ?

The command executed without error, but nothing is visible in “/var/snap/lxd/common/lxd/storage-pools/local/” and the import command isn’t working.

Attempting to rerun the command returns that it’s already mounted.

Ok, so that’s a good sign.

What does sudo lxc list show? It should show an empty list.
And what does the lxd import NAME get you?

list is empty, import does not pick anything up.

root@cmp4rpp-h1:/var/snap/lxd/common/lxd/disks# lxc import WebServices
Error: open /var/snap/lxd/common/lxd/disks/WebServices: no such file or directory

I said lxd import not lxc import :slight_smile:

DERP! That worked.

Okay, now we’re back on track! I’ll get these containers back into the system, properly export them, then reload the host. Is lxc import/export still the way to go?

Yeah, lxd import will perform the disaster recovery to re-generate the DB entries; your containers should then work as normal.
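With the pool mounted, each container found under the pool can be handed to lxd import in turn. A hedged sketch; the pool path and the containers/ layout are assumptions based on the default snap paths:

```shell
# list_containers POOL: print the name of each container directory in POOL.
# Assumption: recovered containers live under POOL/containers/.
list_containers() {
    for dir in "$1"/containers/*/; do
        if [ -d "$dir" ]; then
            basename "$dir"
        fi
    done
}

# Each name can then be fed to the disaster-recovery import, e.g.:
#   list_containers /var/snap/lxd/common/lxd/storage-pools/local |
#       xargs -n1 sudo lxd import
```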

If you have multiple systems, doing a lxc copy or lxc move over the network is usually a fair bit faster than export+import but if you need an offline storage solution, then lxc export is the way to go.
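For reference, the two approaches look like this; the tarball name and the "newhost" remote are placeholders, not from the thread:

```shell
# Offline: export to a tarball, carry it over, import on the new host.
lxc export WebServices WebServices.tar.gz
lxc import WebServices.tar.gz          # run on the destination host

# Over the network (usually faster; assumes a configured "newhost" remote):
lxc copy WebServices newhost:
```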

Thank you for the help Stéphane!

For what it’s worth: I had exactly this issue. I installed LXD (via snap) on Ubuntu 20.04 (after upgrading it from 18.04). It worked at first, but when I wanted to work with some containers it gave me this error (“Database has address 0”) and processes got stuck.

I compared it against the database.pre-migration database, and it looks like the raft_nodes were not properly migrated, i.e. changed from the actual IP address and port to “0”. If I put that value back for the raft_nodes I get 404 errors.

I have ZFS; I tried doing an import of my containers, but they suddenly appeared empty.

Rather problematic; was it the upgrade and migration that happened in May?

Yet the fix was to stop lxd (via snap), reset the raft_nodes value back to the IP address and port it previously had, and start again. On a “live system” that gave an error, but stop/start worked.
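The raft_nodes fix described above can be sketched with sqlite3 against the node-local database. The path and the schema here are assumptions; stop LXD and back up the file first:

```shell
# fix_raft_address DB ADDR: restore the address column in raft_nodes.
# Assumptions: the table is raft_nodes(id, address) and the broken
# migration left address = '0'.
fix_raft_address() {
    sqlite3 "$1" "UPDATE raft_nodes SET address = '$2' WHERE address = '0';"
}

# Assumed usage (run as root, with LXD stopped and the DB backed up):
#   snap stop lxd
#   cp /var/snap/lxd/common/lxd/database/local.db /root/local.db.bak
#   fix_raft_address /var/snap/lxd/common/lxd/database/local.db 192.0.2.10:8443
#   snap start lxd
```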

I tried lxd import, but that didn’t work well for me with ZFS. I probably had to configure lxd to use ZFS first, given the database was cleared out, before doing an import? (I assumed I didn’t need nsenter.)

Glad I was able to recover, but I hope to understand what went wrong and find ways to prevent this. I kind of fear the automatic upgrades that snap is doing at the moment.