Recovery from Errant lxd init

We have an LXD host that finally got upgraded from 16.04 to 18.04. In the process of this upgrade, we realized that the old apt version of LXD was still present. I was able to uninstall the apt packages and refresh the snap package, and now the client tools and such seem to be fine.

Unfortunately, I did not discover this was the issue until after I had re-run lxd init. Now I have an intact zpool but an LXD installation that knows nothing of its previous self.
lxd recover fails with: Error: Failed validation request: Failed checking volumes on pool "default": Instance "containername" in project "default" has a different instance type in its backup file (""), and the ZFS mountpoint for that container shows legacy. If I temporarily set the container dataset's mountpoint to a known location and zfs mount it, I am able to view backup.yaml.
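For reference, the temporary mount was roughly this (the dataset name and mountpoint are placeholders for the real ones):

# give the dataset a real mountpoint so backup.yaml can be read
zfs set mountpoint=/mnt/recover tank/lxd/containers/containername
zfs mount tank/lxd/containers/containername
cat /mnt/recover/backup.yaml
# set it back to legacy afterwards so the layout matches what LXD expects
zfs set mountpoint=legacy tank/lxd/containers/containername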

I am pretty sure my re-execution of lxd init was a bad idea, but I would like to know how I can recover to a working state. Thanks.

The problem is that you should have run lxd.migrate while you had both the deb and the snap; that would have seamlessly transferred the data over.

The fact that you ran lxd init on the new install doesn’t really change anything, it’s just a clean empty LXD.

I think we could make lxd recover a bit smarter so that if no type is set, we assume it is a container. @tomp

In the meantime, you could edit backup.yaml by hand to add type: container inside the container section. That may be enough to make lxd recover happy.
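Roughly along these lines, as an untested sketch (it assumes the top-level container: section in backup.yaml and the Go yq v4 syntax, so check the file layout first):

# with the dataset temporarily mounted at /mnt/recover as above
cp /mnt/recover/backup.yaml /mnt/recover/backup.yaml.orig
# add the missing type field under the container section (key name assumed)
yq -i '.container.type = "container"' /mnt/recover/backup.yaml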

Thank you, Stéphane. That seems to get me past the first pass, but now lxd recover fails with: snapshot inconsistency: Snapshot count in backup config and storage device are different: Backup snapshots mismatch.

I have removed the snapshot data in backup.yaml and can see the snapshots on the filesystem. I do not care about recovering the snapshots on this or any other container. Can I just delete the snapshots with zfs?

Yeah, that should be fine, it just wants the two to line up.
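Something like this, with the dataset name as a placeholder (double-check the list before destroying anything):

# list the snapshots of the container dataset
zfs list -H -t snapshot -o name -r tank/lxd/containers/containername
# then destroy them (this is irreversible)
zfs list -H -t snapshot -o name -r tank/lxd/containers/containername | xargs -n1 zfs destroy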

With snapshots cleaned up I am seeing a new error.

$ sudo lxd recover
This LXD server currently has the following storage pools:
Would you like to recover another storage pool? (yes/no) [default=no]: yes
Name of the storage pool: default
Name of the storage backend (cephfs, dir, lvm, zfs, ceph, btrfs): zfs
Source of the storage pool (block device, volume group, dataset, path, ... as applicable): tank/lxd
Additional storage pool configuration property (KEY=VALUE, empty when done): zfs.pool_name=tank/lxd
Additional storage pool configuration property (KEY=VALUE, empty when done):
Would you like to recover another storage pool? (yes/no) [default=no]:
The recovery process will be scanning the following storage pools:
 - NEW: "default" (backend="zfs", source="tank/lxd")
Would you like to continue with scanning for lost volumes? (yes/no) [default=yes]:
Scanning for unknown volumes...
Error: Failed validation request: Post "http://unix.socket/internal/recover/validate": EOF

Hmm, that suggests LXD crashed…

Can you check journalctl -u snap.lxd.daemon -n 300 for some kind of stack trace or error?

Not a single syslog entry for that unit when running recover, and I get the same result. I did snap restart lxd beforehand just to be sure. There are a few entries from the zed daemon during the recover run, but that’s it (other than the sudo entries, of course).

To close the loop on this, I was never able to get recover to work with the existing storage pool. I ended up removing/installing/initializing LXD with a new dataset. A zfs send/receive of the individual container datasets from the old to the new pool allowed lxd recover to run correctly.
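Roughly, each transfer looked like this (old and new pool names are placeholders, and the containers parent dataset must already exist on the new pool):

zfs snapshot tank/lxd/containers/containername@migrate
zfs send tank/lxd/containers/containername@migrate | zfs receive newtank/lxd/containers/containername
# drop the transfer snapshot on the new pool so the snapshot count still matches backup.yaml
zfs destroy newtank/lxd/containers/containername@migrate
sudo lxd recover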

I’ve just had the same problem after upgrading from Ubuntu 18.04 to a release with the snap-only LXD. snap refresh lxd --channel=5.8/stable pulled in the workarounds that @tomp is referring to, and now I’m stuck on the “backup snapshots mismatch” error.

Would it be possible to add a switch to ignore the missing snapshots / adjust backup.yaml? I think @mgaboury and I are not the only two people in the world who run do-release-upgrade on their LXD hosts, and it could help a lot of people upgrade without half a day of downtime.

Here’s a script that removes the snapshots from the LXD configuration (backup.yaml). It requires yq.

# Clear the snapshot list in each container's backup.yaml so lxd recover
# no longer fails on the snapshot count mismatch. Keeps a copy of the
# original file as backup.yaml-rmsnaps.
POOL_DIR=/var/lib/lxd/storage-pools/tank/containers
for CT in $(zfs list -r -d 1 -H -o name tank/containers | tail -n +2 | rev | cut -d/ -f 1 | rev); do
        # Temporarily mount the container dataset at a known location
        zfs set mountpoint="$POOL_DIR/$CT" "tank/containers/$CT" && zfs mount "tank/containers/$CT"

        # Back up the original file, then empty the snapshot list
        cp "$POOL_DIR/$CT/backup.yaml" "$POOL_DIR/$CT/backup.yaml-rmsnaps"
        yq -i '.snapshots=[]' "$POOL_DIR/$CT/backup.yaml"

        umount "$POOL_DIR/$CT"
done

After doing this, I had to remove any external disks (mounted host paths) from the containers, because the LXD snap doesn’t allow access to host disks, and I was finally able to complete the recovery.

Overall the process was very Windows-like: lxd recover is only an interactive utility, and I think I typed the same things around 20 times just to see the next error.

Probably typing into the void, but here’s my wishlist:

  • When do-release-upgrade moves LXD from the deb to the snap, it would be nice if it imported the existing containers
  • lxd recover should support command-line options, so the user can run another retry without typing the same over and over again interactively
  • lxd recover should ask for the ZFS pool name. Having to guess that I needed to type an option named zfs.pool_name=tank after answering tank, zfs, tank to the first three questions was puzzling.
  • There should be an option to proceed with an import in presence of failed checks, such as missing snapshots or inaccessible volumes. I’d much rather have the containers imported and 1 or 2 containers not starting because of the problems, than have nothing imported because one container failed.
  • I miss the functionality of lxd import - just importing one container. It gave so much flexibility. I’m still on lxd 3.0 on my production hosts and am using syncoid + lxd import as a simple failover mechanism. It seems there’s no way to do it with lxd 4 or 5.

I still have to figure out how to mount the shared host directories; the trick of prepending the snap hostfs path didn’t work. But at least I have the majority of the containers working now.

As @stgraber mentioned above:

You should always run sudo lxd.migrate when moving from the deb to the snap package. This would have moved your instances over to the snap and would not have required lxd recover.

lxd recover is for disaster recovery, when the LXD database has been destroyed, rather than for migrating existing instances into a new LXD installation. This is why it’s not designed for routine use, and it is purposefully very conservative about making automated changes or ignoring problems on the storage pools, to ensure we don’t end up with a new database that also contains inconsistencies with the storage layer.

The nearest thing I can suggest that is storage-driver agnostic is a periodic lxc copy <instance> <target>:<instance> --refresh, which will maintain a copy of the instance ready to be brought online if needed.
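For example, scheduled from cron (the backup: remote and the web01 instance name are hypothetical):

# /etc/cron.d/lxd-standby: refresh the standby copy twice a day
0 3,15 * * * root /snap/bin/lxc copy web01 backup:web01 --refresh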

Thanks for the tip about lxc copy --refresh.

It looks like it’s using rsync, so doing it every 10 minutes would put a lot of strain on the storage. Quite possibly it wouldn’t even finish running in 10 minutes. Maybe a daily or twice-daily sync would be possible. The zfs send/recv method adds almost no additional load to the storage.

Skimming the source code for lxd recover, I noticed that at the end it calls the import API endpoint. Would the import endpoint work like the previous lxd import command? Maybe I could write a thin API wrapper that would restore that great functionality.

The lxd recover command uses LXD’s API, so yes you can talk to the API directly.
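For example, a quick check against the snap’s unix socket (the recover endpoints live under /internal; their request bodies are not shown here):

# query the API root over the local unix socket
curl -s --unix-socket /var/snap/lxd/common/lxd/unix.socket lxd/1.0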

It uses ZFS send/recv for ZFS->ZFS pools and takes into account snapshots to only transfer differences.


I had no idea about this - I’m testing it now and it looks like a viable alternative. Thank you!