Getting "lxd recover" to work

candlerb · September 12, 2021, 5:12pm

I have an lxd host whose SSD failed and which I’ve now replaced. The system partition is non-zfs, but the lxd storage is on zfs (dataset “zfs/lxd”). I had an up-to-date replica of the zfs dataset on another host, so I’ve copied it back onto the rebuilt host (using syncoid). I also have an up-to-date copy of the output of lxd init --dump.

The machine was originally running Ubuntu 18.04 + lxd snap 4.18. The rebuilt machine is running Ubuntu 20.04, also with lxd snap 4.18.

My problem is how to get lxd to recognise this existing storage dataset.

If I run “lxd init --preseed <lxd.dump” (where lxd.dump is an “lxd init --dump” taken previously), it tells me it can’t use the storage because it’s not empty.

Fair enough, I edited the dump file to set storage_pools: []:

...
storage_pools: []
#- config:
#    source: zfs/lxd
#    zfs.pool_name: zfs/lxd
#  description: ""
#  name: default
#  driver: zfs
...

lxd init --preseed is happy with that.

Now I need to get the storage back. However, I can’t find a way to make it work.

root@nuc2:~# lxd recover
This LXD server currently has the following storage pools:
Would you like to recover another storage pool? (yes/no) [default=no]: yes
Name of the storage pool: default
Name of the storage backend (ceph, btrfs, cephfs, dir, lvm, zfs): zfs
Source of the storage pool (block device, volume group, dataset, path, ... as applicable): zfs/lxd
Additional storage pool configuration property (KEY=VALUE, empty when done):
Would you like to recover another storage pool? (yes/no) [default=no]:
The recovery process will be scanning the following storage pools:
 - NEW: "default" (backend="zfs", source="zfs/lxd")
Would you like to continue with scanning for lost volumes? (yes/no) [default=yes]:
Scanning for unknown volumes...
Error: Failed validation request: Failed mounting pool "default": Cannot mount pool as "zfs.pool_name" is not specified

OK, let me set that attribute:

root@nuc2:~# lxd recover
This LXD server currently has the following storage pools:
Would you like to recover another storage pool? (yes/no) [default=no]: yes
Name of the storage pool: default
Name of the storage backend (ceph, btrfs, cephfs, dir, lvm, zfs): zfs
Source of the storage pool (block device, volume group, dataset, path, ... as applicable): zfs/lxd
Additional storage pool configuration property (KEY=VALUE, empty when done): zfs.pool_name=zfs/lxd
Additional storage pool configuration property (KEY=VALUE, empty when done):
Would you like to recover another storage pool? (yes/no) [default=no]:
The recovery process will be scanning the following storage pools:
 - NEW: "default" (backend="zfs", source="zfs/lxd")
Would you like to continue with scanning for lost volumes? (yes/no) [default=yes]:
Scanning for unknown volumes...
No unknown volumes found. Nothing to do.

Now, the dataset does exist, but is mounted at the default location:

root@nuc2:~# zfs list -r zfs/lxd
NAME                              USED  AVAIL     REFER  MOUNTPOINT
zfs/lxd                          66.0G   149G       96K  /zfs/lxd
zfs/lxd/containers               66.0G   149G      112K  /zfs/lxd/containers
zfs/lxd/containers/apt-cacher    9.12G   149G     5.20G  /zfs/lxd/containers/apt-cacher
zfs/lxd/containers/cache2        5.50G   149G     2.93G  /zfs/lxd/containers/cache2
... etc

I remounted it - this gave some errors but appears to have worked:

root@nuc2:~# zfs set mountpoint=/var/snap/lxd/common/lxd/storage-pools/default zfs/lxd
cannot mount '/var/snap/lxd/common/lxd/storage-pools/default': directory is not empty
property may be set but unable to remount filesystem
root@nuc2:~# ls /var/snap/lxd/common/lxd/storage-pools/default/containers
apt-cacher  cache2   ...etc
root@nuc2:~# zfs list -r zfs/lxd
NAME                              USED  AVAIL     REFER  MOUNTPOINT
zfs/lxd                          66.0G   149G       96K  /var/snap/lxd/common/lxd/storage-pools/default
zfs/lxd/containers               66.0G   149G      112K  /var/snap/lxd/common/lxd/storage-pools/default/containers
zfs/lxd/containers/apt-cacher    9.12G   149G     5.20G  /var/snap/lxd/common/lxd/storage-pools/default/containers/apt-cacher
zfs/lxd/containers/cache2        5.50G   149G     2.93G  /var/snap/lxd/common/lxd/storage-pools/default/containers/cache2
...

Still no effect though:

root@nuc2:~# lxd recover
This LXD server currently has the following storage pools:
Would you like to recover another storage pool? (yes/no) [default=no]: yes
Name of the storage pool: default
Name of the storage backend (ceph, btrfs, cephfs, dir, lvm, zfs): zfs
Source of the storage pool (block device, volume group, dataset, path, ... as applicable): zfs/lxd
Additional storage pool configuration property (KEY=VALUE, empty when done): zfs.pool_name=zfs/lxd
Additional storage pool configuration property (KEY=VALUE, empty when done):
Would you like to recover another storage pool? (yes/no) [default=no]: no
The recovery process will be scanning the following storage pools:
 - NEW: "default" (backend="zfs", source="zfs/lxd")
Would you like to continue with scanning for lost volumes? (yes/no) [default=yes]:
Scanning for unknown volumes...
No unknown volumes found. Nothing to do.

I’m now rather stuck. I think I am forced to discard the entire dataset, use “lxd init”, re-replicate the contents of the dataset into the target area, and then try “lxd recover”. But for future reference I’d really like to understand how lxd recover is supposed to be used, and whether it can be used with a pre-existing dataset created by lxd.

Thanks,

Brian.

stgraber · September 12, 2021, 6:35pm

There’s a bug in lxd recover for ZFS pools like yours which I fixed last week.
The fixed LXD is currently in the candidate snap channel and will roll out to stable users tomorrow.

candlerb · September 12, 2021, 7:43pm

Thank you.

Before I received that mail, I tried using the other approach:

destroy the zfs dataset
lxd init --preseed with the original config including storage pools
destroy the zfs dataset again (to get rid of the skeleton created by lxd init)
recreate the dataset by replicating from the backup
lxd recover

This didn’t work either (“No unknown volumes found. Nothing to do.”), so following your message I updated to the latest/candidate channel:

root@nuc2:~# snap refresh  --channel=latest/candidate lxd
lxd (candidate) 4.18 from Canonical✓ refreshed
root@nuc2:~# lxd recover
This LXD server currently has the following storage pools:
 - default (backend="zfs", source="zfs/lxd")
Would you like to recover another storage pool? (yes/no) [default=no]:
The recovery process will be scanning the following storage pools:
 - EXISTING: "default" (backend="zfs", source="zfs/lxd")
Would you like to continue with scanning for lost volumes? (yes/no) [default=yes]:
Scanning for unknown volumes...
Error: Failed validation request: Failed checking volumes on pool "default": Failed to run: zfs mount zfs/lxd/containers/nsot: cannot mount 'zfs/lxd/containers/nsot': filesystem already mounted

root@nuc2:~# zfs list -r -o name,mounted,canmount,mountpoint zfs/lxd
NAME                             MOUNTED  CANMOUNT  MOUNTPOINT
zfs/lxd                              yes        on  /zfs/lxd
zfs/lxd/containers                   yes        on  /zfs/lxd/containers
zfs/lxd/containers/apt-cacher        yes        on  /zfs/lxd/containers/apt-cacher
zfs/lxd/containers/cache2            yes        on  /zfs/lxd/containers/cache2
...
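
The listing above can be reduced to just the datasets that are still mounted, which are the ones that produce the “filesystem already mounted” error. A minimal sketch, assuming the tab-separated output of zfs list -H -o name,mounted; the function name mounted_datasets is made up for illustration:

```shell
#!/bin/sh
# Print the names of datasets whose "mounted" column is "yes".
# Reads `zfs list -H -o name,mounted` output (tab-separated, no header) on stdin.
# mounted_datasets is a hypothetical helper name, not part of zfs or lxd.
mounted_datasets() {
    awk -F '\t' '$2 == "yes" { print $1 }'
}

# Possible use on the host (not run here):
#   zfs list -H -r -o name,mounted zfs/lxd | mounted_datasets
```

An empty result would suggest nothing under the pool is mounted any more.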

All right, so then I unmount and try again:

root@nuc2:~# zfs unmount zfs/lxd
root@nuc2:~# zfs list -r -o name,mounted,canmount,mountpoint zfs/lxd
NAME                             MOUNTED  CANMOUNT  MOUNTPOINT
zfs/lxd                               no        on  /zfs/lxd
zfs/lxd/containers                    no        on  /zfs/lxd/containers
zfs/lxd/containers/apt-cacher         no        on  /zfs/lxd/containers/apt-cacher
zfs/lxd/containers/cache2             no        on  /zfs/lxd/containers/cache2
...
root@nuc2:~# lxd recover
This LXD server currently has the following storage pools:
 - default (backend="zfs", source="zfs/lxd")
 - plain (backend="dir", source="/var/lib/snapd/hostfs/data/lxd")
Would you like to recover another storage pool? (yes/no) [default=no]:
The recovery process will be scanning the following storage pools:
 - EXISTING: "default" (backend="zfs", source="zfs/lxd")
 - EXISTING: "plain" (backend="dir", source="/var/lib/snapd/hostfs/data/lxd")
Would you like to continue with scanning for lost volumes? (yes/no) [default=yes]:
Scanning for unknown volumes...
Error: Failed validation request: Failed checking volumes on pool "default": Failed to run: zfs mount zfs/lxd/containers/netbox-old: cannot mount '/zfs/lxd/containers/netbox-old': failed to create mountpoint

Unfortunately the message doesn’t say exactly what mountpoint it’s failing to create or what error occurred.

So I attached strace to the lxd parent, and I see:

[pid  3713] execve("/snap/lxd/current/zfs-0.8/bin/zfs", ["zfs", "get", "-H", "-o", "name", "name", "zfs/lxd"], 0xc000162c80 /* 38 vars */ <unfinished ...>
[pid  3713] <... execve resumed>)       = 0
...
[pid  3714] execve("/snap/lxd/current/zfs-0.8/bin/zfs", ["zfs", "list", "-H", "-o", "name,type", "-r", "-t", "filesystem,volume", "zfs/lxd"], 0xc000163040 /* 38 vars */ <unfinished ...>
[pid  3714] <... execve resumed>)       = 0
...
[pid  3715] execve("/snap/lxd/current/zfs-0.8/bin/zfs", ["zfs", "mount", "zfs/lxd/containers/ns-auth"], 0xc0003d8640 /* 38 vars */ <unfinished ...>
[pid  3715] <... execve resumed>)       = 0
...
[pid  3715] lstat("/zfs/lxd/containers/ns-auth", 0x7ffc2e10e250) = -1 ENOENT (No such file or directory)
[pid  3715] mkdir("/zfs/lxd/containers/ns-auth", 0755) = -1 ENOENT (No such file or directory)
[pid  3715] access("/zfs/lxd/containers", F_OK) = -1 ENOENT (No such file or directory)
[pid  3715] access("/zfs/lxd", F_OK)    = -1 ENOENT (No such file or directory)
[pid  3715] access("/zfs", F_OK)        = -1 ENOENT (No such file or directory)
[pid  3715] mkdir("/zfs", 0755)         = -1 EROFS (Read-only file system)
[pid  3715] write(2, "cannot mount '/zfs/lxd/containers/ns-auth': failed to create mountpoint\n", 72) = 72
...
[pid  3715] +++ exited with 1 +++

(Aside: each time I try it, it halts on an apparently random container. But if I do it enough times, I see the same container tried again.)

So that error message is coming directly from “zfs mount”, but this is weird. “/zfs” is the top-level mount point for the zpool:

root@nuc2:~# zpool status
  pool: zfs
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	zfs         ONLINE       0     0     0
	  zpool     ONLINE       0     0     0

errors: No known data errors
root@nuc2:~# zfs list zfs
NAME   USED  AVAIL     REFER  MOUNTPOINT
zfs   66.0G   149G      104K  /zfs

Clearly “zfs mount” thinks that “/zfs” doesn’t exist, so it tries to mkdir /zfs, which fails because the filesystem is read-only. My guess is that this has something to do with snap containerization and namespaces/chroot.
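
One way to test that guess is to look at the path from inside the snap's mount namespace. The following is only a sketch under assumptions: it assumes snapd keeps the namespace at /run/snapd/ns/<snap>.mnt, it needs root, and path_in_snap_ns is a hypothetical helper name.

```shell
#!/bin/sh
# Check whether a path is visible inside a snap's mount namespace.
# Assumption: snapd keeps the namespace at /run/snapd/ns/<snap>.mnt.
# path_in_snap_ns is a hypothetical helper name; entering the namespace needs root.
path_in_snap_ns() {
    ns_file=$1   # e.g. /run/snapd/ns/lxd.mnt
    path=$2      # e.g. /zfs
    if [ ! -e "$ns_file" ]; then
        echo "namespace file $ns_file not found" >&2
        return 2
    fi
    # nsenter runs `test` with the snap's view of the filesystem.
    nsenter --mount="$ns_file" test -e "$path"
}

# Possible use on the host, as root (not run here):
#   path_in_snap_ns /run/snapd/ns/lxd.mnt /zfs && echo visible || echo "not visible"
```

If the guess is right, /zfs would not be visible there, while the same directory under /var/lib/snapd/hostfs (the prefix that appears in the “plain” pool's source above) might be.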

Note that if I use “zfs mount” on the host directly, with the exact same command line, it works just fine:

root@nuc2:~# zfs mount zfs/lxd/containers/ns-auth
root@nuc2:~# zfs umount zfs/lxd/containers/ns-auth
root@nuc2:~# /snap/lxd/current/zfs-0.8/bin/zfs mount zfs/lxd/containers/ns-auth
root@nuc2:~# /snap/lxd/current/zfs-0.8/bin/zfs unmount zfs/lxd/containers/ns-auth
root@nuc2:~#

root@nuc2:~# zfs list -o name,mounted,mountpoint -r zfs/lxd
NAME                             MOUNTED  MOUNTPOINT
zfs/lxd                               no  /zfs/lxd
zfs/lxd/containers                    no  /zfs/lxd/containers
zfs/lxd/containers/apt-cacher         no  /zfs/lxd/containers/apt-cacher
zfs/lxd/containers/cache2             no  /zfs/lxd/containers/cache2
...

I’ve come across various weird problems with snaps before, and I really don’t enjoy trying to debug them.

In this case, I’m definitely stumped. However, I wonder whether it might work if I change the zfs filesystem mountpoint to /var/snap/lxd/<something> or /snap/lxd/<something> before running recover.

candlerb · September 12, 2021, 8:00pm

Checking what the mountpoint was on a different host, I did:

root@nuc2:~# zfs set mountpoint=/var/snap/lxd/common/lxd/storage-pools/default zfs/lxd
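
To see which datasets still point somewhere else, the mountpoint column can be compared against the expected directory. A minimal sketch, assuming the snap keeps pools under /var/snap/lxd/common/lxd/storage-pools/<pool> as in the command above; misplaced_datasets is a hypothetical name:

```shell
#!/bin/sh
# List datasets whose mountpoint is not under the directory expected for a pool.
# Reads `zfs list -H -o name,mountpoint` output (tab-separated, no header) on stdin.
# Assumption: pools live under /var/snap/lxd/common/lxd/storage-pools/<pool>.
# misplaced_datasets is a hypothetical helper name.
# Datasets with mountpoint "none" or "legacy" are reported too.
misplaced_datasets() {
    pool=$1
    root="/var/snap/lxd/common/lxd/storage-pools/$pool"
    awk -F '\t' -v root="$root" '
        $2 != root && index($2, root "/") != 1 { print $1 }
    '
}

# Possible use on the host (not run here):
#   zfs list -H -r -o name,mountpoint zfs/lxd | misplaced_datasets default
```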

Definitely getting closer:

Scanning for unknown volumes...
Error: Failed validation request: Failed checking volumes on pool "default": Instance "proxmox2.old" in project "default" has a different instance name in its backup file ("proxmox2")

OK, I know what’s going on here. I intentionally left some previous versions of containers on my backup server (renaming them, prior to rebuilding those containers). I’ve now fixed this by deleting the local datasets - although in my opinion it would be much better if lxd recover could restore what it could, and skip over any containers which it found to be invalid.

Try again:

Scanning for unknown volumes...
Error: Failed validation request: Failed checking volumes on pool "default": Instance "etcd2" in project "default" has a different instance type in its backup file ("")

Mounting it, I can’t see what it means by “instance type” in etcd2/backup.yaml. But as this is a container I don’t need, for now it gets the zfs destroy -r treatment too. (I can always resync from my backup.)

Next attempt gives a different error:

...
Scanning for unknown volumes...
Error: Failed validation request: Failed checking volumes on pool "default": Failed parsing backup file "/var/snap/lxd/common/lxd/storage-pools/default/containers/nsot/backup.yaml": open /var/snap/lxd/common/lxd/storage-pools/default/containers/nsot/backup.yaml: no such file or directory

It’s quite right, there’s no backup.yaml there:

root@nuc2:~# zfs mount zfs/lxd/containers/nsot
root@nuc2:~# ls /var/snap/lxd/common/lxd/storage-pools/default/containers/nsot/
metadata.yaml  rootfs  templates
root@nuc2:~# zfs unmount zfs/lxd/containers/nsot

Maybe this container was made with an older version of lxd and never run with a newer one. Zap again.
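
Since recover stops at the first problem it meets, it may save some round trips to look for missing backup.yaml files up front. A rough sketch, assuming each instance's dataset is mounted at <containers_dir>/<name> when the check runs (an unmounted dataset leaves an empty directory and would be reported as well); missing_backup_yaml is a hypothetical helper name:

```shell
#!/bin/sh
# Report instance directories that have no backup.yaml.
# Assumption: each instance lives in <containers_dir>/<name>/ and its dataset
# is mounted when this runs. missing_backup_yaml is a hypothetical helper name.
missing_backup_yaml() {
    containers_dir=$1
    for dir in "$containers_dir"/*/; do
        [ -d "$dir" ] || continue        # the glob matched nothing
        if [ ! -f "${dir}backup.yaml" ]; then
            basename "$dir"
        fi
    done
}

# Possible use on the host (not run here):
#   missing_backup_yaml /var/snap/lxd/common/lxd/storage-pools/default/containers
```

This only checks that the file exists; it does not look for the name or type mismatches seen earlier.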

And finally: the remaining volumes are recovered successfully, and I can start the containers. Phew!

That was rather harder work than I was expecting. Maybe in future I’ll just restore /var/snap/lxd/common/lxd, although I was hoping not to have to do that.

Regards,

Brian.

tomp · September 13, 2021, 9:22am

For future reference the bug was fixed in:

https://github.com/lxc/lxd/pull/9208