Help with failed and botched lxd.migrate attempt (LXD 3.0.3 deb to LXD 4.0.9 snap)

I honestly don’t remember the name of the lxd pool. I set this up in 2017 and haven’t fooled with storage pools since, although I’ve created and destroyed containers in that time. I was guessing that it might be “default”, since that seems to be how LXD works these days.

However, according to the backup.yaml below, I apparently named the lxd pool “pogo1”. So I’ll try that.

container:
  architecture: x86_64
  config:
    image.architecture: amd64
    image.description: ubuntu 18.04 LTS amd64 (release) (20190424)
    image.label: release
    image.os: ubuntu
    image.release: bionic
    image.serial: "20190424"
    image.version: "18.04"
    volatile.base_image: 5b72cf46f628b3d60f5d99af48633539b2916993c80fc5a2323d7d841f66afbe
    volatile.eth0.hwaddr: 00:16:3e:b0:b3:3a
    volatile.idmap.base: "0"
    volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":165535}]'
    volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":165535}]'
    volatile.last_state.power: RUNNING
  devices: {}
  ephemeral: false
  profiles:
  - default
  stateful: false
  description: ""
  created_at: 2019-05-12T20:02:45-04:00
  expanded_config:
    environment.http_proxy: ""
    image.architecture: amd64
    image.description: ubuntu 18.04 LTS amd64 (release) (20190424)
    image.label: release
    image.os: ubuntu
    image.release: bionic
    image.serial: "20190424"
    image.version: "18.04"
    user.network_mode: ""
    volatile.base_image: 5b72cf46f628b3d60f5d99af48633539b2916993c80fc5a2323d7d841f66afbe
    volatile.eth0.hwaddr: 00:16:3e:b0:b3:3a
    volatile.idmap.base: "0"
    volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":165535}]'
    volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":165535}]'
    volatile.last_state.power: RUNNING
  expanded_devices:
    eth0:
      name: eth0
      nictype: bridged
      parent: br0
      type: nic
    root:
      path: /
      pool: pogo1
      type: disk
  name: dolibar
  status: Stopped
  status_code: 102
  last_used_at: 2019-05-12T20:03:14.604094777-04:00
  location: ""
snapshots: []
pool:
  config:
    source: pogo1
    zfs.pool_name: pogo1
  description: ""
  name: pogo1
  driver: zfs
  used_by: []
  status: Created
  locations:
  - none
volume:
  config: {}
  description: ""
  name: dolibar
  type: container
  used_by: []
  location: none

Thanks.

Great. I think it will be a process of restoring the on-disk directory structure to something that lxd recover is happy with, and then hopefully it will be able to re-create the database entries from the backup.yaml files, as lxd recover is only designed to repopulate the DB and not to reset on-disk structures.

You may also have to manually create or re-populate any missing profiles or profile settings, as these are shared across several instances and are not stored in the backup.yaml files.
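
For example, if the default profile had lost its NIC and root disk entries, they could be re-added along these lines (a rough sketch only; br0 and pogo1 are taken from the backup.yaml above, so adjust them to your own setup):

lxc profile device add default eth0 nic nictype=bridged parent=br0
lxc profile device add default root disk path=/ pool=pogo1
lxc profile show default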

The “default” directory exists now, but judging from the timestamps below it was probably created when I ran the lxd init command, which was after I tried the lxd recover command. The pogo1 directory does not exist. So I will try lxd recover using “pogo1” as the storage pool.

root@pogo:/var/snap/lxd/common/lxd/storage-pools# ll
total 12
drwx--x--x  3 root root 4096 Oct  3 11:38 ./
drwx--x--x 17 root root 4096 Oct  3 11:17 ../
drwx--x--x 10 root root 4096 Oct  3 19:36 default/
root@pogo:/var/snap/lxd/common/lxd/storage-pools# cd default
root@pogo:/var/snap/lxd/common/lxd/storage-pools/default# ll
total 40
drwx--x--x 10 root root 4096 Oct  3 19:36 ./
drwx--x--x  3 root root 4096 Oct  3 11:38 ../
drwx--x--x  2 root root 4096 Oct  3 19:36 buckets/
drwx--x--x  4 root root 4096 Oct  3 22:40 containers/
drwx--x--x  2 root root 4096 Oct  3 19:36 containers-snapshots/
drwx--x--x  2 root root 4096 Oct  3 19:36 custom/
drwx--x--x  2 root root 4096 Oct  3 19:36 custom-snapshots/
drwx--x--x  2 root root 4096 Oct  3 19:36 images/
drwx--x--x  2 root root 4096 Oct  3 19:36 virtual-machines/
drwx--x--x  2 root root 4096 Oct  3 19:36 virtual-machines-snapshots/
root@pogo:/var/snap/lxd/common/lxd/storage-pools/default# lxc storage list
+------+--------+--------+-------------+---------+-------+
| NAME | DRIVER | SOURCE | DESCRIPTION | USED BY | STATE |
+------+--------+--------+-------------+---------+-------+
root@pogo:/var/snap/lxd/common/lxd/storage-pools/default#

You may need to re-create those directories in the /var/snap/lxd/common/lxd/storage-pools/pogo1 directory (with the same permissions) to allow lxd recover to mount onto them.
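
Something along these lines should reproduce the layout and 0711 permissions shown for the default pool above (a sketch; adjust the list of directories to whatever your LXD version actually creates):

mkdir -p /var/snap/lxd/common/lxd/storage-pools/pogo1
cd /var/snap/lxd/common/lxd/storage-pools/pogo1
mkdir containers containers-snapshots custom custom-snapshots images virtual-machines virtual-machines-snapshots
chmod 711 . *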

It appears that lxd recover is happier with pogo1, since it created the pogo1 directory and a containers directory. However, since the first container directory that it tried did not have a backup.yaml file, it gave up and quit.

Searching through the containers, I’m finding that some do not have a backup.yaml file. However, those seem to be the ones that I don’t care about. The ones that I really need have a backup.yaml file, so I’m planning to rename those filesystems to get them out of lxd recover's path and try again.
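
In case it helps anyone else, parking a dataset out of lxd recover's way is just a zfs rename (a sketch; "unwanted" is a hypothetical container name, not one of my real datasets):

zfs list -r pogo1/containers
zfs rename pogo1/containers/unwanted pogo1/unwanted-parked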

Out of curiosity, did LXD always have a backup.yaml file? Or is this something that happened sometime after 2.0?

Answering my own question: I have a system running LXD 2.0.11 and none of the containers have a backup.yaml file.

So the next question is when was it introduced?

I’m asking since I need to bring the other server into the modern age and hope to avoid the issues that I’ve had on my current project. So I definitely want to make sure that I at least have a backup.yaml file before I migrate to the snap version of LXD.
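
For that check, something like this should do on the old deb install (a sketch; note that with ZFS the container datasets have to be mounted for the files to be visible):

find /var/lib/lxd/storage-pools -maxdepth 4 -name backup.yaml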

And the further question: Is there documentation and/or release notes that I should read before I start that upgrade process? In other words, what did I miss before I started my current project?

I now have these results:

Error: Failed validation request: Failed checking volumes on pool "pogo1": Failed parsing backup file "/var/snap/lxd/common/lxd/storage-pools/pogo1/containers/fs3/backup.yaml": open /var/snap/lxd/common/lxd/storage-pools/pogo1/containers/fs3/backup.yaml: no such file or directory

However, /var/snap/lxd/common/lxd/storage-pools/pogo1/containers/fs3/backup.yaml does exist:

root@pogo:/# cd /var/snap/lxd/common/lxd/storage-pools/pogo1/containers/fs3/
root@pogo:/var/snap/lxd/common/lxd/storage-pools/pogo1/containers/fs3# ll
total 21
d--x------+  4 root root    6 Mar 27  2019 ./
drwx--x--x  28 root root 4096 Oct  4 12:04 ../
-r--------   1 root root 9658 Oct  1 23:21 backup.yaml
-rw-r--r--   1 root root 1048 Aug 31  2018 metadata.yaml
drwxr-xr-x  22 root root   22 Oct  1 23:21 rootfs/
drwxr-xr-x   2 root root    7 Aug 31  2018 templates/

Any hints or suggestions?

I’m wondering if the snap package’s mount namespace is out of sync with the host’s, since you’ve been making manual changes.

Can you run:

sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- ll /var/snap/lxd/common/lxd/storage-pools/pogo1/containers/fs3

If the output there differs from what you see on the host, then restarting your computer should allow the namespaces to come back into sync.

Oct 04 2022 17:36:21 root@pogo:~# nsenter --mount=/run/snapd/ns/lxd.mnt -- ll /var/snap/lxd/common/lxd/storage-pools/pogo1/containers/fs3
nsenter: failed to execute ll: No such file or directory

So that is apparently the problem. I will try the reboot, but I’ll need to kick everyone off the system or wait until the workday is over.

How does this happen? As a very part-time administrator, I’d appreciate pointers to any fine manuals that can help me understand how this happened.

Snap packages use their own mount namespace. This is how they achieve the effect of using a different base operating system than what the host is actually running, which allows them to be used across many different host operating systems.

But your system has gotten into quite a mess, and in trying to fix it manually you have been creating directories in a different mount table from the snap’s: the old path was previously mounted, so the snap’s namespace is still seeing that old mount.
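
For anyone following along, comparing the two views is straightforward with standard tools (a sketch; the path is the fs3 one from above, and /run/snapd/ns/lxd.mnt is the usual location of the LXD snap's mount namespace):

ls -la /var/snap/lxd/common/lxd/storage-pools/pogo1/containers/fs3
nsenter --mount=/run/snapd/ns/lxd.mnt -- ls -la /var/snap/lxd/common/lxd/storage-pools/pogo1/containers/fs3
findmnt | grep pogo1
nsenter --mount=/run/snapd/ns/lxd.mnt -- findmnt | grep pogo1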

I agree that things have gotten quite messed up. The namespace issue was the culprit in this latest problem. The good news is that I’ve been able to recover all of the containers that have a backup.yaml file. The better news for me is that I was able to fix the namespace issue by unmounting the filesystems that I had manually mounted as root. I then let lxd recover do its own mounting, and it happily found the containers. I was able to start them, and they are doing their jobs.
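
For the record, the fix boiled down to something like this (a sketch from memory; "example-instance" stands in for the real container names):

zfs unmount pogo1/containers/example-instance
lxd recover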

At the end of this, there are several lessons that I learned from the upgrade process. I’m going to document those in a separate post and mark it as the solution in the hopes of sparing someone else what I went through.

@tomp - Thanks a bundle for all your patience and help. I really appreciate it.

First of all, a big thank you to @tomp for helping me through this. This is my attempt to give back a little. Constructive criticism and/or links to Fine Manuals that I should have read are welcome.

Here are the lessons that I learned from the LXD upgrade and migration process that I wish I had known before I started. Hopefully this will spare someone the trouble that I had.

Lesson 1: LXD introduced the backup.yaml file sometime after LXD 2.0.11. I know the feature exists in LXD 3.0.3, but I don’t know exactly when it was introduced. So if you are migrating from LXD apt/deb to LXD snap, you should first upgrade to a version of LXD apt/deb that has the backup.yaml feature. You also need to make sure that you act on the container(s) in a way that triggers the creation of the backup.yaml file. I had some containers that I lost because they didn’t have a backup.yaml file. Fortunately, I really didn’t care about those containers.

Lesson 2: LXD (v4.17?) switched to legacy mountpoints on ZFS. I made the mistake of mounting the datasets myself, outside of the mount namespace of the LXD process (see the rest of the discussion for details). Of course, I didn’t know what a mount namespace was or how to use the nsenter command before I started this process.

Lesson 3: The LXD snap changes the directory structure of the containers from /var/lib/lxd/containers (LXD 2.0.11) to /var/snap/lxd/common/lxd/storage-pools/[pool-name]/containers. Because of this, lxd.migrate has to unmount and re-mount the containers in the new directory structure AND in the new mount namespace (since the snap runs in its own namespace). Therefore you have to convert any non-legacy ZFS mountpoints to legacy before you run the lxd.migrate script (see the mountpoint sketch after these lessons).

Lesson 4: I did not understand (and still can’t find docs on) what the expected results of lxd.migrate should be. So when lxd.migrate completed with errors, I didn’t know enough to stop and work through them. The end result should be that it asks the user whether they want to remove the LXD apt/deb version. I’m not sure what your answer should be at that point. However, if anything goes awry before then, do NOT do what I did and remove LXD apt/deb. Doing that removes your database and settings, lxd.migrate becomes useless, and you will have to resort to lxd recover.

Lesson 5: If you are trying to use lxd recover to salvage a failed migration, you will still, and especially, need all of the information above. Hopefully, by being aware of these things before you start the migration process, you won’t need to use lxd recover at all.

Lesson 6: Definitely take a backup of the lxd installation before you start the migration process. Unfortunately I am not knowledgeable enough to tell you where all the lxd files reside, so do a full system backup to be safe.

Lesson 7: If you must share a ZFS storage pool between LXD and other filesystems, you must at least give LXD its own filesystem that is not shared with other data. I made the mistake (5 years ago) of putting the LXD pool at the root of my ZFS pool. I’m sure that this was the root cause of lxd.migrate's failure. See @tomp’s analysis and my further comments below. LXD, quite reasonably, expects to be knowledgeable about and in control of everything inside the filesystem(s) that you assign to it.

Lesson 8: Document your commands and their results, especially if something goes awry. That is essential information for getting help. If I had done this from the beginning, I might know what went wrong with my first migration attempt, and my notes here might be more complete and helpful. (In hindsight, I didn’t do too badly. I did manage to capture the original lxd.migrate failure that led to @tomp’s insight into the root cause of the initial failure.)
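
For Lessons 2 and 3, checking and (if necessary) changing a dataset's mountpoint is easy enough with the zfs tooling (a sketch; "example-instance" is a hypothetical dataset name under my pogo1 pool):

zfs get -r mountpoint pogo1
zfs set mountpoint=legacy pogo1/containers/example-instance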

Thanks for the write-up! 🙂

So, going back to your original issue with lxd.migrate, you said it hung after failing to update the ZFS storage pool mount point. That was with LXD 4.0.9, so LXD still set a mount point at that stage.

The error suggests that when LXD ran zfs set mountpoint=/var/lib/snapd/hostfs/pogo1 pogo1, under the hood ZFS tried to unmount the existing mount at /var/lib/snapd/hostfs/pogo1, but that failed because the mount was still in use.

It then failed to create all of the other mount points below that.

=> Updating the storage backends
error: Failed to update the storage pools: Failed to run: nsenter --mount=/run/snapd/ns/lxd.mnt zfs set mountpoint=/var/lib/snapd/hostfs/pogo1 pogo1: umount: /var/lib/snapd/hostfs/pogo1: target is busy.
cannot unmount '/var/lib/snapd/hostfs/pogo1': umount failed
cannot mount '/pogo1/backup': failed to create mountpoint
cannot mount '/pogo1/containers': failed to create mountpoint
cannot mount '/pogo1/deleted': failed to create mountpoint
cannot mount '/pogo1/deleted/images': failed to create mountpoint
cannot mount '/pogo1/gateway2': failed to create mountpoint
cannot mount '/pogo1/gateway3': failed to create mountpoint
cannot mount '/pogo1/home': failed to create mountpoint
cannot mount '/pogo1/images': failed to create mountpoint
cannot mount '/pogo1/kfse_backups': failed to create mountpoint
cannot mount '/pogo1/kms_backup': failed to create mountpoint
cannot mount '/pogo1/oldpooh2': failed to create mountpoint
cannot mount '/pogo1/psql-attempt1': failed to create mountpoint
cannot mount '/pogo1/snapshots': failed to create mountpoint
cannot mount '/pogo1/snapshots/dc3': failed to create mountpoint
cannot mount '/pogo1/snapshots/fs3': failed to create mountpoint
cannot mount '/pogo1/snapshots/samba': failed to create mountpoint
property may be set but unable to remount filesystem

Now, lxd.migrate shouldn’t hang like that in that situation. But the root cause of the problem appears to be that ZFS and snap have got their mount tables confused again and something was still holding the mount open. All of the instances should have been stopped by then, so potentially it was something else holding it open.

So my recommendation, especially if using ZFS, would be to reboot your machine and stop all instances before running lxd.migrate to ensure there is a clean mount table and no processes holding it open.
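
If you want to check for stray mounts or processes before running lxd.migrate, standard tools will show them (a sketch; on the host the pool was mounted under /pogo1, which is what /var/lib/snapd/hostfs/pogo1 maps to inside the snap's namespace):

findmnt | grep pogo1
fuser -vm /pogo1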

I suspect that if that had been done, rather than removing LXD via apt, then lxd.migrate could have been re-run successfully.

I just tried this in a fresh Ubuntu Bionic VM, going from LXD 3.0.3 deb to LXD 4.0.9 snap with a single running container on the ZFS pool.

Before:

zfs list
NAME                                                                          USED  AVAIL  REFER  MOUNTPOINT
zfs                                                                           226M  13.2G    24K  none
zfs/containers                                                               2.88M  13.2G    24K  none
zfs/containers/c1                                                            2.85M  13.2G   222M  /var/lib/lxd/storage-pools/zfs/containers/c1
zfs/custom                                                                     24K  13.2G    24K  none
zfs/deleted                                                                    24K  13.2G    24K  none
zfs/images                                                                    222M  13.2G    24K  none
zfs/images/afba58aa16219124c4da851b91bd59f012ea955b982961bad7218afdabf6e89e   222M  13.2G   222M  none
zfs/snapshots                                                                  24K  13.2G    24K  none
root@v1:/# lxd.migrate 
=> Connecting to source server
=> Connecting to destination server
=> Running sanity checks

=== Source server
LXD version: 3.0.3
LXD PID: 2171
Resources:
  Containers: 1
  Images: 1
  Networks: 1
  Storage pools: 2

=== Destination server
LXD version: 4.0.9
LXD PID: 12832
Resources:
  Containers: 0
  Images: 0
  Networks: 0
  Storage pools: 0

The migration process will shut down all your containers then move your data to the destination LXD.
Once the data is moved, the destination LXD will start and apply any needed updates.
And finally your containers will be brought back to their previous state, completing the migration.

Are you ready to proceed (yes/no) [default=no]? y
=> Shutting down the source LXD
=> Stopping the source LXD units
=> Stopping the destination LXD unit
=> Unmounting source LXD paths
=> Unmounting destination LXD paths
=> Wiping destination LXD clean
=> Backing up the database
=> Moving the data
=> Updating the storage backends
=> Starting the destination LXD
=> Waiting for LXD to come online

=== Destination server
LXD version: 4.0.9
LXD PID: 13262
Resources:
  Containers: 1
  Images: 1
  Networks: 1
  Storage pools: 2

The migration is now complete and your containers should be back online.
Do you want to uninstall the old LXD (yes/no) [default=yes]? yes

All done. You may need to close your current shell and open a new one to have the "lxc" command work.
To migrate your existing client configuration, move ~/.config/lxc to ~/snap/lxd/common/config

After:

root@v1:/# zfs list
NAME                                                                          USED  AVAIL  REFER  MOUNTPOINT
zfs                                                                           226M  13.2G    24K  none
zfs/containers                                                               3.03M  13.2G    24K  none
zfs/containers/c1                                                            3.00M  13.2G   223M  none
zfs/custom                                                                     24K  13.2G    24K  none
zfs/deleted                                                                   120K  13.2G    24K  none
zfs/deleted/containers                                                         24K  13.2G    24K  none
zfs/deleted/custom                                                             24K  13.2G    24K  none
zfs/deleted/images                                                             24K  13.2G    24K  none
zfs/deleted/virtual-machines                                                   24K  13.2G    24K  none
zfs/images                                                                    222M  13.2G    24K  none
zfs/images/afba58aa16219124c4da851b91bd59f012ea955b982961bad7218afdabf6e89e   222M  13.2G   222M  none
zfs/snapshots                                                                  24K  13.2G    24K  none
zfs/virtual-machines                                                           24K  13.2G    24K  none

I then went from LXD 4.0.9 snap to LXD 5.6 snap:

snap refresh lxd --channel=latest/stable

After:

root@v1:~# zfs list
NAME                                                                          USED  AVAIL  REFER  MOUNTPOINT
zfs                                                                           226M  13.2G    24K  legacy
zfs/buckets                                                                    24K  13.2G    24K  legacy
zfs/containers                                                               3.03M  13.2G    24K  legacy
zfs/containers/c1                                                            3.00M  13.2G   223M  none
zfs/custom                                                                     24K  13.2G    24K  legacy
zfs/deleted                                                                   144K  13.2G    24K  legacy
zfs/deleted/buckets                                                            24K  13.2G    24K  legacy
zfs/deleted/containers                                                         24K  13.2G    24K  legacy
zfs/deleted/custom                                                             24K  13.2G    24K  legacy
zfs/deleted/images                                                             24K  13.2G    24K  legacy
zfs/deleted/virtual-machines                                                   24K  13.2G    24K  legacy
zfs/images                                                                    222M  13.2G    24K  legacy
zfs/images/afba58aa16219124c4da851b91bd59f012ea955b982961bad7218afdabf6e89e   222M  13.2G   222M  none
zfs/snapshots                                                                  24K  13.2G    24K  none
zfs/virtual-machines                                                           24K  13.2G    24K  legacy

And there we can see the legacy mountpoint has been applied (to stop ZFS from controlling the mount points).

Running instances don’t have their mountpoint set to legacy until their next restart.
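
For reference, a dataset whose mountpoint has been set to legacy can still be mounted by hand if you need to look inside it (a sketch; zfs/containers/c1 is just an example dataset name from the test above, and /mnt is an arbitrary target):

mount -t zfs zfs/containers/c1 /mnt
ls /mnt
umount /mnt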

If you desperately need those containers, you could manually create a backup.yaml file in the ZFS dataset (based on the others you have) and edit it to make it relevant to the container, so as not to conflict with the others. Then you can re-run lxd recover and it will find them (once you move them back into place). You can run lxd recover multiple times, and it will only try to restore missing instances.

Excellent analysis! I believe that you are spot on with the idea that the mount was still in use. This points to and further reinforces Lesson 7 above. I will edit that point to make it clearer that this was likely the root cause of all my problems. Specifically, I had other filesystems inside the pogo1 ZFS pool that were not related to or owned by LXD. I’m sure all of those non-LXD filesystems were mounted, if not in active use. So,

error: Failed to update the storage pools: Failed to run: nsenter --mount=/run/snapd/ns/lxd.mnt zfs set mountpoint=/var/lib/snapd/hostfs/pogo1 pogo1: umount: /var/lib/snapd/hostfs/pogo1: target is busy

was the direct result of still having mounted sub-filesystems inside of pogo1.

Lesson 7 should be that LXD expects to have complete control and knowledge of everything inside the source/filesystem that you assign to it. So don’t go creating other stuff inside the filesystem that you assign to lxd.

That was a novice mistake that I made 5+ years ago and I didn’t even know it until now. Hopefully someone can read this and head off a disaster before it starts.

Thanks for the tip. I have the good fortune of really not needing those containers. Since I have another migration that I will eventually need to do, I’m comforted that not having a backup.yaml file is not the end of the world. Hopefully my experience here will help me avoid ever needing this tip!
