lxd-to-incus migration did not succeed, and I appear to have lost my instances

I was trying to run the lxd-to-incus tool, but it failed because it couldn’t remove the /var/lib/incus directory, since that directory is a mount point for a btrfs subvolume.

So I made a small local modification to the lxd-to-incus tool to force it to continue even without removing the /var/lib/incus directory: Update main.go · tarruda/incus@f199c15 · GitHub

The migration appeared to be going well, but after the data was migrated the tool hit a bunch of errors because it couldn’t remove some LXD data (read-only filesystem) and exited with an error.

After that, I restarted the daemons, but now I don’t see any instances with either incus ls or lxc ls. Looking at /var/lib/incus/lxd, it still seems to contain all the LXD data, but now everything is in a single directory instead of being split into separate per-instance subvolumes (I assume that prevents me from simply copying everything back to the old lxd dir, since the subvolumes are gone there too).

I only have a handful of instances, so I can probably restore the VM/container data manually, but how can I recover the instance configurations? If there’s a way to dump the old instances’ configs to YAML files, I can probably restore everything the way it was.

Hey there,

Can you show what you have under /var/lib/incus/ exactly?
There’s still a very good chance we can just move things around and get that started up.

Also, any more detail about your btrfs setup would be useful so I can reproduce it and make sure that lxd-to-incus handles that case for other users.

Here’s my /var/lib/incus:

# tree -L 2
.
├── backups
│   ├── custom
│   └── instances
├── containers
├── containers-snapshots
├── database
│   ├── global
│   └── local.db
├── devices
├── disks
├── guestapi
│   └── sock
├── images
├── lxd
│   ├── backups
│   ├── containers
│   ├── database
│   ├── devices
│   ├── devlxd
│   ├── disks
│   ├── images
│   ├── networks
│   ├── security
│   ├── server.crt
│   ├── server.key
│   ├── shmounts -> /var/snap/lxd/common/shmounts/instances
│   ├── snapshots
│   ├── storage-pools
│   ├── unix.socket
│   ├── virtual-machines
│   └── virtual-machines-snapshots
├── networks
├── seccomp.socket
├── security
│   ├── apparmor
│   └── seccomp
├── server.crt
├── server.key
├── shmounts
├── storage-pools
├── unix.socket
├── unix.socket.user
├── virtual-machines
└── virtual-machines-snapshots

33 directories, 11 files

Here’s my /var/snap/lxd/common/lxd after I ran the tool:

# tree -L 3
.
├── backups
│   ├── custom
│   └── instances
├── cache
│   └── instance_types.yaml
├── containers
├── database
│   ├── global
│   │   ├── 0000000000000001-0000000000000001
│   │   ├── metadata1
│   │   ├── open-1
│   │   ├── open-2
│   │   └── open-3
│   └── local.db
├── devices
├── devlxd
├── disks
├── images
├── logs
│   └── lxd.log
├── networks
├── seccomp.socket
├── security
│   ├── apparmor
│   │   ├── cache
│   │   └── profiles
│   └── seccomp
├── server.crt
├── server.key
├── shmounts -> /var/snap/lxd/common/shmounts/instances
├── snapshots
├── storage-pools
│   └── default
│       ├── images
│       └── virtual-machines-snapshots
├── unix.socket
├── virtual-machines
└── virtual-machines-snapshots

As far as I can tell, everything under /var/lib/incus/lxd is a copy of what I had in /var/snap/lxd/common/lxd before running the tool. I’m currently backing everything up to an external drive before trying to restore.

My setup is simple: I have two NVMe drives:

  • A 512 GB one where I store my system and home directory
  • A 1 TB one which is dedicated to LXD/Incus

Both drives are LUKS-formatted with btrfs on top.

The second NVMe drive had a single subvolume called @lxd, which I mounted at /var/snap/lxd. Before I installed Incus, I created a second subvolume called @incus and mounted it under /var/lib/incus (so that everything Incus-related lives in that subvolume).
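Roughly how that layout was set up, in case it helps with reproducing it; the device and mapper names below are placeholders, only the subvolume names and mount points are the real ones:

# Placeholder device/mapper names; only @lxd/@incus and the mount points are real.
cryptsetup open /dev/nvme1n1p1 nvme-pool        # LUKS container on the 1 TB drive
mount -o subvolid=5 /dev/mapper/nvme-pool /mnt  # mount the btrfs top level
btrfs subvolume create /mnt/@incus              # @lxd already existed
umount /mnt

mount -o subvol=@lxd /dev/mapper/nvme-pool /var/snap/lxd
mount -o subvol=@incus /dev/mapper/nvme-pool /var/lib/incus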

So before starting the migration I had both Incus and LXD installed, with both subvolumes mounted. Since lxd-to-incus failed when it tried to remove a mount point, I made that modification to ignore the error.

I lost the lxd-to-incus failure output, but I remember it complaining about not being able to remove files because of read-only filesystems (I think it was trying to remove some VM snapshots, because those remained intact in /var/snap/lxd/common/lxd).

It seems /var/lib/incus/lxd/database is still intact: I can open db.bin with sqlite3, and select * from instances; lists my instances.

Is there a way I can dump the config file for each instance using the db file? That should allow me to restore everything in a new Incus installation.
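For example, I can already get at the raw key/value pairs with something like this (table names taken from what .schema shows, so adjust if yours differ), but turning that into proper YAML by hand seems error-prone:

# Run from the directory that contains db.bin.
sqlite3 db.bin "SELECT i.name, c.key, c.value
                FROM instances AS i
                JOIN instances_config AS c ON c.instance_id = i.id
                ORDER BY i.name, c.key;"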

Okay, so what I’d do is:

  • systemctl stop incus.service incus.socket
  • Move everything in /var/lib/incus EXCEPT the lxd folder out of the way, maybe into a backup folder or something.
  • Move the content of /var/lib/incus/lxd/ to /var/lib/incus/
  • systemctl start incus.socket
  • incus list

The goal is basically to empty /var/lib/incus/ and move everything inside /var/lib/incus/lxd/ up into /var/lib/incus/ directly, roughly as sketched below. That should get you everything back, but there may be some DB mangling needed depending on how that goes.
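Put together, and assuming the backup folder stays on the same subvolume so the moves are simple renames, that looks like:

# Backup folder name is just an example.
systemctl stop incus.service incus.socket

mkdir /var/lib/incus/old-incus-backup
# Move everything except the migrated lxd/ folder (and the backup folder itself) aside.
find /var/lib/incus -mindepth 1 -maxdepth 1 ! -name lxd ! -name old-incus-backup \
     -exec mv -t /var/lib/incus/old-incus-backup/ {} +

# Promote the migrated LXD data to the top of /var/lib/incus.
mv /var/lib/incus/lxd/* /var/lib/incus/
rmdir /var/lib/incus/lxd

systemctl start incus.socket
incus list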

The migration tool checks whether the source path is a mount and, if so, moves it. But in your case /var/snap/lxd is the mount, not /var/snap/lxd/common/lxd, so that check wouldn’t have triggered.

We don’t currently have a check for the target path; the tool just blows /var/lib/incus away before moving the data, which would likely fail if it’s a mountpoint. I’ll add a check for this case and have it print an error telling you to make sure that the target path isn’t a mountpoint, since it needs to be deleted.
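Until that check lands, it’s easy to verify by hand before running the tool; the target data directory should not itself be a mountpoint:

mountpoint /var/lib/incus    # should report "is not a mountpoint" before migrating
findmnt /var/lib/incus       # shows what is mounted there, if anything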

I did that, but incus is failing to start with the following message:

time="2023-11-02T15:49:43-03:00" level=error msg="Failed to start the daemon" err="Failed to start dqlite server: raft_start(): io: LZ4 not available"

Any hints on how to fix that? I already have liblz4-1 and lz4 installed on my system

Ah yeah, that one is easy.

Look at the files in /var/lib/incus/database/global/ and identify which ones are LZ4-compressed with file /var/lib/incus/database/global/*.

Then decompress them all using the lz4 -d command line tool.

This is normally done automatically by our migration tool, but it obviously didn’t get that far before failing in your case.
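Something like this should take care of it; the grep pattern and temporary file name are just a sketch, so check the file output on your machine first:

cd /var/lib/incus/database/global/

# See which segments are LZ4-compressed.
file * | grep -i 'lz4'

# Decompress each one in place (write to a temporary name, then replace the original).
for f in $(file * | grep -i 'lz4' | cut -d: -f1); do
    lz4 -d "$f" "$f.raw" && mv "$f.raw" "$f"
done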


Worked perfectly, thanks!

Cool, is your Incus fully back online now?

If it is, can you show incus storage list as well as ls -lh /var/lib/incus/containers and ls -lh /var/lib/incus/virtual-machines to see if anything else needs fixing.

I was able to dump the configs by running incus config show for each instance, but something strange happened: my virtual machine and container disk images disappeared from /var/lib/incus after I tried to start a VM with incus start vm. I don’t know why that happened, but I had taken btrfs snapshots beforehand and will try again.
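For reference, the dump was basically a loop like this (the output directory is just an example):

# Dump every instance's config to a YAML file.
mkdir -p ~/incus-config-backup
for name in $(incus list --format csv --columns n); do
    incus config show "$name" > ~/incus-config-backup/"$name".yaml
done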

BTW, I just tried to remove LXD using snap remove --purge lxd and it failed with this error:

- Remove data for snap "lxd" (26093) (unlinkat /var/snap/lxd/common/lxd/storage-pools/default/images/bca2237a85baf729a5025783ffa0bd51885e16f5b7bee76b7a73901cf76f3a71/root.img: read-only file system)

I suspect this is what caused the lxd-to-incus migration to fail. I will have to manually delete those btrfs subvolumes.
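Manually clearing them should look something like this, listing first and using the image from the snap error above as an example:

# List leftover subvolumes under the old LXD path before deleting anything.
btrfs subvolume list /var/snap/lxd

# Clear the read-only flag, then delete (example: the image from the error above).
SUBVOL=/var/snap/lxd/common/lxd/storage-pools/default/images/bca2237a85baf729a5025783ffa0bd51885e16f5b7bee76b7a73901cf76f3a71
btrfs property set -ts "$SUBVOL" ro false
btrfs subvolume delete "$SUBVOL"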

I was able to fully access my VMs again by moving the lxd directory back to /var/snap/lxd and reinstalling LXD.

Something worth noting from when the lxd data was under /var/lib/incus: the entries in /var/lib/incus/virtual-machines were still symlinks to /var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/ (which was empty because of the lxd-to-incus migration). I suspect this is what led Incus to delete the VM files when I ran incus start on one of my VMs.
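A quick way to spot that kind of leftover indirection before starting anything:

# List per-instance entries that are symlinks and where they point; anything still
# pointing into /var/snap/lxd/... after a migration is suspect.
find /var/lib/incus/virtual-machines /var/lib/incus/containers \
     -maxdepth 1 -type l -exec ls -l {} +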

I will try the migration again, this time by recreating each instance on a clean Incus install and moving the data manually.

@stgraber another thing is that lxd-to-incus copied everything into a single lxd directory under /var/lib/incus instead of mirroring the original btrfs subvolume structure. This broke some custom images I had, because LXD can no longer use a btrfs snapshot when creating a new instance based on the image. For example, here’s my lxc image list:

 $ lxc image list
+---------------------------+--------------+--------+-------------------------------------+--------------+-----------------+------------+------------------------------+
|           ALIAS           | FINGERPRINT  | PUBLIC |             DESCRIPTION             | ARCHITECTURE |      TYPE       |    SIZE    |         UPLOAD DATE          |
+---------------------------+--------------+--------+-------------------------------------+--------------+-----------------+------------+------------------------------+
| debian-gnome-flashback-vm | bca2237a85ba | no     |                                     | x86_64       | VIRTUAL-MACHINE | 3034.02MiB | Aug 18, 2022 at 6:51pm (UTC) |
+---------------------------+--------------+--------+-------------------------------------+--------------+-----------------+------------+------------------------------+
| win10pro-oobe             | 1b9ac49a5cf8 | no     |                                     | x86_64       | VIRTUAL-MACHINE | 7167.66MiB | Sep 2, 2022 at 12:01pm (UTC) |
+---------------------------+--------------+--------+-------------------------------------+--------------+-----------------+------------+------------------------------+
|                           | 96f4afd383c1 | no     | Ubuntu jammy arm64 (20231102_07:42) | aarch64      | CONTAINER       | 112.11MiB  | Nov 2, 2023 at 1:28pm (UTC)  |
+---------------------------+--------------+--------+-------------------------------------+--------------+-----------------+------------+------------------------------+
|                           | 982883c9a347 | no     | Ubuntu jammy amd64 (20231102_07:42) | x86_64       | CONTAINER       | 118.06MiB  | Nov 2, 2023 at 1:28pm (UTC)  |
+---------------------------+--------------+--------+-------------------------------------+--------------+-----------------+------------+------------------------------+
|                           | de71bb8b924b | no     | Ubuntu jammy arm64 (20231102_07:42) | aarch64      | VIRTUAL-MACHINE | 263.67MiB  | Nov 2, 2023 at 1:28pm (UTC)  |
+---------------------------+--------------+--------+-------------------------------------+--------------+-----------------+------------+------------------------------+

If I try:

$ lxc launch debian-gnome-flashback-vm debian-desktop
Creating debian-desktop
Error: Failed creating instance from image: Failed to run: btrfs subvolume snapshot /var/snap/lxd/common/lxd/storage-pools/default/images/bca2237a85baf729a5025783ffa0bd51885e16f5b7bee76b7a73901cf76f3a71 /var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/debian-desktop: exit status 1 (ERROR: Not a Btrfs subvolume: Invalid argument)

What would be the best way to fix this? My intuition says I should simply recreate the subvolume and move the data into it, but I will wait for your advice.
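For concreteness, what I have in mind is roughly the following, using the image from the error above, but I haven’t run it yet:

# Untested idea: turn the plain image directory back into a real btrfs subvolume
# so that "btrfs subvolume snapshot" works on it again.
IMG=/var/snap/lxd/common/lxd/storage-pools/default/images/bca2237a85baf729a5025783ffa0bd51885e16f5b7bee76b7a73901cf76f3a71

mv "$IMG" "$IMG.flat"
btrfs subvolume create "$IMG"
cp -a --reflink=auto "$IMG.flat"/. "$IMG"/
rm -rf "$IMG.flat"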

Hey there,

Any chance I can get access to the system?
Would probably be a lot faster to get it fixed up.

I do have a commit in my local branch of lxd-to-incus which adds the target mountpoint check, so this kind of setup should straight up fail to migrate in the future.

@stgraber I decided to wipe everything and start fresh with all my instances. Since I backed up all the VM disk images and configurations, I can recreate the VMs on Incus and put the disk images back in place.

In any case, thanks a lot for offering to fix it; I really appreciate it.