Cannot move VMs, volumes now exist in cluster db but not in ZFS

I ran into issues using incus move that left me in a state where the VM was still on its original host, but the storage volume was present on both hosts.

In retrospect it obviously wasn’t very smart to use the zfs destroy command to get rid of these volumes, because I later noticed they still show up in the output of incus storage volume list. I probably should have used incus commands to delete them instead, but in any case this is my current situation:

  • incus storage volume list shows one virtual-machine volume with a “used by” count of 1, as well as one snapshot, for a VM that does not exist on the location specified.
  • zfs list confirms this volume does not exist.
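
For reference, the comparison I’m making is roughly the following ([pool] and [zfs-pool] are placeholders, and the virtual-machines/ dataset path is just how the ZFS driver lays things out on my setup, so adjust as needed). What Incus thinks exists in the pool:

incus storage volume list [pool]

versus what actually exists in ZFS on this host (datasets, zvols and snapshots):

zfs list -r -t all [zfs-pool]/virtual-machines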

Is there any way to “force delete” the volume from incus? I was hoping it could somehow detect that the underlying volume no longer exists and let me delete it anyway, but I get:

Error: Storage volumes of type "virtual-machine" cannot be deleted with the storage API

As an aside, I think that even if I hadn’t used zfs destroy I probably would have ended up with issues anyway, since the incus move command did result in the same volume being duplicated across two nodes without the VM actually being moved.

Any guidance on how to resolve issues like this? It’s not a huge issue, but it’s not exactly comfortable either that this hostname is now reserved forever, since creating a new instance with the same name causes a database constraint violation:

Error: Failed instance creation: Failed creating instance from image: Error inserting volume "foo" for project "default" in pool "bar" of type "virtual-machines" into database "UNIQUE constraint failed: index 'storage_volumes_unique_storage_pool_id_node_id_project_id_name_type'"

I can also mention that I have two additional “ghost volumes”, which I assume were temporary but never got cleaned up correctly: one called move-of-2014907846865291386, and a snapshot with the same name as the VM (which is how I know they’re related).

I managed to find this thread, which discusses a very similar issue. Unfortunately it mentions the lxd sql command, and from what I can see incus sql no longer exists, so I can’t try it.

I tried migrating a new VM between two hosts again, so just for the record here is what happens:

  • I run incus move [vm] --target [host] -v, which gives me:
Error: Migration operation failure: Instance move to destination failed: Error transferring instance data: Failed migration on target: Failed loading storage pool: Storage pool not found
  • After this Storage pool not found error, I run incus storage volume list [pool] (this pool has the same name across all hosts, as is required I believe), which surprisingly shows two volumes with the same name, both with a “used by” value of 1: one on the original host, and one on the target host.
  • incus ls shows this VM as having not moved, and it can be started up again.
  • If I stop it again, and run the same incus move command as before, I get:
Error: Migration operation failure: Instance move to destination failed: Error transferring instance data: Failed migration on target: Failed creating instance on target: Cannot create volume, already exists on migration target storage

So the migration fails, but the volume is created, and the migration then cannot be retried because the volume already exists.
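
For completeness, I believe the target side can be inspected directly with something like this (bracketed names are placeholders, same as above):

incus storage show [pool] --target [host]
incus storage volume show [pool] virtual-machine/[vm] --target [host]

The first should confirm whether the pool really is defined on the target member, and the second shows the duplicate volume the failed migration left behind there.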

I of course also cannot delete the duplicate volume:

# incus storage volume delete [pool] virtual-machine/[vm] --target [host]
Error: Storage volumes of type "virtual-machine" cannot be deleted with the storage API

I can, however, delete the VM, which leaves the duplicate volume floating on the target host.

Finally, this workaround seems to work:

  • Export a backup to the “target” host:
incus export --instance-only manual-migration /var/lib/incus/backups/manual.tar.gz
  • Delete the VM:
incus rm manual-migration --force
  • Import it on the target host:
incus import /var/lib/incus/backups/manual.tar.gz
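
(Not shown above: the backup tarball of course has to end up on the target host between the export and the import. Something as simple as scp would do, and the path is arbitrary:)

scp /var/lib/incus/backups/manual.tar.gz [host]:/var/lib/incus/backups/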

But it would be nice to 1) be able to use incus move and 2) somehow get rid of the “ghost” volumes.

FYI, it’s incus admin sql. Admittedly it’s somewhat hidden: it’s not shown in the output of incus admin for example, and there’s no incus admin --all to show less-common commands.
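
For example, poking at the global database looks like this (the SELECT is only an illustration; .dump gives you the whole schema plus data):

incus admin sql global "SELECT id, name FROM storage_volumes"
incus admin sql global .dump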

It’s mentioned in passing: How to back up an Incus server - Incus documentation

So, if I understand the previously linked post correctly, I should be able to (safely):

  1. Make sure the datasets are deleted in ZFS, i.e. they no longer appear in the output of zfs list.
  2. Run incus storage volume list local to confirm they still show up as volumes.
  3. Run incus admin sql global .dump | grep [name] to confirm that they are present in the database.
  4. Manually delete the entries I have confirmed do not exist in ZFS from the database, by id: incus admin sql global 'delete from storage_volumes where id = n'. For volumes with snapshots, this will also remove those snapshots from the database due to the foreign key relationship.

In my case I have one volume with a related snapshot [1065, ALPHA], as well as one volume with no snapshot [1081, BETA]:

# incus admin sql global .dump | grep [my name]
INSERT INTO storage_volumes VALUES(1065,'ALPHA',1,1,3,'',1,1,'2024-04-16 14:13:46.837011140+00:00');
INSERT INTO storage_volumes VALUES(1081,'BETA',1,1,3,'',1,1,'2024-04-17 08:39:13.151275320+00:00');

# incus admin sql global .dump | grep 1065
INSERT INTO storage_volumes VALUES(1065,'ALPHA',1,1,3,'',1,1,'2024-04-16 14:13:46.837011140+00:00');
INSERT INTO storage_volumes_snapshots VALUES(1066,1065,'2024_04_16','','0001-01-01 00:00:00+00:00','2024-04-16 14:17:39.565433740+00:00');

# incus admin sql global .dump | grep 1081
INSERT INTO storage_volumes VALUES(1081,'BETA',1,1,3,'',1,1,'2024-04-17 08:39:13.151275320+00:00');
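
Assuming the plan above is sound, the actual cleanup for my two volumes would presumably be something like the following (IDs taken from the dump above). First double-check that these are exactly the rows I expect:

incus admin sql global "SELECT id, name FROM storage_volumes WHERE id IN (1065, 1081)"

and then remove them; the snapshot row 1066 should disappear along with 1065 via the foreign key:

incus admin sql global "DELETE FROM storage_volumes WHERE id IN (1065, 1081)"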

Not looking for any guarantees, but any input is appreciated :upside_down_face: