ZFS Storage Volume Refresh Fails

I recently migrated my 4-node LXD cluster to Incus. Since then I’ve had issues with refreshing storage volumes backed by ZFS across nodes in the cluster. This is the error:

$ incus storage volume cp local/apache2 local/apache2 --target=daphne --destination-target=yogi --refresh
Error: Failed to run: zfs destroy -r local/custom/default_apache2: exit status 1 (cannot destroy 'local/custom/default_apache2': dataset is busy)

If I detach the volume from the instance, delete it, and try a full copy, it works without issue. If I then re-attach the volume to the instance and try to refresh, I get the same error again.
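For reference, these are the kinds of checks I can run on the member where the destroy fails, to see what keeps the dataset busy (the dataset path is the one from the error above):

zfs list -t all -r local/custom/default_apache2
zfs get mounted,mountpoint local/custom/default_apache2
grep default_apache2 /proc/mounts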

If I don’t attach the volume to the instance first and try to refresh, I get this error:

$ incus storage volume cp local/apache2 local/apache2 --target=daphne --destination-target=yogi --refresh
Error: Failed to run: zfs snapshot -r local/custom/default_apache2@copy-4684ffb1-fbe7-47a7-8af1-8431be5f70e9: exit status 2 (cannot open 'local/custom/default_apache2': dataset does not exist
usage:
	snapshot [-r] [-o property=value] ... <filesystem|volume>@<snap> ...

For the property list, run: zfs set|get

For the delegated permission list, run: zfs allow|unallow)

Any ideas why this is happening?

Hey,

For the second case, the one that's failing on zfs snapshot, can you show zfs list -t all on that system (should be daphne)?

daphne is the source; yogi is the target. Here is zfs list -t all from both:

It looks like the correct ZFS path should be local/lxd/custom/default_apache2. This worked before the migration, and I haven't changed the storage pool location since then.
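To double-check, listing the datasets under that prefix on daphne should show the custom volumes sitting there rather than under local/custom:

zfs list -t all -r local/lxd/custom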

Ah, interesting. Can you show:

  • incus admin sql global "SELECT * FROM nodes"
  • incus admin sql global "SELECT * FROM storage_pools"
  • incus admin sql global "SELECT * FROM storage_pools_config"

I wonder if some incorrect storage pool config is causing the local vs local/lxd discrepancy.
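Another quick way to compare the per-member view, without going through SQL, should be something like:

incus storage show local --target daphne
incus storage show local --target yogi

That should include source and zfs.pool_name as each cluster member has them recorded.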

Looks correct here, but I notice the ZFS path for the destination node (yogi) is just local/.

$ incus admin sql global "SELECT * FROM nodes" 
+----+---------+-------------+-----------------+--------+----------------+--------------------------------+-------+------+-------------------+
| id |  name   | description |     address     | schema | api_extensions |           heartbeat            | state | arch | failure_domain_id |
+----+---------+-------------+-----------------+--------+----------------+--------------------------------+-------+------+-------------------+
| 2  | booboo  |             | 10.10.0.20:8443 | 70     | 367            | 2024-01-17T21:35:29.380696868Z | 0     | 2    | <nil>             |
| 3  | yogi    |             | 10.10.0.30:8443 | 70     | 367            | 2024-01-17T21:35:25.240444074Z | 0     | 2    | <nil>             |
| 8  | meatwad |             | 10.10.0.40:8443 | 70     | 367            | 2024-01-17T21:35:26.650043144Z | 0     | 2    | <nil>             |
| 9  | daphne  |             | 10.10.0.10:8443 | 70     | 367            | 2024-01-17T21:35:25.993450212Z | 0     | 2    | <nil>             |
+----+---------+-------------+-----------------+--------+----------------+--------------------------------+-------+------+-------------------+
$ incus admin sql global "SELECT * FROM storage_pools"
+----+-------+--------+-------------+-------+
| id | name  | driver | description | state |
+----+-------+--------+-------------+-------+
| 1  | local | zfs    |             | 1     |
+----+-------+--------+-------------+-------+
$ incus admin sql global "SELECT * FROM storage_pools_config"
+----+-----------------+---------+-----------------------------+-----------+
| id | storage_pool_id | node_id |             key             |   value   |
+----+-----------------+---------+-----------------------------+-----------+
| 5  | 1               | 2       | source                      | local     |
| 6  | 1               | 2       | volatile.initial_source     | /dev/sda3 |
| 7  | 1               | 2       | zfs.pool_name               | local     |
| 8  | 1               | 3       | source                      | local     |
| 9  | 1               | 3       | volatile.initial_source     | /dev/sda4 |
| 10 | 1               | 3       | zfs.pool_name               | local     |
| 40 | 1               | 8       | source                      | pool/lxd  |
| 41 | 1               | 8       | volatile.initial_source     | pool/lxd  |
| 42 | 1               | 8       | zfs.pool_name               | pool/lxd  |
| 43 | 1               | <nil>   | volume.zfs.remove_snapshots | true      |
| 47 | 1               | 9       | source                      | local/lxd |
| 48 | 1               | 9       | volatile.initial_source     | local/lxd |
| 49 | 1               | 9       | zfs.pool_name               | local/lxd |
+----+-----------------+---------+-----------------------------+-----------+

Right, so it’s a bit all over the place across your cluster.

But the config for both daphne and yogi looks consistent with the zfs list output.
That is, the pool is local/lxd on daphne and just local on yogi.
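Differing zfs.pool_name values per member are fine on their own, since each member runs its ZFS commands against its own configured pool. Very roughly, and with a made-up snapshot name, a cross-member copy ends up doing something like:

# on the source member daphne, where zfs.pool_name is local/lxd
zfs snapshot -r local/lxd/custom/default_apache2@copy-<uuid>
zfs send local/lxd/custom/default_apache2@copy-<uuid>   # streamed to the destination
# on the destination member yogi, where zfs.pool_name is local
zfs receive local/custom/default_apache2                # fed from that stream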

Can you try testing with a new empty volume?

incus storage volume create local test-refresh --target daphne
incus storage volume snapshot create local test-refresh snap0 --target daphne
incus storage volume copy local/test-refresh local/test-refresh --target daphne --destination-target yogi --refresh
incus storage volume snapshot create local test-refresh snap1 --target daphne
incus storage volume copy local/test-refresh local/test-refresh --target daphne --destination-target yogi --refresh
incus storage volume snapshot delete local test-refresh snap1 --target daphne
incus storage volume copy local/test-refresh local/test-refresh --target daphne --destination-target yogi --refresh
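While those run, it may also help to capture what Incus itself executes; running something like this on both daphne and yogi during the copy should show the underlying zfs commands and any errors:

incus monitor --type=logging --pretty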

When I run the refresh copy without a full copy first, it gives:

Error: Failed generating volume copy config: Storage volume "test-refresh" in project "default" of type "custom" does not exist on pool "local": Storage volume not found

Then, trying again with a full copy, then a refresh:

$ incus storage volume copy local/test-refresh local/test-refresh --target daphne --destination-target yogi
Storage volume copied successfully!
$ incus storage volume copy local/test-refresh local/test-refresh --target daphne --destination-target yogi --refresh
Error: Failed to transfer main volume: Failed to wait for receiver: cannot receive: failed to read from stream cannot restore to local/custom/default_test-refresh@snapshot-52a9bddd-988f-49fd-a603-8fe0037bd37e: destination already exists : exit status 1

I get the same error each time, including after creating and then deleting snap1.
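In case it's useful, this is the kind of check I can run on yogi after the failed refresh, to see whether an old snapshot is being left behind on the destination:

zfs list -t snapshot -r local/custom/default_test-refresh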

Okay, that definitely sounds like a bug.
Can you file an issue at https://github.com/lxc/incus/issues?

Given that we got it to happen with a completely new volume, it shouldn't be too hard to sort this one out :slight_smile:

Of course. I've created issue #413. Thanks for your help, @stgraber!