Using zfs "optimized volume transfer" between hosts, with shared base image

I’m trying to understand the “optimized volume transfer” with zfs and whether it supports the use case I want.

Here’s a quick demo of the issue. nuc1 and nuc2 are both Ubuntu 20.04 with incus 0.6-202403181632-ubuntu20.04 (Zabbly packages), and the default storage pool on each uses the zfs driver. These are independent (non-clustered) hosts.

On nuc1, I create and snapshot a container, and make two copies of it:

root@nuc1:~# incus init images:ubuntu/22.04 foobase
Creating foobase
root@nuc1:~# incus snapshot create foobase baseline
root@nuc1:~# time incus copy foobase foo1

real	0m10.368s
user	0m0.028s
sys	0m0.021s
root@nuc1:~# time incus copy foobase foo2

real	0m12.642s
user	0m0.025s
sys	0m0.021s

I expect foo1 and foo2 to be shallow copies sharing the same snapshot and underlying image; the fact that they use very little space seems to confirm this, and they do contain the same snapshot.

root@nuc1:~# zfs list zfs/lxd/containers/foobase
NAME                         USED  AVAIL     REFER  MOUNTPOINT
zfs/lxd/containers/foobase   208K  17.2G      293M  legacy
root@nuc1:~# zfs list zfs/lxd/containers/foo1
NAME                      USED  AVAIL     REFER  MOUNTPOINT
zfs/lxd/containers/foo1   208K  17.2G      293M  legacy
root@nuc1:~# zfs list zfs/lxd/containers/foo2
NAME                      USED  AVAIL     REFER  MOUNTPOINT
zfs/lxd/containers/foo2   208K  17.2G      293M  legacy

root@nuc1:~# incus snapshot list foobase
+----------+----------------------+------------+----------+
|   NAME   |       TAKEN AT       | EXPIRES AT | STATEFUL |
+----------+----------------------+------------+----------+
| baseline | 2024/03/21 11:19 GMT |            | NO       |
+----------+----------------------+------------+----------+
root@nuc1:~# incus snapshot list foo1
+----------+----------------------+------------+----------+
|   NAME   |       TAKEN AT       | EXPIRES AT | STATEFUL |
+----------+----------------------+------------+----------+
| baseline | 2024/03/21 11:19 GMT |            | NO       |
+----------+----------------------+------------+----------+
root@nuc1:~# incus snapshot list foo2
+----------+----------------------+------------+----------+
|   NAME   |       TAKEN AT       | EXPIRES AT | STATEFUL |
+----------+----------------------+------------+----------+
| baseline | 2024/03/21 11:19 GMT |            | NO       |
+----------+----------------------+------------+----------+

root@nuc1:~# zfs get origin zfs/lxd/containers/foobase
NAME                        PROPERTY  VALUE                                                                                     SOURCE
zfs/lxd/containers/foobase  origin    zfs/lxd/images/576b5965670cd19ffa17c34dc98dcc3320957423333620238a6fd78bb4d6d6f0@readonly  -
root@nuc1:~# zfs get origin zfs/lxd/containers/foo1
NAME                     PROPERTY  VALUE                                                                                     SOURCE
zfs/lxd/containers/foo1  origin    zfs/lxd/images/576b5965670cd19ffa17c34dc98dcc3320957423333620238a6fd78bb4d6d6f0@readonly  -

Now I want to be able to copy these to a remote host. I’m hoping that, because they have an origin and/or snapshot in common, only the differences would need to be transferred after the first copy.

root@nuc1:~# time incus copy foobase nuc2:

real	0m14.383s
user	0m0.153s
sys	0m0.057s
root@nuc1:~# time incus copy foo1 nuc2:

real	0m13.111s
user	0m0.138s
sys	0m0.060s
root@nuc1:~# time incus copy foo2 nuc2:

real	0m11.272s
user	0m0.165s
sys	0m0.025s

However, it appears that the containers were all copied in their entirety, rather than just the differences from the shared snapshot, since (a) the copy took a long time, and (b) they all use 293M of storage on the target:

root@nuc2:~# zfs list zfs/lxd/containers/foobase
NAME                         USED  AVAIL     REFER  MOUNTPOINT
zfs/lxd/containers/foobase   293M   135G      293M  legacy
root@nuc2:~# zfs list zfs/lxd/containers/foo1
NAME                      USED  AVAIL     REFER  MOUNTPOINT
zfs/lxd/containers/foo1   293M   135G      293M  legacy
root@nuc2:~# zfs list zfs/lxd/containers/foo2
NAME                      USED  AVAIL     REFER  MOUNTPOINT
zfs/lxd/containers/foo2   293M   134G      293M  legacy

root@nuc2:~# zfs get origin zfs/lxd/containers/foobase
NAME                        PROPERTY  VALUE   SOURCE
zfs/lxd/containers/foobase  origin    -       -
root@nuc2:~# zfs get origin zfs/lxd/containers/foo1
NAME                     PROPERTY  VALUE   SOURCE
zfs/lxd/containers/foo1  origin    -       -

The use case is as follows. I want to pre-build a bunch of containers, say A, B, C, D, all cloned from the same image, on host X. Then I want to transfer them to host Y (where I will be building a fresh VM image which incorporates these containers), and I want to retain the shared copy-on-write base image.

I wondered if I first needed to transfer the underlying image:

root@nuc1:~# incus image copy 576b5965670c nuc2:
Image copied successfully!

… but repeating the experiment, I got the same results.

Is there a way to achieve what I’m looking for? Or does the shared base image only work when doing local copies?

Thanks,

Brian.


EDIT: I discovered the --refresh flag. It works for incremental updates within the same instance (sketched after the output below), but it doesn’t seem to make any difference for clones with a shared snapshot or origin. After first deleting the instances from nuc2:

root@nuc1:~# incus copy --refresh foobase nuc2:
root@nuc1:~# time incus copy --refresh foo1 nuc2:

real	0m25.822s
user	0m0.183s
sys	0m0.058s
root@nuc2:~# zfs list | grep zfs/lxd/containers/foo
zfs/lxd/containers/foo1                                                                                        299M   134G      294M  legacy
zfs/lxd/containers/foobase                                                                                     299M   134G      294M  legacy
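
(For reference, the same-instance case that --refresh does handle looks roughly like this; the snapshot name is just an example, and the refresh should only need to send changes relative to the snapshots the two copies already have in common:)

incus copy foobase nuc2:                 # initial full copy
# ...make some changes inside foobase...
incus snapshot create foobase baseline2  # example snapshot name
incus copy --refresh foobase nuc2:       # should only transfer the delta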

EDIT 2: What if I pre-clone the target from the same image and then try to refresh it?

root@nuc2:~# incus delete foobase foo1
root@nuc2:~# incus init 576b5965670c foobase
Creating foobase
root@nuc2:~# incus init 576b5965670c foo1
Creating foo1
root@nuc2:~# 
root@nuc1:~# time incus copy --refresh foobase nuc2:
Error: Failed instance creation: Error transferring instance data: Failed migration on target: Failed creating instance on target: Failed receiving snapshot volume "foobase/baseline": Problem with zfs receive: ([exit status 1 write |1: broken pipe]) cannot receive new filesystem stream: destination 'zfs/lxd/containers/foobase' is a clone
must destroy it to overwrite it


real	0m0.654s
user	0m0.132s
sys	0m0.053s

Nope. :(

The optimized transfer on top of ZFS basically refers to using zfs send/receive, with deltas between snapshots, as opposed to the default rsync-based behavior.
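
In other words, instead of rsyncing the whole filesystem, the optimized path does roughly the equivalent of the following for each instance and its snapshots (purely illustrative; the dataset names are made up and the stream actually travels over the migration connection rather than ssh):

zfs send pool/containers/c@snap1 | ssh target zfs receive pool/containers/c            # first snapshot sent in full
zfs send -i @snap1 pool/containers/c@snap2 | ssh target zfs receive pool/containers/c  # only the delta between snapshots

plus a final delta covering the state since the last snapshot.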

The migration API doesn’t expose the availability of other instances on the target, even though that could be used to further reduce the amount of data transferred and the eventual dataset/volume size. The reason is that the migration API is designed to work in environments where the source and target servers don’t have to trust each other, and where the migration may be triggered by an unprivileged user who only sees a few instances.

Images are something we wouldn’t have the same security concerns about exposing (for public or cached images anyway), but they have a different problem. While the exact same image may be loaded on multiple servers, the downloaded image is a tarball or qcow2 which is extracted into a ZFS dataset or volume, so each server will have created that dataset or volume independently. Even though the content is basically guaranteed to be bit-for-bit identical, ZFS will not have assigned the datasets the same identity, and we therefore can’t use them as a common send/receive parent.
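
(One way to see this, assuming the image has been unpacked into the ZFS pool on both hosts, is to compare the identity ZFS assigned to the image snapshot on each side; incremental send/receive matches snapshots by this GUID, not by content:)

# run on each host; the values will differ
zfs get guid zfs/lxd/images/576b5965670cd19ffa17c34dc98dcc3320957423333620238a6fd78bb4d6d6f0@readonly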

Aha. I thought that when doing incus image copy between two of my own hosts, where the image storage on both sides is the same type, it could do a zfs send/receive to produce an exact clone. But I find that images are actually stored in the root filesystem, not in the default storage pool:

root@nuc3:~# ls -l /var/lib/incus/images/
total 220728
-rw-r--r-- 1 root root       840 Mar 21 14:31 c533845b5db1747674ee915cbb20df6eb47c953bb7caf1fec5b35ae9ccf98c18
-rw-r--r-- 1 root root 226017280 Mar 21 14:31 c533845b5db1747674ee915cbb20df6eb47c953bb7caf1fec5b35ae9ccf98c18.rootfs

…and presumably unpacked into the relevant storage pool when needed. This makes sense, as you might use different storage pools for different containers.

I guess that to do what I want, I’ll have to look at using zfs send/recv directly.

Thanks,

Brian.

In case anyone’s interested, I got this to work by copying the ZFS datasets and then running “incus admin recover” on the target to pick them up, although it was a bit fiddly.

The source is a physical host, the test target is an incus VM (of course!), and both have an incus storage pool of type "zfs" named "zfs" (albeit with different dataset paths).

(On my first attempt I had created a ZFS storage pool on the target with a different name from the source storage pool, and incus refused to recover from it.)

Create containers on the source:

incus init -s zfs images:ubuntu/22.04/cloud foobase
incus snapshot create foobase baseline
incus copy foobase foo1
incus copy foobase foo2
incus start foo1 foo2
echo "I AM FOO1" | incus file push - foo1/README
echo "I AM FOO2" | incus file push - foo2/README
incus stop foo1 foo2
incus snapshot create foo1 clean
incus snapshot create foo2 clean

Check fingerprint of the origin incus image:

$ incus image list images: ubuntu/22.04/cloud type=container architecture=x86_64
+-----------------------------+--------------+--------+-------------------------------------+--------------+-----------+-----------+----------------------+
|            ALIAS            | FINGERPRINT  | PUBLIC |             DESCRIPTION             | ARCHITECTURE |   TYPE    |   SIZE    |     UPLOAD DATE      |
+-----------------------------+--------------+--------+-------------------------------------+--------------+-----------+-----------+----------------------+
| ubuntu/jammy/cloud (3 more) | 3c377d12c765 | yes    | Ubuntu jammy amd64 (20240324_07:42) | x86_64       | CONTAINER | 137.26MiB | 2024/03/24 00:00 UTC |
+-----------------------------+--------------+--------+-------------------------------------+--------------+-----------+-----------+----------------------+

$ zfs get origin zfs0/lxd/containers/foobase
NAME                         PROPERTY  VALUE                                                                                      SOURCE
zfs0/lxd/containers/foobase  origin    zfs0/lxd/images/3c377d12c7652ac2ecedf53c852533c3bb7a17f22a0386fd4c5befde61572f91@readonly  -

On the source, copy the base image (with its readonly snapshot) and the containers to the target:

sudo zfs send -R zfs0/lxd/images/3c377d12c7652ac2ecedf53c852533c3bb7a17f22a0386fd4c5befde61572f91@readonly | incus exec testvm -- zfs receive zfs/images/3c377d12c7652ac2ecedf53c852533c3bb7a17f22a0386fd4c5befde61572f91
sudo zfs send -R -I zfs0/lxd/images/3c377d12c7652ac2ecedf53c852533c3bb7a17f22a0386fd4c5befde61572f91@readonly zfs0/lxd/containers/foobase@snapshot-baseline | incus exec testvm -- zfs receive zfs/containers/foobase
sudo zfs send -R -I zfs0/lxd/images/3c377d12c7652ac2ecedf53c852533c3bb7a17f22a0386fd4c5befde61572f91@readonly zfs0/lxd/containers/foo1@snapshot-clean | incus exec testvm -- zfs receive zfs/containers/foo1
sudo zfs send -R -I zfs0/lxd/images/3c377d12c7652ac2ecedf53c852533c3bb7a17f22a0386fd4c5befde61572f91@readonly zfs0/lxd/containers/foo2@snapshot-clean | incus exec testvm -- zfs receive zfs/containers/foo2

On the target, attempt to recover them:

root@testvm:~# incus admin recover
This server currently has the following storage pools:
 - dir (backend="dir", source="/var/lib/incus/storage-pools/dir")
 - zfs (backend="zfs", source="zfs")
Would you like to recover another storage pool? (yes/no) [default=no]:
The recovery process will be scanning the following storage pools:
 - EXISTING: "dir" (backend="dir", source="/var/lib/incus/storage-pools/dir")
 - EXISTING: "zfs" (backend="zfs", source="zfs")
Would you like to continue with scanning for lost volumes? (yes/no) [default=yes]:
Scanning for unknown volumes...
Error: Failed validation request: Failed checking volumes on pool "zfs": Instance "foo1" in project "default" has snapshot inconsistency: Snapshot count in backup config and storage device are different: Backup snapshots mismatch

That seemed odd, as they looked consistent to me. On the source:

$ incus snapshot list foo1
+----------+----------------------+------------+----------+
|   NAME   |       TAKEN AT       | EXPIRES AT | STATEFUL |
+----------+----------------------+------------+----------+
| baseline | 2024/03/24 12:24 UTC |            | NO       |
+----------+----------------------+------------+----------+
| clean    | 2024/03/24 12:25 UTC |            | NO       |
+----------+----------------------+------------+----------+

$ zfs list -t snap zfs0/lxd/containers/foo1
NAME                                         USED  AVAIL     REFER  MOUNTPOINT
zfs0/lxd/containers/foo1@snapshot-baseline  88.5K      -      288M  -
zfs0/lxd/containers/foo1@snapshot-clean     18.5K      -      289M  -

On the target:

root@testvm:~# zfs list -t snap zfs/containers/foo1
NAME                                    USED  AVAIL     REFER  MOUNTPOINT
zfs/containers/foo1@snapshot-baseline  91.5K      -      288M  -
zfs/containers/foo1@snapshot-clean       15K      -      289M  -

To help understand what was going on, I made a patch to incus and recompiled from source:

--- a/internal/server/storage/backend.go
+++ b/internal/server/storage/backend.go
@@ -6339,7 +6339,7 @@ func (b *backend) CheckInstanceBackupFileSnapshots(backupConf *backupConfig.Conf

        if len(backupConf.Snapshots) != len(driverSnapshots) {
                if !deleteMissing {
-                       return nil, fmt.Errorf("Snapshot count in backup config and storage device are different: %w", ErrBackupSnapshotsMismatch)
+                       return nil, fmt.Errorf("Snapshot count in backup config (%d) and storage device (%d) are different: %w", len(backupConf.Snapshots), len(driverSnapshots), ErrBackupSnapshotsMismatch)
                }
        }

And now I get a clearer error:

Error: Failed validation request: Failed checking volumes on pool "zfs": Instance "foo1" in project "default" has snapshot inconsistency: Snapshot count in backup config (1) and storage device (2) are different: Backup snapshots mismatch

Checking:

root@testvm:~# mount -r -t zfs zfs/containers/foo1 /mnt
root@testvm:~# ls /mnt
backup.yaml  metadata.yaml  rootfs  templates
root@testvm:~# less /mnt/backup.yaml
...
volume:
  config: {}
  description: ""
  name: foo1
  type: container
  used_by: []
  location: none
  content_type: filesystem
  project: default
  created_at: 2024-03-24T12:24:28.809434807Z
volume_snapshots:
- description: ""
  expires_at: 0001-01-01T00:00:00Z
  name: baseline
  config: {}
  content_type: filesystem
  created_at: 2024-03-24T12:24:23.588777068Z
root@testvm:~# umount /mnt

So indeed, there only seems to be one snapshot listed in “backup.yaml”. I guess what happened is that incus snapshot create takes the ZFS snapshot first and only then updates backup.yaml to record it, so the backup.yaml captured in the “clean” snapshot (which is all I replicated) still lists only “baseline”. That means I also need to replicate the filesystem changes made after the last snapshot.

OK, that’s doable:

$ sudo zfs send -i zfs0/lxd/containers/foobase@snapshot-baseline zfs0/lxd/containers/foobase | incus exec testvm -- zfs receive zfs/containers/foobase

$ sudo zfs send -i zfs0/lxd/containers/foo1@snapshot-clean zfs0/lxd/containers/foo1 | incus exec testvm -- zfs receive zfs/containers/foo1
cannot receive incremental stream: destination zfs/containers/foo1 has been modified
since most recent snapshot

(Probably because I mounted it, even read-only.) That’s easy to fix:

$ incus exec testvm -- zfs rollback zfs/containers/foo1@snapshot-clean
$ sudo zfs send -i zfs0/lxd/containers/foo1@snapshot-clean zfs0/lxd/containers/foo1 | incus exec testvm -- zfs receive zfs/containers/foo1
$ incus exec testvm -- zfs rollback zfs/containers/foo2@snapshot-clean
$ sudo zfs send -i zfs0/lxd/containers/foo2@snapshot-clean zfs0/lxd/containers/foo2 | incus exec testvm -- zfs receive zfs/containers/foo2

And now:

root@testvm:~# incus admin recover
This server currently has the following storage pools:
 - dir (backend="dir", source="/var/lib/incus/storage-pools/dir")
 - zfs (backend="zfs", source="zfs")
Would you like to recover another storage pool? (yes/no) [default=no]:
The recovery process will be scanning the following storage pools:
 - EXISTING: "dir" (backend="dir", source="/var/lib/incus/storage-pools/dir")
 - EXISTING: "zfs" (backend="zfs", source="zfs")
Would you like to continue with scanning for lost volumes? (yes/no) [default=yes]:
Scanning for unknown volumes...
The following unknown volumes have been found:
 - Container "foobase" on pool "zfs" in project "default" (includes 1 snapshots)
 - Container "foo1" on pool "zfs" in project "default" (includes 2 snapshots)
 - Container "foo2" on pool "zfs" in project "default" (includes 2 snapshots)
Would you like those to be recovered? (yes/no) [default=no]: yes
Starting recovery...
root@testvm:~#

Yay! The containers are indeed sharing the base image:

root@testvm:~# zfs list -r zfs/containers
NAME                     USED  AVAIL     REFER  MOUNTPOINT
zfs/containers          10.9M  18.6G       24K  legacy
zfs/containers/foo1     5.37M  18.6G      289M  legacy
zfs/containers/foo2     5.38M  18.6G      289M  legacy
zfs/containers/foobase   118K  18.6G      288M  legacy

and they work:

root@testvm:~# incus start foo1 foo2
root@testvm:~# incus exec foo1 -- cat /README
THIS IS FOO1
root@testvm:~# incus exec foo2 -- cat /README
THIS IS FOO2

It’s a little bit manual, but I think it’s good enough for what I need.
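
For anyone wanting to script it, the whole procedure boils down to something like this sketch (same dataset paths, "testvm" target and image fingerprint as above; it assumes the containers are stopped and don’t yet exist on the target):

IMG=3c377d12c7652ac2ecedf53c852533c3bb7a17f22a0386fd4c5befde61572f91

# the shared base image first
sudo zfs send -R "zfs0/lxd/images/${IMG}@readonly" | incus exec testvm -- zfs receive "zfs/images/${IMG}"

for c in foobase foo1 foo2; do
  # newest snapshot of this container (@snapshot-baseline or @snapshot-clean here)
  last=$(zfs list -H -t snap -o name -s creation "zfs0/lxd/containers/$c" | tail -n1)
  # the container itself as a clone of the image, with all of its snapshots
  sudo zfs send -R -I "zfs0/lxd/images/${IMG}@readonly" "$last" | incus exec testvm -- zfs receive "zfs/containers/$c"
  # and the changes made after the last snapshot (the up-to-date backup.yaml lives there)
  sudo zfs send -i "$last" "zfs0/lxd/containers/$c" | incus exec testvm -- zfs receive "zfs/containers/$c"
done

…followed by incus admin recover on the target, as above.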

Cheers,

Brian.
