Benchmarking IO on ZFS backed instances

I have been trying to compare IO performance between VMs, containers (both backed by ZFS) and the host. For that I ran the following script:

COUNT="1000k"

dd if=/dev/urandom of=/tmp/input bs=1k count="$COUNT"
sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/input of=/tmp/test bs=1k count="$COUNT"
sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/test of=/dev/null bs=1k count="$COUNT"
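As the container output further down shows, writing to /proc/sys/vm/drop_caches is rejected inside the unprivileged container, so dd's own flags may be a better way to limit page-cache effects. A sketch only; whether O_DIRECT has any effect on a ZFS dataset depends on the OpenZFS version, so iflag=direct may be rejected or silently buffered:

# Report the write rate only after the data has actually been flushed
dd if=/dev/urandom of=/tmp/input bs=1k count="$COUNT" conv=fdatasync
# Ask the kernel to bypass the page cache when reading the file back
dd if=/tmp/test of=/dev/null bs=1k count="$COUNT" iflag=direct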

I have noticed that while IO performance on the host is normal, inside a container both the copy and the read-back top out at around 1.5 MB/s, as opposed to roughly 105 MB/s and 304 MB/s respectively on the host.

What would be a sensible way to benchmark ZFS-backed instances so that the ARC does not inflate the numbers, but without skipping the cache entirely and ending up with abysmal performance?
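One thing I am considering, assuming the usual OpenZFS kstat interface is available on the host, is to check how large the ARC can grow and size the test data well beyond that:

# Current and maximum ARC size, in bytes, read on the host
awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats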

Following is the output of lxc config show for the container used for testing:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20201210)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20201210"
  image.type: squashfs
  image.version: "20.04"
  limits.cpu: "4"
  limits.memory: 4GB
  limits.memory.enforce: hard
  volatile.base_image: e0c3495ffd489748aa5151628fa56619e6143958f041223cb4970731ef939cb6
  volatile.eth0.host_name: vetha7125db5
  volatile.eth0.hwaddr: 00:16:3e:40:2b:02
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 26ace09d-03dc-44ab-b9d1-cf1cf2b3fbe2
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: temp
    size: 50GB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

After further inspection it seems that the performance was slow because I had set primarycache and secondarycache to none on the ZFS pool for testing purposes. Curiously, however, VM IO performance is considerably better than the container's.
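For completeness, the cache properties can be checked and reset with the standard zfs commands (temp is the pool name from the config above; the defaults are shown here, adjust as needed):

zfs get primarycache,secondarycache temp
# Restore the defaults so reads can be served from the ARC/L2ARC again
zfs set primarycache=all temp
zfs set secondarycache=all temp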

VM output:

root@vm-test:~# COUNT="1000k"
root@vm-test:~# dd if=/dev/urandom of=/tmp/input bs=1k count="$COUNT"
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 10.0545 s, 104 MB/s
root@vm-test:~# sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/input of=/tmp/test bs=1k count="$COUNT"
3
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.62212 s, 138 MB/s
root@vm-test:~# sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/test of=/dev/null bs=1k count="$COUNT"
3
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 4.72374 s, 222 MB/s

Container output:

root@container-test:~# COUNT="1000k"
root@container-test:~# dd if=/dev/urandom of=/tmp/input bs=1k count="$COUNT"
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 18.449 s, 56.8 MB/s
root@container-test:~# sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/input of=/tmp/test bs=1k count="$COUNT"
tee: /proc/sys/vm/drop_caches: Permission denied
3
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 18.4291 s, 56.9 MB/s
root@container-test:~# sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/test of=/dev/null bs=1k count="$COUNT"
tee: /proc/sys/vm/drop_caches: Permission denied
3
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 6.07753 s, 173 MB/s

I was actually expecting containers to have better IO performance due to less virtualization overhead. Can anyone shed some light on this?

I wonder if the block size may be part of the issue.

What happens if you go with 4k, which tends to align better with physical block boundaries on disks, or even larger sizes like 4M?

Your VM on ZFS is presumably running ext4 or something similar inside the guest and then going through a separate block layer, so it may be batching the actual I/O to the underlying ZFS in a way that is more performant?
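One quick way to check that assumption (a sketch; exact device and dataset names will differ on your setup):

# Inside the VM and the container: what filesystem and device does /tmp sit on?
df -T /tmp
# Inside the VM: the virtual block devices the guest actually sees
lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT
# On the host: the ZFS dataset backing the container (instance name as a filter)
zfs list | grep container-test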

That is an interesting test.

I have run both tests with a 4k block size.

Container:

root@container-test:~# COUNT="1000k"
root@container-test:~# BS="4k"
root@container-test:~# dd if=/dev/urandom of=/tmp/input bs="$BS" count="$COUNT"
1024000+0 records in
1024000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 36.7805 s, 114 MB/s
root@container-test:~# sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/input of=/tmp/test bs="$BS" count="$COUNT"
tee: /proc/sys/vm/drop_caches: Permission denied
3
1024000+0 records in
1024000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 28.0055 s, 150 MB/s
root@container-test:~# sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/test of=/dev/null bs="$BS" count="$COUNT"
tee: /proc/sys/vm/drop_caches: Permission denied
3
1024000+0 records in
1024000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 9.12889 s, 459 MB/s

VM:

root@vm-test:~# COUNT="1000k"
root@vm-test:~# BS="4k"
root@vm-test:~# dd if=/dev/urandom of=/tmp/input bs="$BS" count="$COUNT"
1024000+0 records in
1024000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 58.0146 s, 72.3 MB/s
root@vm-test:~# sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/input of=/tmp/test bs="$BS" count="$COUNT"
3
1024000+0 records in
1024000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 56.7583 s, 73.9 MB/s
root@vm-test:~# sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/test of=/dev/null bs="$BS" count="$COUNT"
3
1024000+0 records in
1024000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 21.1228 s, 199 MB/s

With a 4k block size the VM's performance is worse than the container's; however, the container's numbers seem to be inflated by the ZFS ARC, since they come out higher than the host machine's.
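One way to confirm the ARC's involvement (a sketch; assumes the OpenZFS userland tools are installed on the host) is to watch ARC activity on the host while the container benchmark runs:

# Sample ARC hit/miss statistics once per second during the dd run
arcstat 1
# Or read the raw counters directly
awk '$1 == "hits" || $1 == "misses" || $1 == "size" {print $1, $3}' /proc/spl/kstat/zfs/arcstats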

For reference, the following is the host's IO performance:

root@lxd01:~# COUNT="1000k"
root@lxd01:~# BS="4k"
root@lxd01:~# dd if=/dev/urandom of=/tmp/input bs="$BS" count="$COUNT"
1024000+0 records in
1024000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 29.8607 s, 140 MB/s
root@lxd01:~# sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/input of=/tmp/test bs="$BS" count="$COUNT"
3
1024000+0 records in
1024000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 46.6395 s, 89.9 MB/s
root@lxd01:~# sync; echo 3 | tee /proc/sys/vm/drop_caches ;dd if=/tmp/test of=/dev/null bs="$BS" count="$COUNT"
3
1024000+0 records in
1024000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 16.6459 s, 252 MB/s
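
Going forward, an alternative to dd that gives more control over caching behaviour would be fio (not installed by default). A minimal sketch with parameters that are only a starting point; the file name is arbitrary, and --direct=1 may be rejected or not fully honoured depending on the OpenZFS version:

# Sequential write, 4k blocks, 4 GB file, flushed before the result is reported
fio --name=seqwrite --filename=/tmp/fio-testfile --rw=write --bs=4k --size=4G \
    --ioengine=libaio --iodepth=16 --direct=1 --end_fsync=1
# Sequential read of the same file
fio --name=seqread --filename=/tmp/fio-testfile --rw=read --bs=4k --size=4G \
    --ioengine=libaio --iodepth=16 --direct=1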