SQL on ZFS - Performance

michacassola · October 27, 2020, 6:11pm

Does anybody know if MySQL or generally random read-write is faster on ZFS with or without compression. Any other not easy to find performance tweaks for ZFS would also be much appreciated.

I don’t care too much about how much data is stored, performance as IOps for MySQL is the most important thing. And to be able to do snapshots/backups without container downtime.

Thanks in advance!

Edit:
I found this guide extremely useful: MariaDB / MySQL on ZFS - Shattered Silicon

stgraber · October 27, 2020, 6:41pm

Would be interesting to run some benchmarks on your system on dir vs zfs.

One thing I’d strongly recommend is that you provide a full disk or partition to back your zpool and not use a loop drive in production as that’s considerably slower.

I wouldn’t expect the default compression to slow I/O down at all, if anything, it can improve them by reducing the written amount a bit and sometimes better aligning to block size. The few tweaks I’ve seen related to this are around maybe disabling atime on the dataset or tweaking the recordsize. Those are things you can do live so it should be easy to see if they make any difference.

michacassola · October 27, 2020, 6:54pm

Did run some benchmarks with sysbench, the 30G are due to the RAM being 30G and I read this hinders RAM caching and gives better more realistic MySQL-like results.
The disk is a KVM virtualized QEMU disk (Proxmox). Supposed to be SSD but they are doing some funky stuff… Good enough though for my usage until now.

sysbench --test=fileio --file-total-size=30G prepare
sysbench --test=fileio --file-total-size=30G --file-test-mode=rndrw --max-time=300 --max-requests=0 run

EXT4 - 1

File operations:
    reads/s:                      1461.28
    writes/s:                     974.19
    fsyncs/s:                     3117.69
 
Throughput:
    read, MiB/s:                  22.83
    written, MiB/s:               15.22
 
General statistics:
    total time:                          300.0232s
    total number of events:              1665957
 
Latency (ms):
         min:                                    0.00
         avg:                                    0.18
         max:                                  139.73
         95th percentile:                        0.77
         sum:                               298476.82
 
Threads fairness:
    events (avg/stddev):           1665957.0000/0.00
    execution time (avg/stddev):   298.4768/0.00

ZFS - 1

File operations:
    reads/s:                      1193.25
    writes/s:                     795.50
    fsyncs/s:                     2545.70
 
Throughput:
    read, MiB/s:                  18.64
    written, MiB/s:               12.43
 
General statistics:
    total time:                          300.0861s
    total number of events:              1360603
 
Latency (ms):
         min:                                    0.00
         avg:                                    0.22
         max:                                  172.65
         95th percentile:                        0.97
         sum:                               298665.14
 
Threads fairness:
    events (avg/stddev):           1360603.0000/0.00
    execution time (avg/stddev):   298.6651/0.00

The DIR tests after that are pretty similar, also on another partition.

ZFS - 2

The 2nd ZFS test on another partition though was very bad. I have no idea why.

File operations:
    reads/s:                      598.61
    writes/s:                     399.07
    fsyncs/s:                     1277.08

Throughput:
    read, MiB/s:                  9.35
    written, MiB/s:               6.24

General statistics:
    total time:                          300.0961s
    total number of events:              682519

Latency (ms):
         min:                                    0.00
         avg:                                    0.44
         max:                                  133.38
         95th percentile:                        2.91
         sum:                               299088.10

Threads fairness:
    events (avg/stddev):           682519.0000/0.00
    execution time (avg/stddev):   299.0881/0.00

I am setting up another server for testing which is smaller, any testing you would like me to run?

stgraber · October 27, 2020, 8:48pm

You can try to play with:

zfs set compress=no lxd/containers/NAME
zfs set atime=off lxd/containers/NAME
zfs set recordsize=128k lxd/containers/NAME

Ideally do a test in between each to see which one is helping.

michacassola · October 27, 2020, 10:24pm

Made empty unformatted /dev/sda3 and let LXD init make the default zfs storage pool there.
After installing zfs-utils and fuse zpool list and zfs list both say there are no pools…
So how would those commands work?
Where are the container files, how can I mount them?

Also, when I install those packages through apt I end up not being able to start LXD as it cannot read the storage pool.
zpool import -f does not allow me to get it also
zpool import -f default tells me it was formatted with a newer version of ZFS…

Weird things happening, the fuse package seems to be outdated or so… Troubles, troubles…

stgraber · October 27, 2020, 11:14pm

You want zfsutils-linux, you absolutely do not want zfs-fuse, that package is an absolute disaster that should not be used anywhere

michacassola · October 27, 2020, 11:16pm

Thanks, then apt shouldn’t suggest it.

First tests confirm that compression improves performance by 250-300% for me.
Will share final findings when I am done.

michacassola · October 28, 2020, 7:03pm

Testing atime=off/on shows nearly no difference, but theoretically atime=off should be better for several containers each running their own MySQL server.
Can I turn it off globally by running zfs set atime=off default?

I found a cool article: https://martin.heiland.io/2018/02/23/zfs-tuning/#File-systems
I don’t understand what he means there, ZFS is a filesystem itself isn’t it? And the virtualized block device of a KVM VM does not have a filesystem?!

stgraber · October 28, 2020, 7:12pm

Yeah, you can set atime=off on the entire thing.

That article appears to have virtual machines where the VM drive itself is stored on ZFS, so the filesystem inside the VM then matters.

That however doesn’t apply to you since you’re dealing with containers.

michacassola · October 28, 2020, 7:17pm

I have the problem that I have a QEMU disk from Proxmox KVM, which means its a qcow2 disk most probably as it supports snapshots and is thinly provisioned. And CoW on CoW is not a good idea according to that article and it also makes sense.

Asked my ISP if I can have a different disk type or at least virtio-blk (but you said its a KVM on top of ZFS so does not make sense for my VM I guess), but I am pretty sure they are not open to giving me some custom virtual disk…

Edit: It is not a qcow2 disk, they didn’t tell me what it is though.

zfs set recordsize=128 is already the default value for my default pool.
What would be disadvantages to have it smaller or bigger?

And what about those ARC kernel settings from that article?
Will the ARC max value that is set on the host be applied to each container individually or for the whole system? Can it be regulated so that each container gets their own ARC memory allocation?

stgraber · October 28, 2020, 7:30pm

ARC is global, shared by all ZFS pools on your system. You’ll often want to set a maximum size just to avoid freaking out monitoring systems.

ZFS ARC doesn’t show as cached memory but instead as used, even though, just like cache, it will get freeed as memory pressure increases on the system. The result is that monitoring systems will think you’re running out of memory when it’s in fact just ZFS doing a bunch of caching and applications still very much being able to allocate memory if they need it.

michacassola · October 28, 2020, 7:39pm

Can the performance benefit from setting a larger value for ZFS ARC?
How much of a servers memory should be set?

What would be the problem/penalty with recordsize being other than 128?

stgraber · October 28, 2020, 7:44pm

Greater ARC will usually come with improved performance, though note that this is read caching, so you only ever need it to be as large as your “hot set”. On file servers serving TBs of data from slow drives without SSD caching, you may want tens of GBs of ARC.

But for most other cases where I/O to the underlying disk aren’t bad to start with and you don’t have GB/s of writes, having an ARC of just a few GBs is usually fine.

My understanding of recordsize is that it works in much the same way as a block size, so if you set it to a very high value, small writes will use at least one record and so potentially waste space. Using too small a value leads to needless work and fragmentation. The ideal value for something like a database would be to line it up with the size of writes performed by the database, so that when it writes something to disk, it fills exactly one record or more record.

michacassola · October 28, 2020, 7:51pm

Can I put ZIL on RAM?

stgraber · October 28, 2020, 7:53pm

It’s technically possible but extremely inadvisable as it would almost certainly lead to dataloss in the event of a system crash or unclean shutdown.

A slightly less damaging equivalent would be to play with sync=disabled on the dataset. This will effectively ignore requests from the workload to sync writes, instead letting ZFS sync to disk as convenient. Throughput should be significantly better but at the risk of losing recent writes in the event of a system crash. This however should not lead to ZFS corruption.

michacassola · October 28, 2020, 8:37pm

I read regarding Mysql on ZFS that it is good to set the recordsize to 16 like the page size in InnoDB.
This way Mysql does not need to read as much as when you set it to a larger recordsize.
Is there any disadvantage to set it to 16 for other things?

sync=disabled gave me the larges performance gain yet, from 1000 IOps (reads/s) to nearly 5000. And according to some articles I read I can only loose 5 seconds which is worth the risk for that huge speed gain.

stgraber · October 28, 2020, 8:55pm

As I said earlier, the main downside of a low recordsize would be fragmentation for larger files and so more metadata needing to be stored by ZFS but given this would only apply to newly stored data and the vast majority of that data will be the database for which the recordsize is going to be ideal, this should be perfectly fine. You just wouldn’t want to do that on a file server.

michacassola · October 28, 2020, 9:01pm

Thanks! Sorry must have oversawn that fragmentation bit.

So if I use one container to install a complete LEMP stack plus Redis on 16 recordsize, would not be a good thing?
So I should make an extra container for mysql and put that container on 16 recordsize and leave the other stuff on 128 in a different one.

stgraber · October 28, 2020, 9:03pm

Yeah or you could even go one step further and keep all your containers on the default settings but then do:

lxc storage volume create default mysql
lxc config device add CONTAINER mysql disk pool=default source=mysql path=/var/lib/mysql

That will get you a separate dataset just for mysql which you can then tweak as you wish in ZFS.

stgraber · October 28, 2020, 9:03pm

Eventually you’ll even be able to do the recordsize part of this cleanly on custom volumes:
https://github.com/lxc/lxd/issues/7901