SQL on ZFS - Performance

Thanks! Sorry, I must have overlooked that fragmentation bit.

So if I use one container to install a complete LEMP stack plus Redis on a 16K recordsize, that would not be a good thing?
So I should make an extra container for MySQL, put that container on a 16K recordsize, and leave the other stuff on 128K in a different one.

Yeah or you could even go one step further and keep all your containers on the default settings but then do:

  • lxc storage volume create default mysql
  • lxc config device add CONTAINER mysql disk pool=default source=mysql path=/var/lib/mysql

That will get you a separate dataset just for mysql which you can then tweak as you wish in ZFS.
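
For example, the recordsize tweak on that dataset could then look something like this (a minimal sketch, assuming the pool's ZFS dataset is also called "default" and LXD's pool/custom/project_volume naming for custom volumes):

zfs set recordsize=16k default/custom/default_mysql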

2 Likes

Eventually you’ll even be able to do the recordsize part of this cleanly on custom volumes.

1 Like

Wow! I think that solves the problem I had in the back of my mind about putting both containers under the same resource limits (cgroups): I can leave everything in the container as is and only the MySQL data gets an optimized backend/volume! Super cool! :slight_smile:

I own a small hosting company specialized in shared hosting for the Bitrix/Bitrix24 CMS/CRM. Bitrix is known to be badly optimized and uses MySQL heavily.

We did tests on real load (on modern server hardware with NVMe) and our results were pretty much as expected: MySQL on ZFS, with proper optimization, runs about 3-5% slower than on ext4. But ZFS has some benefits for our workloads and we use it a lot.

The main thing you want to configure is recordsize: it must be 16K for the MySQL data folder. Otherwise you will suffer from write amplification.
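
For example, to confirm the page size and match the recordsize to it (a quick sketch; the dataset name is just a placeholder):

mysql -e "SHOW VARIABLES LIKE 'innodb_page_size';"   # usually 16384 (16K)
zfs set recordsize=16k tank/mysql                    # placeholder dataset name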

Also, atime should always be off in any real-life scenario.

If you expect big benefits from compression, you do not have to switch to ZFS for that; you can use compression in MySQL, which has been supported for years (since 5.5, I suppose). Google “mysql row_format=compressed” for details.
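
A minimal sketch of what that looks like (hypothetical database/table names; needs innodb_file_per_table, which is the default on modern versions):

mysql -e "CREATE TABLE shop.archive (id BIGINT PRIMARY KEY, payload TEXT) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;"
mysql -e "ALTER TABLE shop.orders ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;"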

Per-column compression is also possible since mysql 8.0 as far as I remember.

2 Likes

Hi @Nick_Knutov, I am also building a small hosting company, specialized mostly in WordPress.
I mainly care about website speed, but it has to run on the backend I have, since it gives good resources for good money and keeps my business model sustainable.
I leave compression on in ZFS as it improves speed on my VPS. :slight_smile:

@stgraber so you are saying that the pool will not be the same (default)?
What am I doing :smiley: let me just fiddle some more. :slight_smile:

Regarding storage quotas, how would I get customer-container and customer-mysql-volume to count against the same quota?

Ah yeah, that’s the catch with storage volumes: since they’re not tied to an instance and can be shared between multiple instances, they have their own quotas.

I guess you could give a pretty small quota for the instance and then give a more generous quota for the mysql volume.
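
For example (a sketch with placeholder sizes; "override" is needed if the root disk comes from a profile):

lxc config device override CONTAINER root size=10GB
lxc storage volume set default mysql size 50GB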

Also, you’ll almost certainly want to enable refquota on the LXD pool so that the amount that’s considered for the quota is the entire usage rather than just the copy-on-write delta from the base image.

1 Like

Thanks!

So creating the pool inside the container pool is probably not possible?

default/containers/my-container/
default/containers/my-container/mysql

Does LXD apply the quota with refquota if you set it on a container's root with an lxc command?

LXD normally sets quota and queries used; you can change that behavior by setting volume.zfs.use_refquota to true on your pool.

Once done, LXD will set refquota on all new instances or instance resizes and will report the usage based on referenced instead of used.
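
On a pool named "default" that would be:

lxc storage set default volume.zfs.use_refquota true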

1 Like
  1. Will the following conflict with anything LXD does?
  2. If I copy a hacked setup like that to another server for backup, will the ZFS filesystems come with it in their nested order?

Here are the commands:

zfs create -o mountpoint=none default/containers/CONTAINER/mysql
zfs set recordsize=16k default/containers/CONTAINER/mysql
lxc storage create mysql-pool zfs source=default/containers/CONTAINER/mysql
lxc storage volume create mysql-pool mysql
lxc config device add CONTAINER mysql disk pool=mysql-pool source=mysql path=/var/lib/mysql

It would be cool to have a folder like this in the container mapped to a sub-ZFS-filesystem that lives inside the same container ZFS filesystem, so we could set different ZFS properties on it, like MySQL needs. Is that what the future holds, what you referenced above?

Suggestion: turn all of what I have done above (minus LXD creating an entire forest of sub-filesystems) into a single, easy command (provided the ZFS storage quota also counts sub-filesystems towards the total):

lxc config device add CONTAINER DEVICE-NAME disk --subfilesystem path=/var/lib/mysql

Of course I could be way off here, but since databases are so widely used, and to keep things simple when you want a container plus its volumes to take up only so much space in total, I do think this kind of thing could be really good.

It also comes to mind that a database page-size option which automatically sets the matching ZFS recordsize would be useful:

lxc config device add CONTAINER DEVICE-NAME disk --subfilesystem --recordsize=16k path=/var/lib/mysql
# OR
lxc config device add CONTAINER DEVICE-NAME disk --subfilesystem --database path=/var/lib/mysql

Ok, something seems to be going badly wrong with the hack there.
Installing MySQL to test didn’t work, as the server wouldn’t start. Restarting did not work either; the instance did not come back up due to some storage issue.

So the new way is:

  1. Create a storage pool for the container with one quota and recordsize=128k
  2. Put the container on there
  3. Create the custom volume on the same container pool and set recordsize=16k on it
  4. Mount the volume at the MySQL folder as shown

I think that should work.
Now the question remains:

  • Will that pool and its settings be carried over to the backup server with the same ZFS setup when using copy?
  • How to best back this up to another LXD server?

I assume you can add an external partition/folder to the container, but it must be in a different path/namespace.

The sad truth is that LXD is not designed for things like shared hosting.

You will also have problems with user quotas with LXD on ZFS.

Stéphane already showed how to put a different filesystem (“folder”) into the mysql path of a container. Thanks for that by the way @stgraber! Would have taken me a lot of time to get there by myself!

LXD lets you make containers. They are designed to be contained and that’s what I use them for. Kinsta also uses them, so simply and bluntly put: yes, you can use Linux containers for hosting, which contradicts your statement. They are great for building hosting as they have less overhead than a fully virtualized KVM VPS. And if you run them on ZFS, with a bit of tweaking you can even get a performance benefit over ext4 for MySQL.
What you might mean is: overselling is bad. Yes it is, and I don’t do that at all.

Also, your statement about user quotas is not correct; it is quite easy to set an effective quota on a ZFS filesystem: zfs set quota=336g zfs/container-pool
This is how my container creation looks as of now:

lxc storage create container-pool zfs source=zfs/container-pool
zfs set quota=336g zfs/container-pool  ### absolute total for the whole pool
lxc storage volume create container-pool mysql-volume
zfs set recordsize=16k zfs/container-pool/custom/default_mysql-volume
lxc init ubuntu:20.04 container-name -s container-pool
lxc config device add container-name mysql-disk disk pool=container-pool source=mysql-volume path=/var/lib/mysql
lxc start container-name
1 Like

Hey @michacassola how did this setup end working for you?
Are you still using it?
Did you do any improvements and/or fine tuning to it?

1 Like

Hey @Yosu_Cadilla, long time no read. :slight_smile:

Yes I am still using it.
I am also hardcoding the IPs by overriding the eth0 device and pinning the volatile hardware address to it. This way the container will always get the same IPs from DHCP and SLAAC (as far as I know).
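
Roughly like this (a sketch; it assumes the NIC is the profile-provided eth0 and the container has started once so the volatile key exists):

lxc config device override container-name eth0 hwaddr=$(lxc config get container-name volatile.eth0.hwaddr)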

Any specific use cases in mind?


Yes, just like you and Nick, I do hosting; in my case, hosting of a PHP/MySQL app.
I’ve been using separate containers for the APP (Apache + PHP) and for MySQL for a long time. I’m now trying to optimize things a bit and am learning about ZFS optimization for MySQL.

I was thinking of adding parameters to the two profiles, App and DB, to be able to optimize ZFS separately for each type of load, and maybe even add a third profile for the backup server, which deals with sequential loads most of the time.

I was always wondering about putting the DB in a separate container, but ended up putting everything in one, as I use the containers for chopping up my servers for reselling as LXD VPSes.

Did you ever do any testing regarding possible performance gains from separating the app and the DB like that?

Just thinking about it, I cannot see where the performance gain would come from.
I use native binaries and Unix sockets wherever I can for maximum performance. Unix sockets should theoretically be faster as they do not have the overhead of TCP data integrity checks. Of course I do the math so I don’t give away more RAM than the container/machine has to offer.

Anyways, here are all the settings for the mysql volume:

lxc storage volume create pool mysql-vol
zfs set recordsize=16k zfs/pool/custom/default_mysql-vol
zfs set logbias=throughput zfs/pool/custom/default_mysql-vol
zfs set primarycache=metadata zfs/pool/custom/default_mysql-vol
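
And to double-check the volume afterwards (same dataset path as above):

zfs get recordsize,logbias,primarycache,atime zfs/pool/custom/default_mysql-vol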

And here are the general settings for the whole pool:

zpool create -o ashift=12 -o autotrim=on -m none -O compression=on zfs /dev/sda3
zfs set atime=off zfs
zfs set sync=disabled zfs # Highest impact, but risks losing the most recent writes on power loss or crash
zfs set recordsize=128k zfs
zfs set xattr=sa zfs
zfs set compression=lz4 zfs

P.S.: I am working on a hosting script, give it a look if you read this :smiley:
https://github.com/HOSTYON/qycli

Thank you for sharing your settings @michacassola

I don’t think there’s any performance to be gained by using a separated container for the DB.
As I see it, the main benefit is that you can control and tune the resources given to the DB and the APP separately. You can also define LXD profiles and apply them to containers, which could make assigning proper values for the DB and the APP simpler by including the different datasets directly in the profile; the extra complexity of managing two containers would then be compensated by a simpler setup. Also, you probably do not need to back up your APP containers as often as your DB containers?
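
Something like this is what I have in mind for the DB profile (just a sketch; all names are placeholders and the custom volume is assumed to already exist):

lxc profile create db
lxc profile device add db root disk pool=container-pool path=/
lxc profile device add db mysql-disk disk pool=container-pool source=mysql-vol path=/var/lib/mysql
lxc profile add CONTAINER db
# note: a custom volume attached via a profile is shared by every container using that profile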

Back to MySQL: it can benefit from two different recordsizes; theory says 16K for the InnoDB data files and 128K for the logs (rough sketch after the questions below). Could the APP containers use 128K too?
—> Is there a better size than 128K for an Apache+FPM/PHP server?
—> Ideal recordsize for memcached or redis?
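
Something like this is the split I mean (a rough sketch; the names are placeholders, and the redo logs would need innodb_log_group_home_dir in my.cnf pointed at the second volume):

lxc storage volume create container-pool innodb-data
lxc storage volume create container-pool innodb-log
zfs set recordsize=16k zfs/container-pool/custom/default_innodb-data
zfs set recordsize=128k zfs/container-pool/custom/default_innodb-log
lxc config device add CONTAINER data disk pool=container-pool source=innodb-data path=/var/lib/mysql
lxc config device add CONTAINER log disk pool=container-pool source=innodb-log path=/var/lib/mysql-log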

Then, because I will have to transfer snapshots to a remote backup server, I could be using a 1M recordsize dataset for the entire backup server…
Hence, my new worry is, data/file transfers between different recordsizes. I am uncertain about how transferring a file from a 16K dataset to a 1M dataset will behave (or the other way around).
The only “information” I have about this is anecdotal, from a post I found; it said that performance improves noticeably when you properly adjust the recordsize for the workload, but that moving files between datasets with different recordsizes drastically increases CPU load and is also noticeably slower, especially from 16K to 1M and vice versa.
Since I don’t know what “noticeably” or “drastically” exactly mean, and because this person was comparing desktop speeds on BSD and not server loads on Ubuntu, I don’t really know anything, but it was enough to make me further investigate (and worry about it).

The backup server is a 4x6G RAIDZ2 located in the same datacenter but in a different availability zone.
This is the one I am considering setting a unique recordsize=1M for the entire pool.

—> Since the data will be transferred over a network, do I even need to care about performance on the backup server’s pool? (I will never need a speed greater than 1 Gb/s, as that is the max speed of the NIC.)

—> Since disk/filesystem performance (transfer speed) on the backup server is not a critical priority, should I just optimize for the lowest possible load on the origin and hence use the same recordsizes as on the parent containers being backed up? That way I would reduce any possible friction and avoid adding load on the live servers when backing up the containers by sending snapshots to the backup server.

—> Am I obsessing about something measurable in milliseconds or even nanoseconds that will have no impact or hardly any noticeable impact on my modest setup?

That looks very nice!
It inspired me to start working on a set of bash scripts to manage backups (for starters).
For now I have just a few scripts that I needed badly, but I will try to write some code every week.
I was thinking of using Python to wrap it all and present it either as a web UI or as an API. Got plans for something of the sort?

Regarding recordsize:

  • 128K seems to have been chosen as a good general-purpose balance. I think this is a good read: https://jrs-s.net/2019/04/03/on-zfs-recordsize/
  • Memcached is irrelevant regarding recordsize as its data is stored in memory. Redis does dump its in-memory data to disk, but I think the default recordsize should be fine for the dump files.

If you don’t put your backups in a large compressed archive file, I would set up the backup server with the same recordsizes. You could use it as a fallback that way too, if you have LXD on it.
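
For the LXD-on-the-backup-server case, that could be as simple as something like this (assuming a remote called "backup" was added with lxc remote add):

lxc copy container-name backup:container-name --refresh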