Recommended ZFS structure? Sharing pool with root vs. separate pool?

TL;DR

  1. Can I have my root file system on a ZFS pool and also have Incus use a dataset within that same pool as its storage device? Even if I can safely do this, is there some other reason it’s a bad idea?
  2. Is there a specific ZFS recordsize that is preferred for Incus VMs?

Hello, everyone. I found this post helpful regarding ZFS setup, but I had some additional follow-on questions. I have two SSDs on my system that I plan to utilize: one 500GB for the system OS + Incus to run on and another 4TB for bulk storage. I plan to use ZFS on all of the physical devices. Right now, I’m mainly just concerned with the system/Incus disk. I am going to attempt to run ZFS as my root filesystem for NixOS. My question is, should I set up Incus on a dataset within that pool or leave another partition/block device entirely for Incus to set up on? I can see advantages to both sides, as laid out by the documentation. In particular, it states:

Sharing the file system with the host is usually the most space-efficient way to run Incus. In most cases, it is also the easiest to manage.

This option is supported for… the zfs driver (if the host is ZFS and you point Incus to a dedicated dataset on your zpool).

This made sense to me, and seemed like the easiest way to separate things out. My thought was to have a single zpool with several datasets and just give Incus its own dataset. Then you can easily allow the datasets to expand as needed. However, in the Incus ZFS driver documentation section, it states:

Incus assumes that it has full control over the ZFS pool and dataset. Therefore, you should never maintain any datasets or file system entities that are not owned by Incus in a ZFS pool or dataset, because Incus might delete them.

This confused me because it seems to contradict the previous section, making it seem as though Incus should not share the file system with the host, because it might delete other entities from the root filesystem that it doesn’t own. Is this just saying that if you configure Incus at the zpool level, it assumes ownership of the entire pool, so you shouldn’t maintain any other datasets/filesystems inside that pool? But if you configure Incus at the dataset level, it assumes ownership only of that dataset, so you shouldn’t use that dataset for other things, while the rest of the zpool remains free for other uses? If so, then this makes sense, though I think it could be worded more clearly in the docs.
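In other words, I’m imagining a layout roughly like this (all names here are just placeholders, not anything Incus prescribes):

    rpool                # zpool shared with the host
    rpool/root           # NixOS root filesystem
    rpool/home           # other host-owned datasets
    rpool/incus          # empty dataset handed to Incus; Incus owns everything beneath it
    rpool/incus/...      # datasets/zvols that Incus creates and manages itself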

So, my first question is: can I safely set up Incus on a dataset inside a pool shared with the host system, or does it need to be on its own pool?

Lastly, I read in this helpful cheatsheet that it can be useful to change the recordsize for VM storage:

For most database binaries or VM images, 64K is going to be either an exact match to the VM’s back end storage cluster size (eg the default cluster_size=64K on QEMU’s QCOW2 storage) or at least a better one than the default recordsize, 128K.

I looked through Incus’s ZFS driver source code to see if it creates zpools with a certain recordsize, but it seems it just uses default values for most of the options when it runs zpool create. Is there a preferred recordsize for pools used for Incus VMs?
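For reference, the default can be confirmed on any existing pool with something like the following (pool name is a placeholder); new datasets inherit 128K unless you override it:

    # show the recordsize on the pool's root dataset
    zfs get recordsize tank
    # NAME  PROPERTY    VALUE  SOURCE
    # tank  recordsize  128K   default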

Thanks in advance for your help!

Yes. You can create an incus storage pool from an empty dataset within an existing pool. You don’t need to dedicate a whole zpool to incus.
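A minimal sketch of what that looks like, assuming a root pool called rpool (the pool, dataset, and storage pool names are placeholders; the dataset just needs to be empty):

    # create an empty dataset for Incus inside the existing root pool
    zfs create rpool/incus

    # point Incus at that dataset; it creates everything it needs underneath it
    incus storage create default zfs source=rpool/incus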

It works fine for me, and I can’t see any reason not to do it. Most of the interesting settings live at the ZFS dataset/zvol level. The zpool is where you configure your vdevs and hence your replication layout (mirror/raidz/dRAID), so if you have different replication requirements for your VMs, they’ll need to be in a different zpool. You also have to get ashift right (things will suck badly if you don’t), but it usually picks the correct value automatically.
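If you’d rather pin ashift explicitly than rely on autodetection, it has to be set at pool creation time, for example (pool name and device path are placeholders):

    # ashift=12 corresponds to 4K sectors, a safe choice for most modern SSDs
    zpool create -o ashift=12 tank /dev/disk/by-id/nvme-...

    # verify it afterwards; ashift cannot be changed once the vdev exists
    zpool get ashift tank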

I don’t have a simple answer to that, but note that recordsize can be set at the dataset level, so you don’t need to worry about it at the pool level, and you can tune it for different VMs.
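For example, something like this (dataset names are placeholders); note that Incus-created VM block volumes are typically zvols, where the analogous knob is volblocksize rather than recordsize:

    # set a 64K recordsize on the Incus dataset without touching the rest of the pool
    zfs set recordsize=64K rpool/incus

    # child datasets inherit it unless overridden; other datasets keep the 128K default
    zfs get -r recordsize rpool/incus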

Correct.

Aside: I would suggest setting up a mirror between 500GB partitions on the two SSDs (leaving a little bit of room for /boot and /boot/efi on the first), and a separate 3.5TB zpool for bulk data on the second.
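Purely as an illustration, that layout would be created along these lines (device paths, partition numbers, and pool names are placeholders):

    # rpool: mirrored across ~500GB partitions on both SSDs (root + Incus)
    zpool create -o ashift=12 rpool mirror \
        /dev/disk/by-id/nvme-SSD1-part2 \
        /dev/disk/by-id/nvme-SSD2-part1

    # bulk: single-device pool on the remaining ~3.5TB of the second SSD
    zpool create -o ashift=12 bulk /dev/disk/by-id/nvme-SSD2-part2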

ZFS checksums everything. If it detects that a file is corrupted, it will refuse to return it to the caller. But with a mirror, it can identify the correct copy and use it to repair the bad one. It’s therefore a huge bonus for data availability.

This leaves you with 3.5TB unprotected for “bulk” data. As long as it’s regularly backed up, you should be able to restore any individual files that become corrupted. After a scrub, “zpool status -v” will tell you which ones they are.
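For example, using the bulk pool name from above as a placeholder:

    # scrub the pool, then list any files with unrecoverable errors
    zpool scrub bulk
    zpool status -v bulk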


Thanks for the tip! Yeah, I’m planning on adding a larger HDD for backups of everything, but it’s probably not a bad idea to mirror the 500 GB directly and then store additional backups of it on the HDD. I guess the main advantage of having the mirror would be automatic error correction, whereas if I was just relying on backups to the HDD with no mirror, I’d have to wait to notice that a file was corrupted and then go repair it from backups?

Yes: and repairs happen transparently in the background while the system runs, without returning EIO to consumer applications. It’ll repair data inside snapshots too, which is tricky otherwise.

Thanks so much for the response! I’m going to proceed with a single pool then and I’ll give Incus an empty dataset to work with. I’ve done some digging on my drive (WD_BLACK SN770) and am having trouble determining what the physical sector size is… fdisk -l is reporting 512B for logical and physical, but I’ve heard that it can report 512 as physical even when it’s not the case… I think I’ll probably just set ashift to 12.
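For reference, nvme-cli can list the LBA formats a drive actually supports (device path is a placeholder); many NVMe drives only expose a 512-byte format anyway, in which case ashift=12 is still a reasonable default:

    # show the supported LBA formats and which one is currently in use
    nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"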

Oh cool, I didn’t realize it could be set at the dataset level. Sounds great!