Mounting ZFS datasets to containers

Hello,

Using LXD, I’m trying to make an existing ZFS dataset (created with zfs create POOL/foobar and already holding data) available to an LXD container.
The ZFS dataset is meant for a single container, holds data for a single application in that container, and should persist even if I decide to get rid of the container at some point.

So far I found several ways to do it, but none that makes me entirely happy:

  1. creating a zvol, moving the data to the zvol, and adding it to the container with lxc config device add myContainer myStorage disk source=/dev/zdxy path=/mnt/smthing (see the sketch after this list)
    The issue (at least for me) with this approach is that I have to specify the size of the zvol up front instead of just letting it take the space it needs, although the size can be changed manually later.

  2. creating an LXD zfs storage pool and copying the data to it (as creating it from/importing a non-empty dataset doesn’t work)
    This could certainly work, but the data is somewhat trapped in LXD, as removing the storage from LXD also removes the backing ZFS dataset(s).
    There’s also this note: “Note that LXD will assume it has full control over the ZFS pool or dataset. It is recommended to not maintain any non-LXD owned filesystem entities in a LXD zfs pool or dataset since LXD might delete them.”

  3. setting the dataset’s mountpoint somewhere on the host, mapping some UID/GID between host and container, and running lxc config device add myContainer myStorage disk source=/mnt/foo path=/mnt/bar

So far this is the most promising way of doing it, but it requires me to create a group on the host system. On the plus side: the folder is easily accessible on the host system.
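
For reference, option 1 boils down to roughly the following (a sketch only; the zvol name, size, filesystem and paths are placeholders I made up, and I’m assuming the original dataset is mounted at /POOL/foobar):

# create a fixed-size zvol and put a filesystem on it
zfs create -V 10G POOL/foobar-vol
mkfs.ext4 /dev/zvol/POOL/foobar-vol
# copy the existing data over
mkdir -p /mnt/tmp
mount /dev/zvol/POOL/foobar-vol /mnt/tmp
cp -a /POOL/foobar/. /mnt/tmp/
umount /mnt/tmp
# hand the block device to the container as a disk device
lxc config device add myContainer myStorage disk source=/dev/zvol/POOL/foobar-vol path=/mnt/smthing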

Are these all the ways to do it, or did I miss something more suitable? I just want to mount a ZFS dataset to some path in the container without giving LXD control over the ZFS dataset itself.

Thanks for your help,

Hi,

If you are already on a storage-API-enabled LXD instance (I’d recommend 2.17), or are fine with switching to our feature branch which is currently at LXD 2.17, then there is a more suitable way that will allow you to do what you want. First of all, you need to create a new custom storage volume within your storage pool:

lxc storage volume create <pool-name> <volume-name>

You can then simply attach this storage volume to your container by doing:

lxc storage volume attach <pool-name> <volume-name> <container-name> <device-name> path=</some/path/in/the/container>

The nice feature about this is that, as long as all the containers this storage volume is attached to have the same idmapping specified, LXD will automatically shift the idmapping of the storage volume as well, so that your containers have read-write access. Given that you said these volumes are container-specific, you will have a correctly mapped custom storage volume available for your container.
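
As a concrete example, assuming a pool named “default”, a container named “myContainer”, and a volume I’m calling “appdata” (all placeholder names):

lxc storage volume create default appdata
lxc storage volume attach default appdata myContainer appdata path=/mnt/appdata

You can then verify the device was added with lxc config device show myContainer.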

2. creating an LXD zfs storage pool and copying the data to it (as creating it from/importing a non-empty dataset doesn’t work)
This could certainly work, but the data is somewhat trapped in LXD, as removing the storage from LXD also removes the backing ZFS dataset(s).
There’s also this note: “Note that LXD will assume it has full control over the ZFS pool or dataset. It is recommended to not maintain any non-LXD owned filesystem entities in a LXD zfs pool or dataset since LXD might delete them.”

To clarify this point: what we are warning about is that if you delete the storage pool itself and there are datasets in it that LXD is not aware of, LXD might delete them, but usually won’t, since we have code to detect them. But we do not, as of yet, want to guarantee that.
This however does not affect custom volumes created directly via the storage API as I outlined before, since LXD knows that these volumes exist and so won’t let you delete the storage pool as long as they are around. So you can keep them as long as you want!

Thanks for your reply.
I forgot to mention that I’m on LXD 2.17 installed with snapd on Debian Stretch.

The way you described is pretty much what I mentioned as option 2. It requires me to initially create a storage pool with lxc storage create MyContainerData zfs source=myOtherTank/newPool, or use an existing one or the default. It will also create additional datasets like deleted, images, and so on that aren’t necessary in this case.
The create command will then just create a new dataset in the just-created pool.

Why isn’t it possible to just use the lxc storage volume attach <pool-name> <volume-name> <container-name> <device-name> path=</some/path/in/the/container> command with a ZFS volume created externally, to basically get an unmanaged storage volume (as in, LXD only attaches it)? It’s handled similarly with e.g. network devices set up outside of LXD.

“lxc storage volume attach” is effectively just a shortcut for “lxc config device add”; at the API level they are identical.
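
To illustrate the equivalence, the attach command from earlier should correspond to a disk device that references the pool and the volume by name, roughly like this (a sketch; I’m assuming the disk device type’s pool property from the storage API here):

lxc config device add <container-name> <device-name> disk pool=<pool-name> source=<volume-name> path=</some/path/in/the/container>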

ZFS is a bit special in that it allows a “source path” to be some fancy string “POOL/DATASET” which isn’t the case for any of the other Linux filesystems. So if you tell LXD that you want “POOL/DATASET” mounted to some path in the container, LXD will tell the kernel to do that, which will fail because the source path doesn’t actually exist.

You can work around that by using raw.lxc to tell liblxc exactly what you want, including the filesystem, which should get you around this particular problem.

lxc config set <container name> raw.lxc "lxc.mount.entry=POOL/DATASET some/path/in/the/container zfs defaults 0 0"

You are however pretty likely to then hit another ZFS issue which is that ZFS very much doesn’t understand mount namespaces and containers. So if the dataset is mounted anywhere on the host, ZFS will refuse to have it mounted in the container. Similarly, if it’s not mounted on the host and you attach it to the container, you’ll then be unable to mount it on the host.

Even more annoyingly, the mount namespace cleanup code can take a long time. So when rebooting the container, the filesystem may remain mounted in a dead mount namespace, preventing anyone from mounting it for several minutes or hours.

Also note that even if all of this is fine and you decide to use that raw.lxc trick, you’ll still need to manually do the initial uid/gid mapping on the host, as there’s no kernel magic to map those uids/gids based on where the filesystem is mounted.

So the short answer here would be:

  • The easy way with recent LXD is to use “lxc storage volume create POOL SOME-NAME”, then attach it to your container with “lxc storage volume attach POOL SOME-NAME SOME-CONTAINER SOME-DEVICE SOME-PATH”. LXD will do the initial uid/gid mapping for you and you can access it from the host in /var/lib/lxd/storage-pools/POOL/custom/SOME-NAME/
  • Manually create your dataset through zfs and have it mounted somewhere on the host by setting mountpoint= to some path there. Then use a disk device entry on your container to have it mounted in there: “lxc config device add SOME-CONTAINER SOME-DEVICE disk path=SOME-PATH source=HOST-PATH”. In this case, you’ll have to do the initial uid/gid mapping yourself on the host before the container can make use of it (see the sketch after this list).
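
As a sketch of that second way (the 100000:100000 shift matches the default range unprivileged containers get from /etc/subuid and /etc/subgid on most systems; check your container’s actual range before relying on it):

# create the dataset and mount it on the host
zfs create -o mountpoint=/mnt/foo POOL/foobar
# shift ownership into the container’s uid/gid range;
# uid/gid 0 in the container is typically 100000 on the host
chown -R 100000:100000 /mnt/foo
# attach it to the container as a plain disk device
lxc config device add SOME-CONTAINER SOME-DEVICE disk source=/mnt/foo path=/mnt/bar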

There is ongoing kernel work upstream on something called “shiftfs” which will eventually allow us to mount a filesystem in both the host and the container and have uids/gids show up as the same in both, with the kernel doing on-the-fly translation. But this isn’t there yet and it’s unclear when it will be.


Thank you for the detailed explanation.

I’m looking forward to that addition to the kernel, but probably won’t see it on Debian anytime soon. For now I’ll probably just stick to mounting the dataset somewhere on the host and adding it with lxc config device; that seems to fit my scenario best without bending the intended way of doing things too much.

How do I do the initial uid/gid mapping?