LXD ZFS and Ceph strategy

Hi all,

LXD version: 3.2 (snap)
Host: Ubuntu 18.04

I am trying to figure out the best backup strategy for my LXD cluster.
I have a 5-node LXD cluster with 8 disks in each node. Of these 8 disks, 1 is for the system, 2 are for the LXD ZFS pool and the remaining 5 are for the Ceph cluster.
I have added a ZFS pool as storage on each host (lxd-local) and a Ceph storage pool common to all hosts (lxd-remote); the rough creation commands are sketched below the questions.
The idea is that the ZFS pool will be used to run non-essential containers and the Ceph storage will be used to host images, backups and important containers.
I understand that using the same Ceph RBD pool on all hosts is not supported. My questions are:

  1. How can I keep my images on the Ceph pool and use the ZFS pool to run them?
  2. If I create an RBD pool for each host on the Ceph cluster, can I use the same cluster for all hosts?
  3. If I use a Ceph RBD pool and for some reason need to rebuild a host, how can I run lxd init so that it uses the already existing Ceph RBD storage pool?
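
For reference, the two pools were created roughly like this (the ZFS dataset and Ceph OSD pool names are placeholders from my notes, so the exact options on another setup may differ):

    # per-host local pool, backed by the two ZFS disks
    lxc storage create lxd-local zfs source=tank/lxd
    # shared pool on the Ceph cluster, visible to all cluster members
    lxc storage create lxd-remote ceph ceph.osd.pool_name=lxd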

Thanks,
Shantur

When using CEPH with clustering, you must use the same pool on all nodes.
When rebuilding a host, it will just use that same pool again without any particular action being needed.

As far as LXD images are concerned, what matters to LXD is what's in /var/lib/lxd/images on the various hosts; the images stored on ZFS or CEPH are only there to optimize container creation time.
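
A quick way to confirm from any node that the shared pool is the same everywhere is the usual storage commands (using the pool names from the original post):

    lxc storage list
    lxc storage show lxd-remote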

@stgraber: Thanks for your reply.

When using CEPH with clustering, you must use the same pool on all nodes.
When rebuilding a host, it will just use that same pool again without any particular action being needed.

In the event of a node failure, can I just restart a container that was running on the failed node on a different node, using the same Ceph cluster?

As far as LXD images are concerned, what matters to LXD is what's in /var/lib/lxd/images on the various hosts; the images stored on ZFS or CEPH are only there to optimize container creation time.

Is it possible to mount CephFS at /var/lib/lxd/images on all hosts? That way the images would be more resilient to node failure.

Re restarting a container on a different node after a failure, yes, you can. You need to run "lxc move <container> --target <node>" before starting the container again. See this test: https://github.com/lxc/lxd/blob/master/test/suites/clustering.sh#L434.
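
Concretely, assuming a container named c1 that lived on the failed node and a surviving node named node2 (both names made up here), the recovery would look roughly like:

    lxc move c1 --target node2
    lxc start c1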

Re mounting CephFS for images, I think so, but @stgraber or @brauner should be in a better position to confirm that.

Re restarting a container on a different node after a failure, yes, you can. You need to run "lxc move <container> --target <node>" before starting the container again. See this test: https://github.com/lxc/lxd/blob/master/test/suites/clustering.sh#L434.

That's awesome. Thanks.

@stgraber or @brauner

Is it possible to mount CephFS at /var/lib/lxd/images on all hosts? That way the images would be more resilient to node failure.

I don't think that'll work quite as well as you'd think. Moving /var/lib/lxd/images to network storage will not make LXD aware that the image is available on all nodes, so when creating a container on a node that doesn't have the image in its database, LXD will attempt to download the image from another node, treat the existing file on disk as a likely broken artifact from a previous download, and delete it.

This will in turn prevent the download from another cluster node as that node will no longer have access to the file.

Even if we changed LXD to somehow handle this case, for example by attempting to validate the on-disk file rather than just throwing it out as likely bad, this would still expose you to a lot of race conditions, as the individual nodes don't know whether another node is performing an image update or is already downloading the same image they are.

This could then easily result in multiple nodes attempting to write to the same image file, causing corruption.

@stgraber
Understood. It may lead to issues.

What is the suggested way of securing private images?

At this point there are two strategies:

  • Back them up somewhere (using lxc image export, as in the sketch below)
  • Make sure that they've been used on two or more nodes so you can always recover them from another one
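
A minimal export/import round trip could look like the following (the alias and paths are placeholders, and a split image will produce separate metadata and rootfs tarballs that both need to be passed to import):

    # on a node that currently has the image
    lxc image export my-private-image /backup/my-private-image
    # later, to restore it on any node (extension depends on how the image was packed)
    lxc image import /backup/my-private-image.tar.gz --alias my-private-image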

There's definitely some work we should be doing on image replication within the cluster. Most users only have cached images from an external remote, so it's not that relevant in that case, but locally published images should have some replication mechanism to ensure that we always have them on at least 3 nodes or so.

If LXD could have a storage pool configured for images (like we do for volumes), that would make it really sweet.

I don't think that'll work quite as well as you'd think. Moving /var/lib/lxd/images to network storage will not make LXD aware that the image is available on all nodes, so when creating a container on a node that doesn't have the image in its database, LXD will attempt to download the image from another node, treat the existing file on disk as a likely broken artifact from a previous download, and delete it.

This will in turn prevent the download from another cluster node as that node will no longer have access to the file.

@stgraber: Thinking about it more, if I mount /var/lib/lxd/images on a different share for each host, that would work, wouldn't it? I believe Ceph's built-in dedup will make sure most of the data is written only once.
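
Something along these lines, with one CephFS subdirectory per host (the monitor address, secret file and directory names are made up for illustration):

    # on host1
    mount -t ceph 10.0.0.1:6789:/lxd-images/host1 /var/lib/lxd/images \
        -o name=admin,secretfile=/etc/ceph/admin.secret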

Yeah, that should be fine