Any issues with using Linbit externally for Incus LTS storage pool?

So I'm currently using the LTS version of Incus, and I've wanted to use Linbit SDS… I know there is a driver being worked on in the stable branch, but I'm sure it'll take time until it hits LTS.

Would setting up the storage system outside of Incus cause any issues when running it with Incus?

So, if I use Linbit with DRBD and NVMe-oF to deploy an LVM-thin disk across my infrastructure, and then just have Incus use it as an LVM storage backend, would I be able to do everything that a regular local LVM setup would give me (i.e. snapshots, backups, live migrations, etc.)?

I'm introducing a block device across all my nodes, so technically it's OK, right?
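
To be concrete, what I had in mind on the Incus side was roughly this (device path and names are placeholders, and whether this is actually safe across nodes is exactly what I'm asking):

```
# Same LINSTOR/DRBD + NVMe-oF block device visible on every cluster member
incus storage create shared-lvm lvm source=/dev/drbd1000 --target node1
incus storage create shared-lvm lvm source=/dev/drbd1000 --target node2
incus storage create shared-lvm lvm source=/dev/drbd1000 --target node3
# ...one per member, then finalize the pool
incus storage create shared-lvm lvm
```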

I thought Linstor didn’t really like multi-writer situations outside of live migrations.

Mounting it in two places should be fine (that's what's used for live migration), but not more. You'd need something to orchestrate where the volumes are mounted… and that's precisely what the driver being developed does. Honestly, I wouldn't go down this road and would rather wait a bit for us to finish the driver (we're still polishing edge-case behaviors).

I kinda have to deploy soon and honestly I want to stick to LTS. If I manage the orchestration, it should be fine, right?

Managing the orchestration will mean creating resources (≈ incus volumes) outside of Incus, mounting them on the satellite node (incus server) of your choice, then starting your instances. In most cases, you won’t be able to do live-migration, so you’ll have to stop instances, unmount and remount resources, move instances and start them again.
But if you really need to do it, well that could work, but it’s definitely not production-grade.
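
To make it concrete, a single manual move would look roughly like this (just a sketch; resource names, sizes, the storage pool and the way the instance consumes the volume are all made up for illustration):

```
# Create the volume outside Incus and place it on a first node
linstor resource-definition create vm1-data
linstor volume-definition create vm1-data 50G
linstor resource create node1 vm1-data --storage-pool pool1

# Consume it on node1, e.g. attached as a raw disk
incus config device add vm1 data disk source=/dev/drbd/by-res/vm1-data/0
incus start vm1

# Moving vm1 to node2 later means doing the dance by hand
incus stop vm1
linstor resource create node2 vm1-data --storage-pool pool1   # add a replica on node2
incus move vm1 --target node2
incus start vm1
linstor resource delete node1 vm1-data                        # optionally drop the old replica
```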

Wouldn't just creating one huge LVM block device and having it available on all my satellite nodes (also Incus cluster members) take care of all that, Jafar?

You mean treating it as a single device, and using LVM with sanlock within that linstor volume to break it up?
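
I.e., purely to illustrate, something along these lines (device and VG names are made up, and lvmlockd plus sanlock would have to be set up on every node first):

```
# One big LINSTOR/DRBD-backed device visible on all nodes, turned into a shared VG
vgcreate --shared vg_shared /dev/drbd1000
# (each node then also has to start that VG's lockspace through lvmlockd before using it)

# Carve per-instance logical volumes out of it, activated only where needed
lvcreate -L 50G -n vm1-disk vg_shared
lvchange -aey vg_shared/vm1-disk    # exclusive activation on the node that runs vm1
```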

I think it would probably work, but I would worry about LVM extents being aligned correctly with DRBD chunks. If you have VM1 and VM2 using adjacent extents in LVM, and VM1 is writing to the high end of its block device while VM2 is writing to the low end, you don't want to be causing DRBD replication conflicts.

AFAIK, Linstor doesn't have a concept of a "primary" replica which all I/O would be routed through. If it did, then it should be fine, but then all reads and writes would be forwarded across the network to the primary node.

Sounds like I should wait. I do have a question for @stgraber… when the driver becomes available, will we have the ability to choose NVMe/RDMA, iSER, or NVMe/TCP? Will it be automated? I tried going through the PR but couldn't really figure it out.

I feel the question is more suited to the people developing the driver :slight_smile:
The driver will require you to have a LINSTOR satellite node on each Incus server of your cluster. Then, how you manage your LINSTOR cluster is up to you; the driver makes no assumptions and doesn't try to configure anything below resource groups. So you can have your controller nodes where you want, and use NVMe-oF over anything.
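
To give an idea, the LINSTOR side is something you'd set up yourself, roughly like this (node names, IPs and pool layout are just examples; the transport between nodes is whatever you configure in LINSTOR/DRBD):

```
# Register each Incus server as a LINSTOR satellite (names/IPs are examples)
linstor node create incus01 10.0.0.1 --node-type satellite
linstor node create incus02 10.0.0.2 --node-type satellite
linstor node create incus03 10.0.0.3 --node-type satellite

# Back each satellite with an LVM-thin storage pool
linstor storage-pool create lvmthin incus01 nvme_pool vg_nvme/thinpool
linstor storage-pool create lvmthin incus02 nvme_pool vg_nvme/thinpool
linstor storage-pool create lvmthin incus03 nvme_pool vg_nvme/thinpool

# The driver then only works against a resource group
linstor resource-group create incus_rg --storage-pool nvme_pool --place-count 2
linstor volume-group create incus_rg
```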

@bensmrs @candlerb I am still in constant pain thinking about this. I want to stick to LTS for the sake of stability, as I will be using this cluster to teach classes at a university.

I guess, if you don't mind, maybe you could help me think through whether it's worth the wait, or whether I should just do Ceph with an NVMe-oF gateway.

I have a 9-node cluster, of which 3 nodes will be my primary storage nodes.
Each of those 3 nodes will carry 8 Micron 7300 MAX U.2 drives (800GB each).

My original plan was to just put all 24 drives in one of our servers, export them to the other 8 nodes over NVMe/RDMA, and use the lvmcluster driver. Unfortunately, I think I'd lose out on too many backup and snapshot features, and I'd also be too exposed to a single point of failure for the cluster.
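
(On the initiator side, that plan would have looked roughly like this, with made-up addresses and NQN:)

```
# Discover and attach the namespaces exported by the storage server over RDMA
nvme discover -t rdma -a 10.0.0.10 -s 4420
nvme connect  -t rdma -a 10.0.0.10 -s 4420 -n nqn.2024-01.example.lab:storage01
```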

Incus LTS is incredibly stable and I love it. I would like to stay with it, but I have thought about switching to the stable branch once the LINSTOR driver is implemented.

If I decide to stay with the LTS version, what would you recommend as my best course of action?

The network is a mix of 40Gb and 100Gb nodes (all Mellanox CX3 Pro and CX6-DX cards with a Mellanox SN2700 switch), and will be all 100Gb in about 6 months.

Using Ceph as my storage backend is a bit meh in my opinion, because I know that NVMe/TCP will be CPU-intensive since the CX3 Pro does not do any offloading for it, though I know it's capable of "some" RDMA offloading.

Linbit just seems so nice and easy to use, and that's why I thought using the Linbit driver with DRBD and NVMe/RDMA would be my best option.

There is just something I don’t understand from your previous messages…

If I were to use 3 storage nodes with Linstor, then I would have the storage capacity of a single node with 2 copies, right?

Would it be right to say that if I go about it like that, I could then expose that capacity (i.e. the storage capacity of 1 storage node, which is 8 of the Micron 7300 MAX drives) to all my nodes via whatever protocol I want (NVMe/TCP, NVMe/RDMA, or whatever)?

If those 8 storage drives are presented to the rest of my nodes as an LVM block storage device, why would I have issues with LVM extents being aligned correctly?

I would like it so that if one of my storage nodes goes down, one of the other 2 storage nodes (out of the 3) picks up for the Incus cluster and operations continue.

Is it correct to think I can achieve this with LINSTOR?

Well I’m also using Incus at a University, and the monthly releases are already pretty stable. There are a whole bunch of tests for containers, so if something breaks (which you’re not gonna see that often honestly), it’s mostly gonna be with virtual machines, and often it’s a problem with QEMU and not Incus itself.

Well, features are nice, but only when you really use them. I wouldn’t worry about this aspect. I would definitely worry about having a SPoF for storage though :slight_smile:

I honestly think monthly releases are very stable.

It depends on how you configure your LINSTOR resource group. You can have whatever replication count you want: 1 (which I wouldn't recommend; it leaves you with 3× single-node capacity but no redundancy), 2 (leaving you with 1.5× single-node capacity) or 3 (leaving you with 1× single-node capacity).
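
With the drive counts you gave (3 storage nodes × 8 × 800GB ≈ 19.2TB raw), that works out to roughly:

```
place-count 1:  ~19.2TB usable (3× single node, no redundancy)
place-count 2:   ~9.6TB usable (1.5× single node)
place-count 3:   ~6.4TB usable (1× single node)
```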

I’m not sure I understand the entirety of this. Regarding the protocol, that goes beyond my LINSTOR use case (as my Incus nodes are also storage nodes), so you’ll have to read the docs…

I'm not sure what @candlerb meant by that, so I'll let him answer :slight_smile:

Yeah, that's basically how LINSTOR is supposed to work. I would run some tests to see how it behaves during the transient state, though.

From my ancient knowledge of DRBD (which is from ganeti and DRBD version 8, not version 9 which Linstor uses), DRBD keeps a bitmap for “dirty” areas, those which need resyncing; so after disconnection and reconnection, it knows for each chunk whether it needs to replicate A to B or B to A.

From memory, I had a feeling that these dirty areas were chunks of 128MiB, but I can’t find the information online.

LVM extents are by default 4MiB. So a large chunk like this could cover extents from multiple logical volumes.

However, I could be completely wrong. I found some info saying that the DRBD metadata rule of thumb is 32KiB per 1GiB. If that were entirely dirty bitmap, it would be one bit per 4KiB, which is very fine-grained and wouldn't be a problem at all.
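
(Quick sanity check of that rule of thumb:)

```
1 GiB / 4 KiB   = 262144 blocks to track
262144 bits / 8 = 32768 bytes = 32 KiB of bitmap per GiB
```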

There are drbdadm dump-md and drbdmeta dump-md which might help. But on the ganeti / drbd8 systems I have access to, I can’t get drbdmeta ... dump-md to work.

I found the metadata internals. The quick-sync bitmap is indeed 1 bit per 4KiB, so you should be good to go.

Ganeti creates external metadata in a separate logical volume of size 128MiB. I think that’s where I got that number from.

@candlerb @bensmrs thank you both for this answer, I'm very excited now, woo hoo!

@candlerb I'm sorry if I sound misinformed, but would changing the LBA format of my drives have any positive results?

I was wanting to use:
LBA Format 2 : Metadata Size: 0 bytes - Data Size: 4096 bytes

but would adding per-LBA metadata for NVMe help in any way with the setup?
LBA Format 3 : Metadata Size: 8 bytes - Data Size: 4096 bytes
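
(For reference, I was planning to check and switch the format roughly like this with nvme-cli; I know a format wipes the namespace:)

```
# List the LBA formats the namespace supports
nvme id-ns -H /dev/nvme0n1

# Switch to the 4KiB, no-metadata format (destroys all data on the namespace!)
nvme format /dev/nvme0n1 --lbaf=2
```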

As far as I know, DRBD can’t make use of SSD metadata extensions. Actually, I don’t know of any application which can use them :slight_smile: