Questions about clustered LVM

I have some questions regarding the clustered LVM storage option for Incus that I could not find any answers to in the docs and have not been able to test myself.

  1. Is it possible to create snapshots when using clustered LVM storage? The lvmlockd(8) man page mentions that snapshots are possible with an exclusive activation but not when an LV is in shared mode, and I'm unsure which mode Incus actually uses.
  2. How does an Incus cluster member deal with the loss of all paths to a shared LUN, e.g. when the HBA fails? Will the affected Incus host be evacuated or marked as unhealthy?
  3. When exporting shared block storage from a disk array, it is common to create multiple LUNs to spread the I/O load across controllers. What is the recommended way to consume these LUNs on the Incus side? Aggregate all exported LUNs into a single VG backing a single storage pool, or have one VG and one storage pool per LUN?

Thanks a lot in advance!

Yes, snapshots work properly. We normally run with a shared lock to allow for easy recovery in case a system dies, but we temporarily re-acquire exclusive access during snapshots and disk resize operations.
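For reference, the command involved is just the normal snapshot command; Incus performs the lock transition itself. A rough sketch (instance and VG names here are made up):

```shell
# Snapshot an instance on a clustered LVM pool; Incus temporarily
# re-acquires the exclusive lock for you:
incus snapshot create myvm snap0

# Roughly what happens underneath, in lvmlockd terms (illustrative):
#   lvchange --activate ey shared_vg/myvm   # shared -> exclusive
#   lvcreate --snapshot ...                 # take the snapshot
#   lvchange --activate sy shared_vg/myvm   # back to shared
```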

It depends on the usage pattern. If you're dealing with a lot of small instances that all hit the disk in about the same way, then having a few separate pools may be useful and gives you a bit more flexibility if you ever need to reconfigure things.

If you’re dealing with a few instances doing the bulk of the disk access, then having LVM stripe the storage together into a single VG is probably the way to go.
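As a sketch of the single-VG layout (device paths and names are examples, not a recommendation for any particular array):

```shell
# Aggregate two multipath LUNs into one shared VG:
vgcreate --shared shared_vg /dev/mapper/mpatha /dev/mapper/mpathb

# One clustered LVM storage pool on top of it:
incus storage create pool1 lvmcluster source=shared_vg
```

With both LUNs as PVs in the same VG, LVM can then stripe individual LVs across them.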

Thanks for the explanation. Regarding question 2: I looked at some of the code and from what I was able to gather, cluster members get fenced solely based on network heartbeats or lack thereof, is that right?

Yeah, that part is always tricky. Our auto-healing support indeed triggers only when Incus stops responding AND the server stops responding to ICMP (ping).

We’re doing things that way because of the significant risk that comes from starting an instance on another server when the original server isn’t fully offline.

We actually recommend that our production users integrate with the Incus event API to send a full system shutdown to their BMC or PDU when a server is detected as dead, ensuring that the machine cannot possibly still be running the workloads.
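A minimal polling sketch of that idea (the BMC naming scheme, credentials, and interval are assumptions; a production version would consume the event stream and use its own inventory for the BMC mapping):

```shell
# Poll the cluster member list; force-off the BMC of any member that
# the API reports as Offline. Everything below is illustrative.
while true; do
    incus query "/1.0/cluster/members?recursion=1" \
        | jq -r '.[] | select(.status == "Offline") | .server_name' \
        | while read -r name; do
              # Member-name -> BMC address mapping is site-specific
              ipmitool -H "bmc-${name}.example.net" -I lanplus \
                  -U admin -P secret chassis power off
          done
    sleep 30
done
```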

I know that there is a condition in which sanlock will trigger the kernel/hardware watchdog and effectively force a reboot of the system when its locks time out. That may help with this kind of situation, but it takes quite a while before it kicks in.

Makes sense to treat automatic instance restarts cautiously. The sanlock watchdog integration seems interesting. Perhaps it is also possible to have the BMC reset or shut down the host when it detects an HBA hardware failure, which would in turn trigger the existing auto-healing.

Thanks again for your insight.