Emulate Proxmox HA with Ceph

Brad_K · February 5, 2025, 11:18pm

I am looking at migrating from Proxmox/Ceph for HA to Incus. By running a Ceph cluster in parallel with my Proxmox cluster, I get HA: if one Proxmox server crashes, its VMs automatically start on an available Proxmox node in the cluster. With Ceph using 3x replication, the VM experiences mere seconds of downtime. Is this achievable with Incus?

Thanks
Brad

candlerb · February 6, 2025, 9:14am

An Incus cluster can be configured to automatically restart instances on a different node if one node fails: see cluster.healing_threshold.

However, you risk split brain if the cluster decides a node has gone offline when it is actually just running slowly or has lost management network connectivity, while the VM is still running and talking to Ceph. You could then end up with two copies of the same VM writing to the same storage.

Furthermore, you should be careful about what you want from “high availability”. Clustering adds complexity; both Ceph and Incus clustering have software failure modes which can cause the entire system to lock up. For example, a Ceph bug recently caused the entire Zabbly infrastructure to go down.

You have to decide which happens more often: a server that dies unexpectedly, or a software bug? I know where my money is.

If you want true(r) “high availability” then I would do this at a higher level: have two independent Incus nodes (or clusters) and two independent storage pools, and replicate your application, with an HA pair of stateless load balancers in front.
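
As a rough sketch of that layout (not a real load balancer; the `Backend` type, the `pick_backend` helper and the addresses are hypothetical, and the health flag is assumed to come from the load balancer's own checks), the balancer only needs to know which of the independent backends is currently healthy and keeps no state of its own:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Backend:
    """Hypothetical application replica on an independent Incus node or cluster."""
    name: str
    address: str
    healthy: bool  # assumed to be set by the load balancer's own health check


def pick_backend(backends: Sequence[Backend], request_id: int) -> Optional[Backend]:
    """Spread requests across healthy backends; return None if all are down."""
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        return None
    # No state is kept between calls, so either balancer of the HA pair
    # gives the same answer for the same input.
    return healthy[request_id % len(healthy)]


site_a = Backend("site-a", "10.0.0.1:8080", healthy=True)
site_b = Backend("site-b", "10.0.1.1:8080", healthy=False)
chosen = pick_backend([site_a, site_b], request_id=7)
```

Here every request lands on site-a until site-b passes its health check again. Neither side depends on the other's storage pool or cluster database.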

stgraber · February 6, 2025, 9:51pm

These days we check both API responses AND ping to the server, so in most cases a slow server will still be considered online and won’t trigger auto-healing.
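
To illustrate the idea only (this is not the actual Incus code; `NodeStatus`, `is_offline` and the 30-second threshold are made-up names and values), requiring both signals to fail before declaring a node offline might look roughly like this:

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Hypothetical view of one cluster member as seen by the rest of the cluster."""
    seconds_since_api_response: float  # time since the API last answered
    ping_ok: bool                      # whether the server still answers ping


def is_offline(status: NodeStatus, threshold: float = 30.0) -> bool:
    """Treat a node as offline only if BOTH checks fail.

    A slow node that misses API responses but still answers ping
    stays online, so auto-healing is not triggered for it.
    """
    api_failed = status.seconds_since_api_response > threshold
    return api_failed and not status.ping_ok


slow_node = NodeStatus(seconds_since_api_response=45.0, ping_ok=True)
dead_node = NodeStatus(seconds_since_api_response=45.0, ping_ok=False)
```

In this model `slow_node` stays online and only `dead_node` would be healed. A node that loses the network carrying both checks would still fail both, so the split-brain risk is reduced rather than removed.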