I have been experimenting with an incus cluster, with a bunch of VMs acting as hosts for containers that build and test software.
In some cases, concurrently running compilers, linkers and regression tests can load the hosts quite heavily, and while that is going on, I sometimes see cluster members go into an ERROR state due to a heartbeat timeout. (I also saw this due to a flaky connection between members, but that isn’t something incus could fix.)
I am now trying to see if changing the nice(1) priority of the incus daemon to -10 helps to lower the probability of this occurring. I don’t think I want to set the heartbeat timeout to some crazy high value, because it looks very much like the incusd process is simply getting starved for CPU time.
Concretely, all my cluster members now have an /etc/systemd/system/incus.service.d/override.conf file containing:
[Service]
Nice=-10
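For anyone wanting to check that the override is actually applied, here is a minimal sketch of a helper that prints the nice value of a running process on Linux. The function name nice_of is made up for illustration, and it assumes the daemon process is called incusd. Note that a Nice= setting only takes effect after `systemctl daemon-reload` and a restart of the service.

```shell
#!/bin/sh
# Hypothetical helper: print the nice value of the process with the given PID.
# The value is field 19 of /proc/<pid>/stat. The sed call drops everything up
# to the closing parenthesis of the command name, so a name containing spaces
# cannot shift the fields; after that, field 19 becomes awk's $17.
nice_of() {
    sed 's/.*) //' "/proc/$1/stat" | awk '{print $17}'
}

# Usage on a cluster member (assumes the daemon process is named incusd):
#   nice_of "$(pidof -s incusd)"
```

If the drop-in is picked up, this should print -10 for the daemon. `systemctl show -p Nice incus.service` should report the configured value as well.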
What is your opinion on this?