Incus service priority and heavily loaded hosts

I have been experimenting with an incus cluster, with a bunch of VMs acting as hosts for containers that build and test software.

In some cases, concurrently running compilers, linkers and regression tests load the hosts quite heavily, and while that is going on I sometimes see cluster members going into an ERROR state due to a heartbeat timeout. (I have also seen this due to a flaky connection between members, but that isn’t something incus could fix. :slight_smile: )

I am now trying to see if changing the nice(1) priority of the incus daemon to -10 helps to lower the probability of this occurring. I don’t think I want to set the heartbeat timeout to some crazy high value, because it looks very much like the incusd process is simply getting starved for CPU time.

For example, on all my cluster members I now have an /etc/systemd/system/incus.service.d/override.conf file containing:

[Service]
Nice=-10
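
For completeness: the override only takes effect after systemd reloads its unit files and the service is restarted, so something along these lines (plain systemd tooling, nothing incus-specific):

systemctl daemon-reload
systemctl restart incus
# confirm that the daemon picked up the new priority
systemctl show incus -p Nice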

What is your opinion on this?

Why don’t you use more advanced resource limiting via cgroups v2?
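
For example (the container name builder1 is just a placeholder), Incus exposes per-instance limits that it enforces through cgroups, so a sketch could look like:

incus config set builder1 limits.cpu=4              # restrict the container to 4 CPUs
incus config set builder1 limits.cpu.allowance=50%  # soft limit, applied under CPU contention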

BTW, how do you recover a cluster member from ERROR? I recently had to restart the whole incusd to recover from this, but as it was a rare case I didn’t dig into it any deeper.

I hadn’t thought of that, but indeed, that could be a good way of limiting the number of build processes that are spawned. It isn’t always predictable, though, how much memory compilers and linkers consume, so if you limit the amount of memory per container you will run into transient OOM errors more often.
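
One possible middle ground (just a sketch, again with a placeholder container name) would be a soft memory limit, which is only enforced when the host is actually under memory pressure, so short peaks don’t immediately turn into hard OOM kills:

incus config set builder1 limits.memory=8GiB
incus config set builder1 limits.memory.enforce=soft   # only enforced under host memory pressure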

Regarding the recovery, I noticed that a particular cluster member goes into the ERROR state, but at some later point it always seems to recover by itself? As soon as the load goes down, the communication apparently gets re-established.

I don’t know how often incus members retry contacting other cluster members, and whether a cluster member at some point is considered “permanently offline”, in the sense that you have to do manual steps to bring it up again.

I would guess that incus cluster recover $offline_member should be enough to bring it up again?
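
For what it’s worth, a sketch of how I’d keep an eye on the state from any member (standard commands, nothing exotic):

incus cluster list            # shows each member's state and message
incus cluster show <member>   # more detail on a single member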

Hello,

I believe those are controlled by cluster-level config:

cluster.healing_threshold (defaults to 0)

cluster.offline_threshold (defaults to 20s)

This means, if I’m correct, that by default:

After cluster.offline_threshold seconds (20 by default) without a heartbeat, a host is considered offline.
After cluster.healing_threshold seconds in the offline state, the host is considered “permanently offline” and is auto-evacuated according to the cluster config (this is disabled by default, since that threshold defaults to 0).
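
If you want to check what a cluster is currently using, these are ordinary server config keys (an unset key simply falls back to its default):

incus config get cluster.offline_threshold
incus config get cluster.healing_threshold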

I’m running a test cluster spanning 4 sites, connected over xDSL / FTTH / FTTO with a simple WireGuard setup and encapsulated L2 VXLAN for OVN, and I had to change those values to something less “LAN”:

cluster.offline_threshold = 300
cluster.healing_threshold = 5400
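
(These are cluster-wide server config keys, so setting them once from any member should be enough; roughly:)

incus config set cluster.offline_threshold=300
incus config set cluster.healing_threshold=5400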

This way, hosts are considered offline after 5 minutes (300 secs) and are auto-evacuated after 90 minutes offline, after which you may need to bring them back by issuing “incus cluster restore <member>” :slight_smile:

I’ve had a few problems with ISP outages, updates, and temporary loss of connectivity, and it has worked pretty well for a month :slight_smile:

On a side note, when your servers are updating (depending on your update scheme, e.g. Ansible or something like Landscape), it may happen that the incus package is updated… and doesn’t always come back up again.

To avoid this, the docs recommend proceeding manually, one member at a time:

  • evacuate the cluster member
  • update incus on the evacuated member
  • reboot the member and restore it
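
A rough sketch of that sequence (the member name is a placeholder):

incus cluster evacuate node1   # migrates or stops its instances, depending on each instance's cluster.evacuate setting
# ...update the incus package on node1 and reboot it...
incus cluster restore node1    # brings the instances back and marks the member online again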

As stated above, limiting the amount of resources containers are allowed to use may be easier to set up… and it avoids resource starvation from many heavily loaded containers each thinking they can use 100% of the virtual hardware, when that isn’t the case :slight_smile:

Regards,
/joen