Recommended approach for low-priority, CPU- and I/O-intensive LXD operations

Hello,

We are encountering various I/O and CPU bottlenecks on multiple LXD bare-metal servers running production containers. We have linked the I/O spikes to the backup routines that snapshot and export containers so they can be replicated to a storage NAS. The issue is that the LXD processes of “publishing” and “exporting” a snapshot seem to hog the disk and compete for disk access with the services running inside the containers. As a result, we experience production downtime during the backup slots.

Our attempts to “renice” and “ionice” the backup processes in order to deprioritise them have failed. I suspect this is because both the “lxc publish” and “lxc image export” commands are only the client side, so setting a nice priority for them has no impact on the server side of LXD, which actually performs the publication and export at its default “high” priority and causes the bottleneck.

I am not in favor of applying nice and ionice to the LXD daemon itself, as this would affect the whole system and not just the backup operations.

What would be the recommendation to deprioritise (in terms of CPU and I/O) a specific set of instructions performed by the LXD server?

Best,
Amaury

Re-nicing the LXD process itself should actually be fine, in that it won’t affect the container performance since once those are running, LXD doesn’t do anything to them anymore.
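
As a rough sketch of what re-nicing the daemon could look like, assuming a Linux host with the util-linux renice and ionice tools: the function name deprioritise and the chosen values (nice 10, best-effort I/O class at level 7) are illustrative, not something LXD prescribes. On Linux both the nice value and the I/O priority are per-thread attributes, so the sketch walks every thread of the given PID. I/O priorities are also only honoured by I/O schedulers that support them, such as CFQ or BFQ.

```shell
#!/bin/sh
# Hypothetical helper: lower CPU and I/O priority for every thread of the
# given PIDs. Needs root when the target process belongs to another user.
deprioritise() {
    for pid in "$@"; do
        for task in /proc/"$pid"/task/*; do
            [ -d "$task" ] || continue      # PID gone or glob did not match
            tid=${task##*/}                 # thread ID is the directory name
            renice -n 10 -p "$tid"          # lower CPU priority
            ionice -c 2 -n 7 -p "$tid"      # best-effort class, lowest level
        done
    done
}
```

It would be called with the daemon's PID(s), for example `deprioritise $(pidof lxd)`, assuming the daemon's process is actually named lxd on your system.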

There aren’t a whole lot of options right now to tweak things like snapshots and publishing. It will mostly depend on the filesystem you use and potentially on the compression algorithm too.

If you’re using btrfs, zfs, lvm or ceph, snapshots should be extremely cheap as far as I/O goes. For the dir backend, that’s a big issue as a full rsync needs to happen, but for that case we do have a property to tweak the behavior, rsync.bwlimit, which lets you cap rsync’s I/O.
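
For the dir backend, setting that property might look like the sketch below. The function name cap_rsync_io, the pool name and the limit are placeholders, and the value format (a size such as 50MB) is an assumption to check against the storage documentation for your LXD version.

```shell
#!/bin/sh
# Hypothetical helper: cap the bandwidth LXD gives rsync on one storage pool,
# then print the stored value back so it can be checked.
cap_rsync_io() {
    pool="$1"     # storage pool name, e.g. "default" (assumption)
    limit="$2"    # size string, e.g. "50MB" (format is an assumption)
    lxc storage set "$pool" rsync.bwlimit "$limit" &&
    lxc storage get "$pool" rsync.bwlimit
}
```

A call such as `cap_rsync_io default 50MB` would then apply the cap to a pool named default.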

Publishing is likely the problematic operation though, as it needs to make a full tarball out of your snapshot and then compress it. If you were running into CPU issues, I’d say to tweak images.compression_algorithm to use something less CPU intensive (possibly even “none”). But we don’t really have anything like that for the tarball creation process itself, and since that happens in-process using Go’s implementation of tar, we don’t even have a subprocess that we can nice appropriately.
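
Changing the compression setting could be sketched as follows. The function name set_image_compression is made up for illustration, and note that images.compression_algorithm is a server-wide key, so it affects every image the server produces, not only the backup ones. Which algorithm names are accepted depends on your LXD version.

```shell
#!/bin/sh
# Hypothetical helper: switch the server-wide image compression algorithm,
# then print the stored value back so it can be checked.
set_image_compression() {
    algo="$1"     # e.g. "none" to skip compression entirely
    lxc config set images.compression_algorithm "$algo" &&
    lxc config get images.compression_algorithm
}
```

Running `set_image_compression none` before the backup slot trades larger tarballs for less CPU time during publishing.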