I’ve been using Incus (previously LXD) for about six years as a platform for automated testing of various Linux software systems. Initially, I needed to host a test system comprising numerous CentOS 7 machines, three different database servers, and the ability to quickly spin them up for each test suite and tear them down afterward. LXD was a perfect fit.
I originally developed everything to run on my local Dell Optiplex PC, running Ubuntu 20.04 and the LXD Snap (I’ve since upgraded to Debian 12 with kernels and Incus from Zabbly repos). To run my tests in CI, I asked our IT department for a similar machine hosted in the data center, which I could connect to Jenkins as a node. They provided an Ubuntu VM with eight virtual cores, hosted on an ESXi cluster with approximately 116 other VMs. For a while, this mostly worked; the tests took slightly longer in CI, but that was acceptable.
However, the testing scope expanded to include extremely CPU-intensive tests and other tests that are both CPU and I/O intensive. These now take almost twice as long to run in CI on the VM. We tolerated this, sometimes waiting up to five days for a test cycle to complete.
Now, the testing scope has increased again, and I need to test Windows application software that interacts with a Linux-hosted backend. On my Optiplex, I’ve developed this using Incus to spin up Windows VMs and Linux containers for each test, and it works well. But as soon as I tried running it in CI on the VM, I encountered strange issues with Windows. I can’t complete the Windows installation and image creation on the VMs; the process frequently times out and reboots. So I create the Windows images on my Optiplex and copy them to the VMs, which is not ideal. Furthermore, when the Windows VMs run on the Ubuntu VM (nested virtualization), they are much slower, and other test components time out, producing errors I don’t get on my PC.
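For readers unfamiliar with the per-test pattern described above, here is a minimal Python sketch of it. It is illustrative only and not the actual test harness: the helper names (`launch_cmd`, `delete_cmd`, `incus_instances`) and the local Windows image alias `win11-test` are assumptions. It relies only on `incus launch` (with `--vm` for virtual machines) and `incus delete --force`.

```python
import subprocess
from contextlib import contextmanager


def launch_cmd(image, name, vm=False):
    """Build the `incus launch` command for a container or, with vm=True, a VM."""
    cmd = ["incus", "launch", image, name]
    if vm:
        cmd.append("--vm")
    return cmd


def delete_cmd(name):
    """Build the command that stops and removes an instance in one step."""
    return ["incus", "delete", "--force", name]


@contextmanager
def incus_instances(specs, run=subprocess.run):
    """Launch every (image, name, vm) spec, then always tear them down.

    `run` is injectable so the logic can be exercised without Incus installed.
    """
    names = []
    try:
        for image, name, vm in specs:
            # Record the name first so a half-created instance is still cleaned up.
            names.append(name)
            run(launch_cmd(image, name, vm), check=True)
        yield names
    finally:
        # Tear down in reverse order; ignore failures for instances that never started.
        for name in reversed(names):
            run(delete_cmd(name), check=False)
```

A test suite would wrap its body in something like `with incus_instances([("images:debian/12", "backend", False), ("win11-test", "client", True)]): ...`, where `win11-test` stands in for whatever locally built Windows image is available.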
It seems we’ve reached the limit with the slow Incus VM on an overallocated hypervisor, especially with nested virtualization now involved.
I’m looking for suggestions on what to request from IT. Does anyone else run similar workloads that only run for a few hours at a time, requiring significant performance during those periods but remaining idle the rest of the time? A bare metal Incus machine would be magic, but I don’t think they will agree to that.
Perhaps there is a way to host Incus on a cloud provider’s VM rather than on our own on-prem hardware, turning on a very powerful VM only when a test needs to run and turning it off the rest of the time. But I guess at that point, I don’t really need Incus anymore and could use something like Terraform.
Maybe just connecting my Optiplex PC to Jenkins as a node is the best solution.