We have a production setup of 5-6 LXD nodes.
Whenever a new provisioning request comes in, our logic totals the container units on each host and picks the host with the fewest.
We have defined three flavors: small (CPU: 1, RAM: 4 GB), medium (CPU: 2, RAM: 8 GB) and large (CPU: 4, RAM: 16 GB). A small container counts as 1 unit, a medium as 2 units and a large as 4 units, in proportion to the resources allocated to each.
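To make the current placement rule concrete, here is a minimal Python sketch of it. The host names and the dictionary layout are made up for illustration; only the flavor-to-unit mapping comes from our setup.

```python
# Units per flavor, as described above.
FLAVOR_UNITS = {"small": 1, "medium": 2, "large": 4}


def host_units(flavors):
    """Sum the container units for the list of flavor names on one host."""
    return sum(FLAVOR_UNITS[f] for f in flavors)


def pick_host(hosts):
    """Return the name of the host with the fewest container units.

    `hosts` maps a host name to the flavors of the containers running on it
    (a hypothetical layout). Ties go to the first host in iteration order.
    """
    return min(hosts, key=lambda name: host_units(hosts[name]))


# Illustrative inventory; the host names are placeholders.
hosts = {
    "lxd1": ["large", "small"],                      # 5 units
    "lxd2": ["medium", "medium"],                    # 4 units
    "lxd3": ["small", "small", "medium", "small"],   # 5 units
}
print(pick_host(hosts))  # lxd2
```

Note that nothing in this rule knows how busy the containers actually are; it only counts what has been allocated.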
This is a very primitive balancing algorithm, and it doesn’t look at actual resource utilization. As a result, we can end up with a host where all the containers are heavily used, and the load on it becomes so high that we have to restart the LXD node to get rid of zombie processes and the like.
The real solution would be some form of dynamic rebalancing of containers across hosts based on resource usage, but LXD doesn’t provide anything in this space.
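As a rough idea of what usage-aware placement could look like, here is a sketch that blends allocated units with measured utilization. Everything in it is an assumption: the `HostStats` fields, the weights, the threshold and the per-host capacity are hypothetical, and the utilization numbers would have to come from whatever monitoring is already in place. It only suggests a source and target host; actually moving a container is left out.

```python
from dataclasses import dataclass


@dataclass
class HostStats:
    units: int        # allocated container units on the host
    capacity: int     # total units the host is sized for (assumed value)
    cpu_util: float   # measured CPU utilization, 0.0 to 1.0
    mem_util: float   # measured memory utilization, 0.0 to 1.0


def score(h, w_alloc=0.3, w_cpu=0.4, w_mem=0.3):
    """Lower is better. The weights are arbitrary and would need tuning."""
    return w_alloc * h.units / h.capacity + w_cpu * h.cpu_util + w_mem * h.mem_util


def pick_host(stats):
    """Return the host with the lowest combined score for a new container."""
    return min(stats, key=lambda name: score(stats[name]))


def rebalance_candidate(stats, threshold=0.25):
    """Return (source, target) when the busiest and idlest hosts differ
    by more than `threshold` in score, otherwise None."""
    ranked = sorted(stats, key=lambda name: score(stats[name]))
    coolest, hottest = ranked[0], ranked[-1]
    if score(stats[hottest]) - score(stats[coolest]) > threshold:
        return hottest, coolest
    return None


# Illustrative numbers only.
stats = {
    "lxd1": HostStats(units=8, capacity=16, cpu_util=0.9, mem_util=0.8),
    "lxd2": HostStats(units=10, capacity=16, cpu_util=0.2, mem_util=0.3),
    "lxd3": HostStats(units=9, capacity=16, cpu_util=0.5, mem_util=0.5),
}
print(pick_host(stats))            # lxd2, despite having the most units
print(rebalance_candidate(stats))  # ('lxd1', 'lxd2')
```

In this example the unit-count rule would have picked lxd1 for the next container, which is exactly the host that is already overloaded.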
What suggestions do you have for addressing this issue, which I am sure many of you have already faced? I like LXD, but it makes tasks like this very difficult to solve without spending a lot of time building solutions to common problems that usually come with scale.