I was reading the documentation about monitoring Incus, which recommends Prometheus and Grafana (How to monitor metrics - Incus documentation), and was wondering whether there are any “best practices” for Incus+Prometheus+Grafana?
There’s no real best practice: just run Prometheus and Grafana wherever is most convenient for you. I put them in their own Incus containers (Ubuntu with snapd removed), with Prometheus installed from the release tarball and Grafana installed from their apt repository.
Then point Prometheus at your Incus instance to collect data.
For ease of use, I just expose the metrics on port 8444 without authentication or certificate verification. Prometheus scrape config:
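A minimal sketch of what such a scrape job could look like, assuming the Incus server has `core.metrics_address` set to `:8444` and `core.metrics_authentication` set to `false`. The host name `incus.example.com` and the job name are placeholders, not values from this thread; adjust them to your setup.

```yaml
scrape_configs:
  - job_name: incus                # hypothetical job name
    metrics_path: /1.0/metrics     # Incus metrics endpoint
    scheme: https                  # the metrics listener still speaks TLS
    static_configs:
      - targets: ['incus.example.com:8444']   # placeholder host, port 8444 as above
    tls_config:
      insecure_skip_verify: true   # skips certificate verification; assumes a trusted network
```

Skipping authentication and verification trades security for convenience, so this only makes sense where the metrics port is not reachable from untrusted hosts.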
Currently I use two different containers, but in the past I’ve had them in the same container. Either works, but since Grafana can be used with other backends (e.g. Loki), I think it makes sense to keep them separate, so that, for example, I could rebuild Prometheus without affecting Grafana, or vice versa.
I back up the full container by copying it to a different Incus remote. This might not be the best solution, but it works for me. A better way would be to create a data volume and attach it to the container. Then you could copy the volume to a different Incus remote or export it as an archive.
Snapshots are not really a backup. They are used for a different purpose and don’t help you if your storage fails.
I found it really useful to configure Grafana with a single stat panel that shows the lowest available RAM among containers (with the container’s name) and another for the lowest available disk space among containers. That way I need only a few alerts, which cover all containers and fire if one of the most important resources falls below the threshold.
I just came up with a query that shows CPU usage per container/VM relative to the CPU limit set on the container/VM.
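As a rough sketch, panels like these could be driven by queries along the following lines. The metric names `incus_memory_MemAvailable_bytes` and `incus_filesystem_avail_bytes` and the `mountpoint` label are what I would expect from the Incus metrics endpoint, but treat them as assumptions and check them against the output of `/1.0/metrics` on your server.

```promql
# Instance with the least available memory; bottomk keeps the name label
bottomk(1, min by (name) (incus_memory_MemAvailable_bytes))

# Instance with the least free space on its root filesystem
# (assumes the root filesystem is reported with mountpoint="/")
bottomk(1, min by (name) (incus_filesystem_avail_bytes{mountpoint="/"}))
```

In a stat panel, setting the legend to `{{name}}` and the text mode to show both value and name should display which instance is closest to the limit. An alert on either query then covers every instance with a single rule.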
sum by (name) (rate(incus_cpu_seconds_total{mode=~"user|system", instance="${incus_instance}"}[$__rate_interval])) / count without (cpu) (sum by (name, cpu) (incus_cpu_seconds_total{mode=~"user|system", instance="${incus_instance}"} > 10))
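This query counts only the CPUs that have accumulated more than 10 seconds of user or system time, which filters out unused CPUs and gives the number of active CPUs per container/VM. It then divides the CPU usage rate by that count, so the result is the usage relative to the active CPUs of each instance. I hope it’ll be useful to someone.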