[LXD] Metric exporter for instances

monstermunchkin · July 29, 2021, 11:26am


Project	LXD
Status	Implemented
Author(s)	@monstermunchkin
Approver(s)	@stgraber @elmo
Release	4.19
Internal ID	LX007

Abstract

Add a built-in metric exporter which provides metrics from instances.

Rationale

Currently, stats/metrics are only provided by the /1.0/instances/<instance>/state endpoint. Monitoring software, however, cannot easily consume this kind of output. Adding a built-in metric exporter will provide metrics for all instances. Furthermore, users who wish to visualize these metrics can easily do so, for example with Grafana.

Specification

Design

Metrics will be available at /1.0/metrics, which is where a scraper such as Prometheus fetches the data from. The new endpoint will be accessible by anyone trusted by LXD. Additionally, a metrics type certificate will only allow access to /1.0 and /1.0/metrics. The metrics list will also support filtering by project should the client certificate be project-restricted.

LXD will gather metrics every time the /1.0/metrics endpoint is called. For VMs, the request will be forwarded to the lxd-agent, which will return the metrics. The metrics will cover all running instances, and will be distinguishable through the project, name, and type tags. Furthermore, they will be prefixed with lxd_. An example:

lxd_memory_Active_bytes{project="default",name="c1",type="container"} 1024
lxd_memory_Active_bytes{project="foo",name="c1",type="virtual-machine"} 2048
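As a rough illustration of the line format shown above, here is a minimal Python sketch that renders one sample with its tags. The helper names (escape_label_value, format_sample) are hypothetical and are not part of LXD; the escaping rules follow the usual Prometheus text conventions for label values.

```python
def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict, value) -> str:
    """Render one metric sample as a single text line (hypothetical helper)."""
    if labels:
        rendered = ",".join(
            f'{key}="{escape_label_value(str(val))}"' for key, val in labels.items()
        )
        return f"{name}{{{rendered}}} {value}"
    return f"{name} {value}"


line = format_sample(
    "lxd_memory_Active_bytes",
    {"project": "default", "name": "c1", "type": "container"},
    1024,
)
print(line)
```

The tag order in the output simply follows the insertion order of the dictionary.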

Here’s the list of supported metric names:

lxd_cpu_seconds_total{cpu="<cpu>", mode="<mode>"}
lxd_disk_read_bytes_total{device="<dev>"}
lxd_disk_reads_completed_total{device="<dev>"}
lxd_disk_written_bytes_total{device="<dev>"}
lxd_disk_writes_completed_total{device="<dev>"}
lxd_filesystem_avail_bytes{device="<dev>",fstype="<type>"}
lxd_filesystem_free_bytes{device="<dev>",fstype="<type>"}
lxd_filesystem_size_bytes{device="<dev>",fstype="<type>"}
lxd_memory_Active_anon_bytes
lxd_memory_Active_bytes
lxd_memory_Active_file_bytes
lxd_memory_Cached_bytes
lxd_memory_Dirty_bytes
lxd_memory_HugepagesFree_bytes
lxd_memory_HugepagesTotal_bytes
lxd_memory_Inactive_anon_bytes
lxd_memory_Inactive_bytes
lxd_memory_Inactive_file_bytes
lxd_memory_Mapped_bytes
lxd_memory_MemAvailable_bytes
lxd_memory_MemFree_bytes
lxd_memory_MemTotal_bytes
lxd_memory_RSS_bytes
lxd_memory_Shmem_bytes
lxd_memory_Swap_bytes
lxd_memory_Unevictable_bytes
lxd_memory_Writeback_bytes
lxd_network_receive_bytes_total{device="<dev>"}
lxd_network_receive_drop_total{device="<dev>"}
lxd_network_receive_errs_total{device="<dev>"}
lxd_network_receive_packets_total{device="<dev>"}
lxd_network_transmit_bytes_total{device="<dev>"}
lxd_network_transmit_drop_total{device="<dev>"}
lxd_network_transmit_errs_total{device="<dev>"}
lxd_network_transmit_packets_total{device="<dev>"}
lxd_procs_total

For filesystem metrics, if the filesystem type name is known it will be listed with fstype. Otherwise this tag will contain the hexadecimal value of the filesystem.
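A minimal Python sketch of that fallback, assuming the filesystem is identified by its statfs magic number. The table below is only a small illustrative subset, and both the fstype_label name and the exact hexadecimal formatting (0x prefix, lowercase) are assumptions rather than what LXD emits.

```python
# Illustrative subset of Linux filesystem magic numbers.
KNOWN_FS_MAGIC = {
    0xEF53: "ext4",       # shared by ext2/ext3/ext4
    0x9123683E: "btrfs",
    0x58465342: "xfs",
    0x01021994: "tmpfs",
    0x2FC12FC1: "zfs",
}


def fstype_label(magic: int) -> str:
    """Return the filesystem name if known, else the magic number in hex."""
    return KNOWN_FS_MAGIC.get(magic, f"0x{magic:x}")


print(fstype_label(0x9123683E))
print(fstype_label(0x1234))
```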

API changes

The following new endpoint will be added to LXD:

GET /1.0/metrics

The lxd-agent will gain the following new endpoint:

GET /1.0/metrics

It will gather metrics and report them back to LXD in JSON format.

CLI changes

No CLI changes.

Database changes

No database changes.

Upgrade handling

No upgrade handling.

Further information

The provided metrics will follow the OpenMetrics specification (https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md).

stgraber · July 29, 2021, 4:55pm

This shouldn’t be needed. LXD will retrieve the needed information from the instances when /metrics is accessed. We don’t want LXD to have to do any work when nothing is scraping it.

So the goal really is that core.prometheus_address (actually, core.metrics_address would be better) gets you an HTTP server which implements GET /metrics, which then returns a whole bunch of metrics for the instances running on this machine.

This should be pretty similar to what you get when you do GET /metrics against a node-exporter daemon.

turtle0x1 · July 29, 2021, 5:32pm

How will this work with projects? One API call per project, or all projects in one call?

stgraber · July 29, 2021, 5:57pm

All projects in one call. This is really meant to be used with an external prometheus server scraping that data on a fixed interval. Such prometheus servers are normally configured to scrape each of the machines running the service and expect to get metrics for everything on the system.

We’ll tag the relevant metrics so you can easily tell what project an instance is in and can filter projects through prometheus queries.

stgraber · July 30, 2021, 9:19pm

We should add disk related ones (space/inode used, free, total).

An example output from prometheus-node-exporter would be: Ubuntu Pastebin

With the main categories we should aim to cover being:

node_cpu
node_disk
node_filesystem
node_load1 / node_load5 / node_load15
node_memory
node_network
node_procs

Our equivalents are going to be lxd_NAME using the attributes described above (project & name).

Not everything will be possible to cover, at least not initially as we’re constrained by what’s available through the cgroup1/cgroup2 interfaces.

I think a good initial set would be:

lxd_cpu_seconds_total (cpuacct.usage_all)
lxd_disk_read_bytes_total (blkio.throttle.io_service_bytes_recursive)
lxd_disk_reads_completed_total (blkio.throttle.io_serviced_recursive)
lxd_disk_written_bytes_total (blkio.throttle.io_service_bytes_recursive)
lxd_disk_writes_completed_total (blkio.throttle.io_serviced_recursive)
lxd_filesystem_avail_bytes (statvfs)
lxd_filesystem_free_bytes (statvfs)
lxd_filesystem_size_bytes (statvfs)
lxd_memory_Active_anon_bytes (memory.stat)
lxd_memory_Active_bytes (computed)
lxd_memory_Active_file_bytes (memory.stat)
lxd_memory_Cached_bytes (memory.stat)
lxd_memory_Dirty_bytes (memory.stat)
lxd_memory_Inactive_anon_bytes (memory.stat)
lxd_memory_Inactive_bytes (computed)
lxd_memory_Inactive_file_bytes (memory.stat)
lxd_memory_Mapped_bytes (memory.stat)
lxd_memory_MemAvailable (computed)
lxd_memory_MemFree (computed)
lxd_memory_MemTotal (memory.usage_in_bytes / memory.limit_in_bytes)
lxd_memory_RSS_bytes (doesn’t exist in node-exporter, memory.stat)
lxd_memory_Shmem_bytes (memory.stat)
lxd_memory_Swap_bytes (doesn’t exist in node-exporter, memory.stat)
lxd_memory_Unevictable_bytes (memory.stat)
lxd_memory_Writeback_bytes (memory.stat)
lxd_network_receive_bytes_total (netlink)
lxd_network_receive_errs_total (netlink)
lxd_network_receive_packets_total (netlink)
lxd_network_transmit_bytes_total (netlink)
lxd_network_transmit_errs_total (netlink)
lxd_network_transmit_packets_total (netlink)
lxd_procs_total (doesn’t exist in node-exporter, pids.current)
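To make the "computed" entries a little more concrete, here is a hedged Python sketch that parses the text of a memory.stat file and derives the active and inactive totals. It assumes the totals are simply the sum of the anon and file counters (as /proc/meminfo does for Active and Inactive); the thread does not spell out the formula, and the function names are hypothetical.

```python
from typing import Dict


def parse_memory_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines from a cgroup memory.stat file into a dict."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        key, raw = fields
        try:
            stats[key] = int(raw)
        except ValueError:
            continue  # skip non-numeric values
    return stats


def computed_memory_metrics(stats: Dict[str, int]) -> Dict[str, int]:
    """Derive totals, assuming total = anon + file (an assumption)."""
    return {
        "lxd_memory_Active_bytes": stats.get("active_anon", 0)
        + stats.get("active_file", 0),
        "lxd_memory_Inactive_bytes": stats.get("inactive_anon", 0)
        + stats.get("inactive_file", 0),
    }


# Made-up sample content, not real measurements.
SAMPLE = "active_anon 4096\nactive_file 8192\ninactive_anon 1024\ninactive_file 2048\n"
print(computed_memory_metrics(parse_memory_stat(SAMPLE)))
```

Parsing the whole file once and deriving every memory metric from the resulting dict keeps it to a single read per instance.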

The bulk of those will be done through parsing of cgroup statistics so will need new helpers in lxd/cgroup/ and will need testing on cgroup1 and cgroup2. In general we’ll want one cgroup function per file so we can fetch all the memory stats in one shot, …

For the network stats, we already have functions to fetch them over netlink, so linking the data should be easy enough.

The filesystem bits will be a bit trickier and may need expanding a bit on GetInstanceUsage and GetCustomVolumeUsage to give us the extra data.

We need to keep in mind that we’ll likely have prometheus instances scraping us every 30s or so and we may be running hundreds to thousands of instances locally, so it will be absolutely critical that we never spawn a subprocess as part of any of this, nor need to access any data outside of kernel filesystems.

The spec should also cover the VM behavior, where we’ll be forwarding the request to the agent in the VM. When we eventually add configuration to have VMs report their host side usage instead, this will change to getting as many of the metrics as possible from QEMU.

As for implementation, I suspect we’ll want to add a GetMetrics() function to the instance driver interface and then implement that in both LXC and QEMU.
That function should effectively return a list of metrics for the instance which will then get all merged together to make the full output of /metrics.
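The shape of that idea can be sketched in Python (the real code would be Go inside LXD). Everything below is hypothetical: Sample, StaticInstance, get_metrics and collect_metrics are illustrative names, and StaticInstance merely stands in for the LXC and QEMU drivers.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Sample:
    name: str
    labels: Dict[str, str]
    value: float


@dataclass
class StaticInstance:
    """Stand-in for an instance driver; a real driver would read cgroups
    (containers) or ask the agent (VMs) inside get_metrics()."""
    project: str
    name: str
    type: str
    samples: List[Sample]

    def get_metrics(self) -> List[Sample]:
        return list(self.samples)


def collect_metrics(instances: Iterable[StaticInstance]) -> List[Sample]:
    """Merge per-instance metrics, tagging each with project, name and type."""
    merged: List[Sample] = []
    for inst in instances:
        for sample in inst.get_metrics():
            labels = {
                "project": inst.project,
                "name": inst.name,
                "type": inst.type,
                **sample.labels,
            }
            merged.append(Sample(sample.name, labels, sample.value))
    return merged


c1 = StaticInstance("default", "c1", "container", [Sample("lxd_procs_total", {}, 12)])
v1 = StaticInstance("foo", "v1", "virtual-machine", [Sample("lxd_procs_total", {}, 80)])
for s in collect_metrics([c1, v1]):
    print(s)
```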

stgraber · August 3, 2021, 9:59pm

So I’ve been thinking a bit about the security aspect of this.

Initially my thought was to just do a core.metrics_address and run a plain HTTP listener on it, but this doesn’t really line up much with our normal stance around security.

Instead, I think we should be running this on an HTTPS endpoint and use TLS client certificate authentication (which prometheus supports) and use a new certificate type to allow metrics-only access.

The metrics would therefore just sit at GET /1.0/metrics and be accessible by anyone trusted by LXD. A metrics type certificate would only allow access to /1.0 and /1.0/metrics. The metrics list would also support filtering by project should the client certificate be project-restricted.
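From the client side, a scrape with a metrics-only certificate could look roughly like the Python sketch below, using only the standard library. The host name, file paths and port 8443 are assumptions for illustration, and trusting the server certificate by loading it as the CA file is one possible approach; hostname verification stays on, so it only works if the certificate matches the host being contacted.

```python
import ssl
import urllib.request


def build_metrics_request(host: str, port: int = 8443) -> urllib.request.Request:
    """Build the GET request for the metrics endpoint (port is an assumption)."""
    return urllib.request.Request(f"https://{host}:{port}/1.0/metrics", method="GET")


def scrape_metrics(host: str, client_cert: str, client_key: str,
                   server_cert: str, port: int = 8443) -> str:
    """Fetch the metrics text using TLS client certificate authentication."""
    # Trust only the given server certificate (assumed to be exported from LXD).
    context = ssl.create_default_context(cafile=server_cert)
    # Present the metrics-only client certificate and key.
    context.load_cert_chain(certfile=client_cert, keyfile=client_key)
    request = build_metrics_request(host, port)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network.
print(build_metrics_request("lxd.example.net").full_url)
```

Calling scrape_metrics needs a reachable LXD server and real certificate files, so only the request construction is exercised here.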

@sdeziel for the upcoming charm integration this would mean that when related to prometheus, we’ll have the leader generate a new keypair, add the public key to the LXD trust store as a metrics-only certificate and then send back the config to prometheus (server certificate, client certificate, client key, URL) so it can then safely scrape LXD.

Given the good security story with this approach, we’d do away with the separate listener, so no more core.metrics_address and matching metrics endpoint in the charm.

@sdeziel @tomp @monstermunchkin @morphis how does that sound?

stgraber · August 4, 2021, 12:25am

Had @elmo go over the list of metrics, ended up adding a couple of hugepages ones (easy to get in VMs, in containers, we can pull the hugetlb cgroup entries for 2MB pages to get limit and usage) and the dropped packets stats (which should be easy to get from netlink).

Other than that, looks like it’s covering what he cares about.

tomp · August 4, 2021, 8:40am

Sounds good security wise. Only thought I had was, could it be that an admin might want the metrics accessible on a different listener than the main API (or perhaps in addition to)? Not a blocker though, as this could be added later, and should still use the same certificate for authentication.

stgraber · August 4, 2021, 1:06pm

Yeah, thought the same. If this comes up as a requirement we can add an endpoint which only provides metrics access, similar to what we did with the cluster address.

sdeziel · August 4, 2021, 1:45pm

Adding TLS to the mix is a good idea and shouldn’t make it that much harder on the prometheus charm side to relate with LXD.

sdeziel · August 4, 2021, 1:50pm

There must be a reason for the mixed case in metric names, but I can’t find it.

stgraber · August 4, 2021, 1:57pm

I suspect because in most cases it’s how it’s spelled in the source file for the metric (like meminfo for memory).

In general, I’ve lined up our names with those from node-exporter, so it should be identically inconsistent and require minimal changes to people’s existing dashboards.

monstermunchkin · August 4, 2021, 2:00pm

I’ve updated the spec to include the suggested changes.

stgraber · August 4, 2021, 2:07pm

Looks good to me, marked as approved.

zekrioca · September 4, 2021, 6:17am

Hi all. I saw the details of release 4.18, but I saw no mention of the metric exporter. Was it delayed? Thanks!

monstermunchkin · September 4, 2021, 6:56am

Yes, it was delayed. It’s been moved to the 4.19 release.

stgraber · September 4, 2021, 5:13pm

Yeah, 4.18 was a bit optimistic considering you had two weeks off.