Retrieve LXD metrics with Telegraf and InfluxDB 2.x

Hello :slight_smile:

With the arrival of the LXD metrics endpoint feature in LXD 4.19, I thought it would be interesting to share a basic configuration to retrieve these metrics via the Telegraf agent. The metrics will be stored in an InfluxDB 2.x instance inside of an Ubuntu 21.10 container.

LXD configuration

Configure the LXD metrics endpoint to listen locally:

$ lxc config set core.metrics_address "127.0.0.1:9100"

InfluxDB configuration

Since I used InfluxDB to test this new LXD feature, here are the basic steps to ensure you can send metrics into this time series database.
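If you prefer the command line over the web interface, the bucket and token could probably be created with the influx CLI. This is only a sketch: it assumes the influx CLI is installed, already configured against an initialised InfluxDB 2.x instance, and that you want the same home.lab organization and metrics bucket used in the Telegraf output further down. The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; nothing runs until you call them.

# Create the bucket that the Telegraf output below writes to.
# Assumes the "home.lab" organization already exists.
create_lxd_metrics_bucket() {
  influx bucket create --name metrics --org home.lab
}

# Create a write-only token for that bucket.
# $1: the bucket ID, as shown by `influx bucket list --org home.lab --name metrics`
create_telegraf_token() {
  influx auth create --org home.lab --write-bucket "$1" \
    --description "telegraf LXD metrics"
}
```

Call create_lxd_metrics_bucket first, look up the bucket ID, then pass it to create_telegraf_token. The token it prints is what goes into the token field of the output configuration.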

Telegraf configuration

  • Install Telegraf on your LXD host(s): see "Install Telegraf" in the Telegraf 1.20 documentation (influxdata.com)

  • Generate a metrics certificate to allow Telegraf to query the LXD metrics endpoint:

    $ openssl req -x509 -newkey rsa:2048 -keyout /etc/telegraf/lxd-metrics.key -nodes -out /etc/telegraf/lxd-metrics.crt -subj "/CN=lxd.local"
    
  • Add the newly generated certificate to the LXD trust store (only the certificate is imported, the key stays with Telegraf):

    $ lxc config trust add /etc/telegraf/lxd-metrics.crt --type=metrics
    
  • Copy the LXD server certificate into the Telegraf configuration directory:

    $ cp /var/snap/lxd/common/lxd/server.crt /etc/telegraf/lxd-metrics-ca.crt
    
  • Change the ownership of the previous files to ensure Telegraf can read them:

    $ chown telegraf:telegraf /etc/telegraf/{lxd-metrics-ca.crt,lxd-metrics.key,lxd-metrics.crt}
    
  • Create an input and an output configuration to scrape LXD metrics and send them to InfluxDB:

    # /etc/telegraf/telegraf.d/input_lxd.conf
    [[inputs.prometheus]]
      urls = ["https://127.0.0.1:9100/1.0/metrics"]
      tls_ca = "/etc/telegraf/lxd-metrics-ca.crt"
      tls_cert = "/etc/telegraf/lxd-metrics.crt"
      tls_key = "/etc/telegraf/lxd-metrics.key"
    
    # /etc/telegraf/telegraf.d/output_influxdb.conf
    [[outputs.influxdb_v2]]
      urls = ["http://<IP address of your InfluxDB instance>:8086"]
      token = "<token generated on the web interface>"
      organization = "home.lab"
      bucket = "metrics"
    
  • Restart Telegraf to apply the configuration:

    $ systemctl restart telegraf
    
  • Wait a few minutes and the metrics should show up in the InfluxDB Data Explorer. Go to Explore in the InfluxDB interface to create a basic query.
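If nothing shows up, it may help to check each piece separately before looking at InfluxDB. The sketch below assumes the file paths and the endpoint address from the steps above, and that curl is installed. The function names are made up for illustration, and both need to run as a user that can read the key (root or telegraf).

```shell
#!/bin/sh
# Hypothetical helpers; nothing runs until you call them.

# Query the LXD metrics endpoint directly with the metrics certificate
# and show the first few lines of the Prometheus-format output.
check_lxd_metrics() {
  curl --silent --show-error --fail \
    --cert /etc/telegraf/lxd-metrics.crt \
    --key /etc/telegraf/lxd-metrics.key \
    --cacert /etc/telegraf/lxd-metrics-ca.crt \
    "https://127.0.0.1:9100/1.0/metrics" | head -n 20
}

# Run the prometheus input once and print the gathered metrics to stdout
# instead of sending them to the configured outputs.
test_telegraf_input() {
  telegraf --config /etc/telegraf/telegraf.conf \
    --config-directory /etc/telegraf/telegraf.d \
    --input-filter prometheus --test
}
```

If check_lxd_metrics fails on certificate verification, replacing --cacert and its argument with --insecure should tell you whether the client certificate itself is accepted by LXD.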

That’s it, hope this tutorial will be useful :slight_smile:

Thanks!
I moved it over to our Tutorials section.

On my side, I’ve configured my Prometheus server to start scraping my LXD servers in my production cluster. I still need to figure out a good Grafana dashboard to show the data though, then we can publish another tutorial for a setup using Prometheus and Grafana :slight_smile:

@stgraber I’m actually looking into monitoring LXD with Prometheus atm, can you share what you already achieved on the Prometheus side? I’m starting from 0, so any pointers / configs to look at would be nice.

For the scrape config I’m using:

-   job_name: lxd
    metrics_path: /1.0/metrics
    scheme: https
    scrape_interval: 30s
    scrape_timeout: 5s
    static_configs:
    -   targets:
        - '[2602:fc62:a:101::100]:8444'
        - '[2602:fc62:a:101::101]:8444'
        - '[2602:fc62:a:101::102]:8444'
    tls_config:
        cert_file: /var/snap/prometheus/current/keys/lxd-metrics.crt
        insecure_skip_verify: true
        key_file: /var/snap/prometheus/current/keys/lxd-metrics.key

This scrapes my servers every 30s (the scrape interval needs to be longer than 15s, otherwise you may get the same values twice from the cache).

On the Grafana front, I’ve not done much yet. I have a dashboard which lists all projects and instances and shows the process and network usage. So still a long long way to go. I’d love for it to show an overview of resource usage for the selected projects at the top and then show the per-instance usage underneath.

https://dl.stgraber.org/LXD-grafana.json

Thanks much, I’ll explore the new metrics with prometheus and see what I can do on the grafana side of things.

If I might make one suggestion here: add insecure_skip_verify = true, as the SSL certs we just generated aren’t technically valid since they are self-signed. Took me a few minutes to figure out how to get Telegraf to actually ask LXD for the metrics :slight_smile:

Something like this:

[[inputs.prometheus]]
  urls = ["https://127.0.0.1:9100/1.0/metrics"]
  insecure_skip_verify = true
  tls_ca = "/etc/telegraf/lxd-metrics-ca.crt"
  tls_cert = "/etc/telegraf/lxd-metrics.crt"
  tls_key = "/etc/telegraf/lxd-metrics.key"

Also, if you don’t want Telegraf spamming your syslog with useless messages about not being able to connect to a local InfluxDB, you can comment out the default [[outputs.influxdb]] in /etc/telegraf/telegraf.conf.
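For anyone who would rather script that change than edit the file by hand, here is a rough sketch using sed. It assumes the options under [[outputs.influxdb]] are already commented out, as in a stock telegraf.conf, so only the table header needs a "#". The function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: comment out the default [[outputs.influxdb]] header
# in the given Telegraf config file. [[outputs.influxdb_v2]] is left alone,
# and a line that is already commented out is not touched again.
disable_default_influxdb_output() {
  conf="$1"
  # -i.bak edits in place and keeps a backup copy as "$conf.bak"
  sed -i.bak 's/^\([[:space:]]*\)\[\[outputs\.influxdb]]/\1# [[outputs.influxdb]]/' "$conf"
}
```

Run as root, that would be disable_default_influxdb_output /etc/telegraf/telegraf.conf, followed by a restart of Telegraf.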

Ah yeah. It’s definitely possible to have LXD run a valid certificate but it’s pretty uncommon so I suspect most folks will have to rely on disabling the verification :slight_smile:

I followed this guide and it worked great… for 30 days! By default, openssl generates certs with a 30-day expiry. You can extend this when generating the certificate with -days, such as:

openssl req -x509 -days 3650 -newkey rsa:2048 -keyout ./lxd-metrics.key -nodes -out ./lxd-metrics.crt -subj "/CN=lxd.local"
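To avoid being surprised by this again, the expiry can be checked with openssl itself. A small sketch, wrapping the command above in a function and adding a check; the function names and the 31-day threshold are my own choices, not anything LXD or Telegraf requires.

```shell
#!/bin/sh
# Hypothetical helpers; nothing runs until you call them.

# Generate the metrics key and certificate in the given directory.
# $1: output directory, $2: validity in days (defaults to 3650)
generate_metrics_cert() {
  out_dir="$1"
  days="${2:-3650}"
  openssl req -x509 -days "$days" -newkey rsa:2048 -nodes \
    -keyout "$out_dir/lxd-metrics.key" \
    -out "$out_dir/lxd-metrics.crt" \
    -subj "/CN=lxd.local"
}

# Print the expiry date, then return non-zero if the certificate
# expires within the next 31 days (2678400 seconds).
# $1: path to the certificate
check_metrics_cert() {
  openssl x509 -in "$1" -noout -enddate &&
    openssl x509 -in "$1" -noout -checkend 2678400
}
```

For example, check_metrics_cert /etc/telegraf/lxd-metrics.crt on a certificate generated without -days will report that it is about to expire. A regenerated certificate has to be added to the LXD trust store again with lxc config trust add before Telegraf can use it.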