[LXD] Stream lifecycle and log events to Loki

Project LXD
Status Implemented
Author(s) @monstermunchkin
Approver(s) @stgraber @tomp
Release LXD 5.6
Internal ID LX025

Abstract

Allow LXD to stream lifecycle and log events to a Grafana Loki server.

Rationale

Currently, lifecycle and log events can be viewed by running lxc monitor or querying /1.0/events. When running multiple standalone (non-clustered) LXD servers, there is no way of aggregating the logs of all servers. Streaming lifecycle and log events to a Grafana Loki server solves this problem and also makes it possible to visualize the events on a dashboard.

Specification

Design

Streaming lifecycle and log events to a Grafana Loki server requires the loki.api.url config key to be set to the URL of the Grafana Loki server. Should the Grafana Loki server require authentication, loki.auth.username and loki.auth.password need to be set for basic authentication. If a CA certificate is needed to verify the server, it can be specified in loki.api.ca_cert.

The event types and the minimum log level to be streamed can be configured using the loki.types and loki.loglevel config keys respectively.
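
As a rough illustration only (not part of the spec), these keys could be set through the LXD Go client as sketched below; the URL, credentials and chosen values are placeholders:

package main

import (
	lxd "github.com/lxc/lxd/client"
)

func main() {
	// Connect to the local LXD daemon over its unix socket.
	c, err := lxd.ConnectLXDUnix("", nil)
	if err != nil {
		panic(err)
	}

	server, etag, err := c.GetServer()
	if err != nil {
		panic(err)
	}

	// Placeholder Loki endpoint and basic auth credentials.
	server.Config["loki.api.url"] = "https://loki.example.net:3100"
	server.Config["loki.auth.username"] = "lxd"
	server.Config["loki.auth.password"] = "secret"

	// Stream both lifecycle and logging events, at log level "info" or higher.
	server.Config["loki.types"] = "lifecycle,logging"
	server.Config["loki.loglevel"] = "info"

	err = c.UpdateServer(server.Writable(), etag)
	if err != nil {
		panic(err)
	}
}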

Every streamed event will have two fixed labels, app=lxd and type=<lifecycle|logging>. Additional labels can be specified using loki.labels. If any of these labels are found in an LXD event, they are added as labels to the Loki event and removed from the log message. All other key-value pairs are added to the log message in the form "<key1>=\"<value1>\" <key2>=\"<value2>\" <message>".

Keys at the metadata level do not have to be prefixed (level, message), while keys from the context do (context-class, context-description, context-operation, context-project). This is done to avoid clashes between keys of the same name on different levels.

Example:

loki.labels: [context-class, context-instance, context-project]

LXD Event:

location: none
metadata:
  context:
    class: task
    description: Updating images
    operation: 9007385a-4d34-46e3-a676-a4bc355319a5
    project: ""
  level: debug
  message: Started operation
timestamp: "2022-09-02T10:21:42.886203976+02:00"
type: logging

Loki event:

{
  "streams": [
    {
      "stream": {
        "app": "lxd",
        "type": "logging",
        "context-class": "task",
      },
      "values": [
        [
          "1662106902886203976",
          "context-description=\"Updating images\" context-operation=\"9007385a-4d34-46e3-a676-a4bc355319a5\" level=\"debug\" Started operation"
        ]
      ]
    }
  ]
}

Here, context-class is in loki.labels and is therefore added as a label. The keys context-description, context-operation, and level and their values are added to the log message.
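
As a rough sketch of the mapping described above (not the actual implementation; the function and variable names are made up for illustration, and ordering and batching details are ignored), the conversion from decoded event metadata to Loki labels and a log line could look like this:

package main

import (
	"fmt"
	"strings"
)

// buildEntry splits decoded event metadata into Loki labels and a log line
// following the rules above: configured labels are promoted to Loki labels,
// context keys get a "context-" prefix, empty values are dropped, and all
// remaining key-value pairs are prepended to the message as key="value".
func buildEntry(eventType string, metadata map[string]any, configuredLabels []string) (map[string]string, string) {
	labels := map[string]string{"app": "lxd", "type": eventType}

	isLabel := func(name string) bool {
		for _, label := range configuredLabels {
			if label == name {
				return true
			}
		}
		return false
	}

	message := ""
	pairs := []string{}

	// Helper used for both top-level and context keys. Note that Go map
	// iteration order is random; a real implementation would sort the keys.
	add := func(name string, value any) {
		text := fmt.Sprint(value)
		if text == "" {
			return // Empty values are dropped, as in the example above.
		}

		if isLabel(name) {
			labels[name] = text
		} else {
			pairs = append(pairs, fmt.Sprintf("%s=%q", name, text))
		}
	}

	for key, value := range metadata {
		switch key {
		case "message":
			message, _ = value.(string)
		case "context":
			// Context keys are prefixed to avoid clashes with top-level keys.
			context, _ := value.(map[string]any)
			for ckey, cvalue := range context {
				add("context-"+ckey, cvalue)
			}
		default:
			add(key, value)
		}
	}

	return labels, strings.Join(append(pairs, message), " ")
}

func main() {
	// The example event metadata from above, already decoded from JSON.
	metadata := map[string]any{
		"context": map[string]any{
			"class":       "task",
			"description": "Updating images",
			"operation":   "9007385a-4d34-46e3-a676-a4bc355319a5",
			"project":     "",
		},
		"level":   "debug",
		"message": "Started operation",
	}

	labels, line := buildEntry("logging", metadata, []string{"context-class", "context-instance", "context-project"})
	fmt.Println(labels)
	fmt.Println(line)
}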

API changes

No API changes.

CLI changes

No CLI changes.

Database changes

No database changes.

Upgrade handling

No upgrade handling.

Further information

No further information.

Direct integration with Loki is a good feature. But I hope you will consider a more generic option here.

I use the Vector daemon for log shipping and metrics aggregation.

It has generic socket and StatsD ingestion sources, and both of these have a version of TLS client cert auth.

Some benefits of targeting Vector/Fluent Bit/etc.:

  • One level of indirection lets me ship my events to many types of backends, and to several backends simultaneously
  • My Loki instance might not even be routable from my LXD server; perhaps my architecture requires this indirection
  • Log shippers can pre-process/filter events. And while Loki can do this, too, in my experience log shippers are a nicer place to do this configuration, because log shippers are deployed closer to the applications, and log/metric filtering logic is often application specific.

Currently our focus is on Loki given that we already have Prometheus integration and a Grafana dashboard, so it all fits pretty nicely.

Other options can usually be handled easily enough through either LXD’s log file or by having LXD’s syslog logger enabled and then capturing the log that way.

It’s not impossible that we may add other options down the line but we also don’t want to end up with a confusing set of options and a lot of individual options to test and validate.


@stgraber could you please take a quick look?

Will this make it more difficult to filter by log level in Loki?

Hmm, yeah, I think we definitely want a way to easily get all error or warning or info out of loki.

No, it won’t. The blog post reads:

From early on, we have set a label dynamically using promtail pipelines
for level. This seemed intuitive for us as we often wanted to only show logs
for level="error"; however, we are re-evaluating this now as writing a query
{app="loki"} |= "level=error" is proving to be just as fast for many of our
applications as {app="loki",level="error"}.

I’m confused, what’s the difference between {app="loki"} |= "level=error" and {app="loki",level="error"}? Both seem to have some kind of labels that can be programmatically filtered.

In the first query, app is a label and level=error is part of the log message.

In the second query, app and level are both labels.

What does the operator |= mean in this context btw?

It means line contains.

Reading How labels in Loki can make log queries faster and easier | Grafana Labs some more, it sounds like they are primarily warning against using labels that would result in each log entry having a high combined cardinality (i.e. the combination of labels resulting in many different combinations and many entries in the index).

But it also sounds like whether or not to use labels for certain things depends on environmental factors, the amount of logs, and the type of queries being run (the blog shows an example of doing a negative search on level, meaning that it would need to search across multiple chunks rather than a single chunk).

Might I suggest that we convert our log messages to a map[string]string (including things like log level) and then provide a configuration option that allows mapping certain fields to labels, with the rest being put into the log message itself? Then this can be user configurable and we can provide some sensible/useful defaults.

I don’t see why we’d need to convert our log messages to a map[string]string. We could have something like loki.labels which takes a list of strings. If the event metadata contains any of the configured values, these will be added as labels. I would however suggest two fixed labels which are app=lxd and type=<lifecycle|logging>.

Example:

loki.labels: [class, instance, project]

Event:

location: none
metadata:
  context:
    class: task
    description: Updating images
    operation: 9007385a-4d34-46e3-a676-a4bc355319a5
    project: ""
  level: debug
  message: Started operation
timestamp: "2022-09-02T10:21:42.886203976+02:00"
type: logging

Loki event:

{
  "streams": [
    {
      "stream": {
        "app": "lxd",
        "type": "logging",
        "class": "task",
      },
      "values": [
        [
          "1662106902886203976",
          "description=\"Updating images\" operation=\"9007385a-4d34-46e3-a676-a4bc355319a5\" level=\"debug\" Started operation"
        ]
      ]
    }
  ]
}

Mmm, yeah, maybe a map[string]any then, so that the top-level keys become available for inspection rather than a struct. I was thinking that the fields used as labels would then be removed from the log message data so it’s not duplicated.

Having those fixed labels makes sense to me too.

I still don’t understand why we need to convert any log message. The way I envision this working is:

  1. Receive api.Event using our event server
  2. Decode api.Event.Metadata (it’s JSON encoded)
  3. Create labels from the metadata and context, and add the rest of the key-value pairs to the log message using key=\"value\" ...
  4. Create Loki event and push to server
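
For step 4, a minimal sketch of the push itself, assuming the labels and log line have already been built; the URL is a placeholder and batching, TLS and basic auth are left out:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// lokiStream and lokiPayload mirror the JSON body accepted by Loki's
// /loki/api/v1/push endpoint, as shown in the Loki event examples above.
type lokiStream struct {
	Stream map[string]string `json:"stream"`
	Values [][2]string       `json:"values"`
}

type lokiPayload struct {
	Streams []lokiStream `json:"streams"`
}

// pushToLoki sends a single log line with its labels to the Loki server.
func pushToLoki(url string, labels map[string]string, when time.Time, line string) error {
	payload := lokiPayload{
		Streams: []lokiStream{{
			Stream: labels,
			// Loki expects the timestamp as nanoseconds since the epoch, as a string.
			Values: [][2]string{{fmt.Sprintf("%d", when.UnixNano()), line}},
		}},
	}

	body, err := json.Marshal(payload)
	if err != nil {
		return err
	}

	resp, err := http.Post(url+"/loki/api/v1/push", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}

	return nil
}

func main() {
	labels := map[string]string{"app": "lxd", "type": "logging", "context-class": "task"}
	line := `context-description="Updating images" level="debug" Started operation`

	// Placeholder URL; a real implementation would also handle retries.
	err := pushToLoki("http://localhost:3100", labels, time.Now(), line)
	if err != nil {
		fmt.Println(err)
	}
}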

Absolutely. That’s what the example above also does.

OK, if you’re building the log message manually rather than passing it to an encoder, then it makes sense to not need a map (from which you can remove items) and instead just skip over the keys you’ve already added as labels.

Regarding the labels, it is possible that the context contains a key which is also in the metadata.

Fictional example:

location: none
metadata:
  context:
    class: task
    description: Updating images
    operation: 9007385a-4d34-46e3-a676-a4bc355319a5
    project: ""
    level: "foobar"
  level: debug
  message: Started operation
timestamp: "2022-09-02T10:21:42.886203976+02:00"
type: logging

How should we differentiate duplicate keys? My idea would be to use context-level (foobar) and level (debug).

Yep that makes sense to me to prefix them.

Is there a possibility to optionally offer a simple logging file?
Like:
lxc monitor --logging /var/log/events.log --loglevel ... --pretty --format ... --type ... --project ...

from where anyone could parse it with their own tool of choice?

Just in case someone doesn’t like to go through Grafana/Loki/Promtail?

Prometheus node_exporter can parse/import text files via --collector.textfile.directory. Also bash, sed, PHP, etc. can do a lot with that text file.
Right now I am fetching /1.0/events with a few lines of Node.js websocket code and, on each message, appending it to a simple text file as well as to an SQLite database.
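
For reference, roughly the same thing can be done with a few lines of the LXD Go client instead of a raw websocket; this is only a sketch, the file path is arbitrary and reconnection handling is omitted:

package main

import (
	"encoding/json"
	"os"

	lxd "github.com/lxc/lxd/client"
	"github.com/lxc/lxd/shared/api"
)

func main() {
	// Connect to the local LXD daemon and subscribe to its event stream.
	c, err := lxd.ConnectLXDUnix("", nil)
	if err != nil {
		panic(err)
	}

	listener, err := c.GetEvents()
	if err != nil {
		panic(err)
	}

	// Append every event as one JSON line to a plain text file.
	f, err := os.OpenFile("/var/log/lxd-events.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	_, err = listener.AddHandler(nil, func(event api.Event) {
		line, err := json.Marshal(event)
		if err != nil {
			return
		}

		_, _ = f.Write(append(line, '\n'))
	})
	if err != nil {
		panic(err)
	}

	// Block until the connection is lost; a real tool would reconnect here.
	_ = listener.Wait()
}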

Yeah, it’s something that could be added easily enough to lxc monitor, though we’d also need to have it reconnect on disconnection for it to be more reliable.