[LXD] Stream lifecycle and log events to Loki

tomp · September 7, 2022, 5:26pm

On that reconnection point, will lxd be buffering any log messages if the connection to Loki drops temporarily?

stgraber · September 7, 2022, 5:28pm

Well, that part was for the lxc monitor case mentioned above.

For the LXD to Loki case, as Loki will be an internal log handler, we can have the handler buffer or block on the reconnection.

kamzar1 · September 7, 2022, 5:30pm

Would help a lot, as people have different tools of data gathering and processing.
Just to mention, I have a log rotation for that simple text file, later further processed by selecting projects, instances, event type …
I see the reconnection issue as well. In my case nodemon/pm2 keep the Node websocket client up, so it reconnects if it gets terminated.
But as for LXD itself, I have no idea how to achieve that.

tomp · September 7, 2022, 5:35pm

Ok I’ll keep an eye out for that in the implementation. Thanks

monstermunchkin · September 9, 2022, 12:54pm

@stgraber and @tomp is the spec OK? If so, we can mark it as approved.

tomp · September 9, 2022, 12:57pm

Are loki.api.cert and loki.api.key files? If so can we name them loki.api.cert_file and loki.api.key_file? For consistency with cephobject.radosgw.endpoint_cert_file.

tomp · September 9, 2022, 12:59pm

Will they also be removed from the resulting log message?

monstermunchkin · September 9, 2022, 1:01pm

Actually, both keys are strings not file paths.

Yes, they will. I just clarified this in the spec.

tomp · September 9, 2022, 1:03pm

I wonder if they should be files. Do we have a precedent of storing certs/keys in the database vs files?

Also is this cert/key a per-cluster-member key or a global key?

tomp · September 9, 2022, 1:07pm

As I am not familiar with the Loki protocol, it would be great to describe it at a high level.
Some questions I am thinking about are:

Is it a persistent TCP connection, or opened per event (doubtful but worth checking)?
Assuming it's persistent, how will we detect losing a connection (does it support TCP keepalives)?
How will we deal with re-connections? Especially if multiple events are coming through that need to be delivered?
In the case that a connection is closed, how long/how many events will we buffer to redeliver before dropping them?

monstermunchkin · September 9, 2022, 1:18pm

We have private and public keys for rbac: rbac.agent.private_key and rbac.agent.public_key.

All config keys are cluster-wide (lxd/cluster/config/config.go).

No, it’s a REST API, so we call <host>/loki/api/v1/push for each event.

Not persistent.

Each event will cause a POST to the aforementioned endpoint. If the host cannot be reached for whatever reason, we could just retry every X seconds, and discard the event after Y seconds.

See answer above.

tomp · September 9, 2022, 1:22pm

OK makes sense, so for cluster-wide config we store the cert/key contents directly in config keys (which avoids the need to replicate the config files onto each cluster member manually). Cool.

tomp · September 9, 2022, 1:48pm

This surprised me.

It sounds like it wouldn’t perform well, and if we were sending lots of events concurrently we would end up opening many connections to the Loki server, potentially overwhelming it.

So I looked at some of the official clients for Loki and came across Promtail (which is a standalone command rather than a package).

However inside it is a client package we could potentially use:

https://pkg.go.dev/github.com/grafana/loki/pkg/promtail/client

But aside from potentially being able to use it, I was interested in seeing how it managed connections to the Loki server(s).

We can see that the New() function returns a client that internally has a single goroutine that handles entries from:

grafana/loki/blob/v1.6.1/pkg/promtail/client/client.go#L313-L332


      
          func (c *client) Handle(ls model.LabelSet, t time.Time, s string) error {
          	if len(c.externalLabels) > 0 {
          		ls = c.externalLabels.Merge(ls)
          	}
          
          	// Get the tenant  ID in case it has been overridden while processing
          	// the pipeline stages, then remove the special label
          	tenantID := c.getTenantID(ls)
          	if _, ok := ls[ReservedLabelTenantID]; ok {
          		// Clone the label set to not manipulate the input one
          		ls = ls.Clone()
          		delete(ls, ReservedLabelTenantID)
          	}
          
          	c.entries <- entry{tenantID, ls, logproto.Entry{
          		Timestamp: t,
          		Line:      s,
          	}}
          	return nil
          }

and batches them up

grafana/loki/blob/main/clients/pkg/promtail/client/client.go#L259-L297


      
          	// Initialize counters to 0 so the metrics are exported before the first
          	// occurrence of incrementing to avoid missing metrics.
          	for _, counter := range c.metrics.countersWithHostTenantReason {
          		for _, reason := range Reasons {
          			counter.WithLabelValues(c.cfg.URL.Host, tenantID, reason).Add(0)
          		}
          	}
          
          	for _, counter := range c.metrics.countersWithHostTenant {
          		counter.WithLabelValues(c.cfg.URL.Host, tenantID).Add(0)
          	}
          }
          
          func (c *client) run() {
          	batches := map[string]*batch{}
          
          	// Given the client handles multiple batches (1 per tenant) and each batch
          	// can be created at a different point in time, we look for batches whose
          	// max wait time has been reached every 10 times per BatchWait, so that the
          	// maximum delay we have sending batches is 10% of the max waiting time.


It also has the concept of retries with backoff delays too.

So I suspect we should be doing something similar, if not using this client package directly.

monstermunchkin · September 13, 2022, 1:51pm

@tomp I had a look at the client, and we’ll be doing something similar. But that needn’t be mentioned in the spec as that’s specific to the implementation.

tomp · September 13, 2022, 1:52pm

OK thanks. In that case the spec looks good to me.

monstermunchkin · September 13, 2022, 2:02pm

@stgraber is there anything I should add, or can I mark the spec as approved?

stgraber · September 13, 2022, 11:37pm

I think it’s fine.

monstermunchkin · September 14, 2022, 8:16am

Regarding authentication, Loki itself doesn’t do authentication. Instead, they suggest using a reverse proxy. The way the spec is written now, we only support mTLS or no authentication. Should we also add support for basic authentication?

If so, we might want to consider using the following keys:

loki.auth.type (takes "" (none), "mtls", "basic")
loki.auth.cert
loki.auth.key
loki.auth.ca_cert
loki.auth.username
loki.auth.password

stgraber · September 14, 2022, 10:16am

Yeah, I suspect we should probably start with just basic auth, that would make things a bit cleaner and that’s likely what most folks will do as TLS based auth is annoying to set up in something like nginx.

Even if we end up supporting TLS based auth, we wouldn’t need/want the type one, as it’d technically be possible to do both. We should just set basic auth if provided and TLS auth if provided; if both are provided, then do both.

Anyway, for now, I think we can drop the certificate ones and stick with just username and password.

I’d probably do:

loki.api.url => URL to Loki endpoint
loki.api.ca_cert => If provided, CA cert for server
loki.auth.username => username for basic auth
loki.auth.password => password for basic auth
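
As an illustration of how these four keys might map onto an HTTP client, here is a hedged Go sketch. The `lokiConfig` struct, `newHTTPClient`, and `newPushRequest` are hypothetical names, and the sketch assumes `loki.api.url` holds the base URL (scheme, host, port) and `loki.api.ca_cert` holds PEM content rather than a file path.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"io"
	"net/http"
)

// lokiConfig holds the values of the four proposed config keys.
type lokiConfig struct {
	URL      string // loki.api.url (assumed to be the base URL)
	CACert   string // loki.api.ca_cert (assumed PEM content, optional)
	Username string // loki.auth.username (optional)
	Password string // loki.auth.password (optional)
}

// newHTTPClient trusts the given CA for the server certificate if one is
// provided, and otherwise falls back to the system trust store.
func newHTTPClient(cfg lokiConfig) (*http.Client, error) {
	transport := &http.Transport{}

	if cfg.CACert != "" {
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM([]byte(cfg.CACert)) {
			return nil, errors.New("loki.api.ca_cert does not contain a valid PEM certificate")
		}
		transport.TLSClientConfig = &tls.Config{RootCAs: pool}
	}

	return &http.Client{Transport: transport}, nil
}

// newPushRequest builds a POST to the push endpoint and adds basic auth
// only when a username has been configured.
func newPushRequest(cfg lokiConfig, body io.Reader) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, cfg.URL+"/loki/api/v1/push", body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")

	if cfg.Username != "" {
		req.SetBasicAuth(cfg.Username, cfg.Password)
	}

	return req, nil
}
```

Keeping the CA handling on the transport and the credentials on the request means the two settings stay independent, so a client certificate could later be added to the same `tls.Config` without needing a separate "type" key.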