[LXD] Network ACL logging

Project: LXD
Status: Implemented
Author(s): @stgraber
Approver(s): @tomp @sdeziel
Release: LXD 4.23
Internal ID: LX014

Abstract

Implement an API and CLI to retrieve the log entries from network ACLs.

Rationale

When using LXD with network ACLs combined with OVN, it’s possible to set the state of an ACL rule to logged, which then causes a log entry to be written whenever the rule is hit.

As it stands, there is no way for the user to access that log, so they end up having to ask an administrator to look at the OVN log on every system (if clustered).

Since LXD clusters are becoming more and more common and so is providing unprivileged access to LXD, it makes sense to expose such ACL log information to a user that is allowed to add ACLs but isn’t otherwise allowed access to the servers that run OVN (or to its logs).

Having to go through an administrator is far from ideal. LXD should handle the log parsing and aggregation itself, providing a user-readable log over the API.

Specification

Design

This one is reasonably straightforward. We need an API endpoint on each ACL which will cause /var/log/ovn-controller.log to be parsed on every system in the cluster. The log will be scanned for any hit on that particular ACL, based on its database ID, and the matching records will be returned.

This parsing step will also re-format the log entry into something more readable and standardize the timestamps (should they be in different timezones).
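As a rough illustration of the timestamp standardization step, here is a minimal Python sketch that converts an ISO 8601 timestamp with any UTC offset into the UTC form seen in the OVN sample further down. The function name `to_utc` is hypothetical and this is not LXD's actual code.

```python
from datetime import datetime, timezone


def to_utc(stamp: str) -> str:
    """Convert an ISO 8601 timestamp with any offset to a UTC string.

    Hypothetical helper for illustration only.
    """
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so map it to an explicit +00:00 offset first.
    parsed = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    # Render with millisecond precision and a "Z" suffix.
    utc = parsed.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
```

With every entry expressed in UTC in this fixed-width form, a plain string comparison orders entries chronologically.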

The server handling the user request will aggregate the data and sort it based on timestamp before returning it as plain-text data to the user.
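The aggregation step could look something like the sketch below. It assumes each cluster member has already returned its matching entries as `(timestamp, line)` tuples with UTC timestamps in a sortable string form. The `aggregate` function and the member names are made up for the example.

```python
import heapq


def aggregate(per_member):
    """Merge per-member (timestamp, line) entries into one plain-text log.

    `per_member` maps a (hypothetical) cluster member name to its entries.
    """
    # heapq.merge() needs each input to be sorted already, so sort
    # every member's entries by timestamp first.
    streams = [sorted(entries) for entries in per_member.values()]
    merged = heapq.merge(*streams)
    return "\n".join(line for _, line in merged)
```

For example, `aggregate({"node1": [("2022-01-25T20:20:24.487Z", "b")], "node2": [("2022-01-25T20:19:54.840Z", "a")]})` places the `node2` entry first because its timestamp is earlier.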

API changes

For this one, we should just need one extra API route.

GET /1.0/network-acls/NAME/log

Accessing that endpoint will trigger the log aggregation and send the resulting data to the user as plain text.
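For illustration, a client could fetch that route over the local LXD unix socket roughly as follows. This is a sketch under assumptions: the socket path shown is the usual snap location and may differ on your system, and the helper names (`UnixHTTPConnection`, `acl_log_path`, `fetch_acl_log`) are invented for this example.

```python
import http.client
import socket
from urllib.parse import quote


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a unix socket instead of TCP."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def acl_log_path(name):
    """Build the per-ACL log route described above."""
    return f"/1.0/network-acls/{quote(name, safe='')}/log"


def fetch_acl_log(name, socket_path="/var/snap/lxd/common/lxd/unix.socket"):
    """Return the plain-text log for one ACL (socket path is an assumption)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", acl_log_path(name))
        resp = conn.getresponse()
        if resp.status != 200:
            raise RuntimeError(f"unexpected status {resp.status}")
        return resp.read().decode("utf-8", errors="replace")
    finally:
        conn.close()
```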

CLI changes

I think this deserves its own sub-command, so I’d introduce a:

lxc network acl show-log <ACL>

Database changes

None required, we’ll parse things on demand.

Upgrade handling

Not applicable. Worth noting though that this will work on historical data as LXD isn’t the one doing the log collection, OVN is.

Further information

None at this time.


Example data from OVN:

2022-01-25T20:19:54.840Z|00303|acl_log(ovn_pinctrl0)|INFO|name="lxd_acl7-ingress-12", verdict=reject, severity=info: tcp,vlan_tci=0x0000,dl_src=00:16:3e:38:dd:28,dl_dst=00:16:3e:81:f8:d6,nw_src=168.138.93.66,nw_dst=45.45.148.3,nw_tos=72,nw_ecn=0,nw_ttl=55,tp_src=48374,tp_dst=22,tcp_flags=syn
2022-01-25T20:20:24.487Z|00304|acl_log(ovn_pinctrl0)|INFO|name="lxd_acl7-ingress-12", verdict=reject, severity=info: tcp6,vlan_tci=0x0000,dl_src=00:16:3e:38:dd:28,dl_dst=00:16:3e:81:f8:d6,ipv6_src=2603:c023:4002:3801::1000,ipv6_dst=2602:fc62:a:1::3,ipv6_label=0x99424,nw_tos=72,nw_ecn=0,nw_ttl=56,tp_src=50860,tp_dst=22,tcp_flags=syn

Our initial implementation would only retain:

  • Timestamp
  • IPv4/IPv6 source address
  • IPv4/IPv6 destination address
  • Protocol
  • Source port
  • Destination port
  • Action

Most likely re-ordering things to be in that order too.
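To make the field selection concrete, here is a minimal parsing sketch based only on the two sample lines above. The regular expression and the `parse_line` name are assumptions for illustration. Real OVN output may contain variations this does not handle, and the actual LXD parser may work differently.

```python
import re

# Matches the prefix seen in the sample lines: timestamp, sequence number,
# module, level, then the ACL name, verdict and severity.
LINE_RE = re.compile(
    r'^(?P<timestamp>[^|]+)\|\d+\|acl_log\([^)]*\)\|[A-Z]+\|'
    r'name="(?P<name>[^"]+)", verdict=(?P<verdict>\w+), severity=\w+: '
    r'(?P<flow>.*)$'
)


def parse_line(line):
    """Return the retained fields as a dict, or None if the line doesn't match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # The flow description starts with the protocol, followed by key=value pairs.
    protocol, _, rest = match.group("flow").partition(",")
    fields = dict(kv.split("=", 1) for kv in rest.split(",") if "=" in kv)
    return {
        "timestamp": match.group("timestamp"),
        "source": fields.get("nw_src") or fields.get("ipv6_src"),
        "destination": fields.get("nw_dst") or fields.get("ipv6_dst"),
        "protocol": protocol,
        "source_port": fields.get("tp_src"),
        "destination_port": fields.get("tp_dst"),
        "action": match.group("verdict"),
    }
```

Lines that do not look like ACL log entries return `None`, so the caller can skip the rest of the OVN controller log.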


@tomp @sdeziel ready for review

Would it be feasible to have a tail-like behavior? Maybe through a lxc network acl monitor-log <ACL>?

Being able to see all log events (i.e: lxc network acl monitor-logs) would be nice IMHO.

It’s something we could maybe add later, but it would effectively cause every server to start monitoring its log files, then parse and stream everything to the client, which can be somewhat resource intensive.

The current suggested approach makes it easier to throttle and/or cache data should we end up with load issues on this API.

ACK, thanks.

In the config, the verdict is defined by the action so maybe it would be better to use action for consistency?

My initial hope was to have a way to do this at the network level, so you could do lxc network show-log default and get the log entries for all instances using that network, regardless of ACLs.

What I discovered however is that it’s not how things work in OVN.
The log entries are attached to the specific ACL with no reference to the network. All I get is the lxd_aclX-ingress-Y name, where X is the ACL ID and Y is the rule number in the ACL.

So without a way to filter by network, the only API that really makes sense is to expose it per ACL.
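The lxd_aclX-ingress-Y naming described above can be split apart with a small regular expression, which is what allows filtering by ACL ID. In this sketch the `split_rule_name` helper is hypothetical, and accepting `egress` alongside `ingress` is an assumption, since only ingress names appear in the sample data.

```python
import re

# "egress" is assumed by analogy with "ingress"; only the latter
# appears in the sample log lines.
NAME_RE = re.compile(r"^lxd_acl(\d+)-(ingress|egress)-(\d+)$")


def split_rule_name(name):
    """Return (acl_id, direction, rule_number), or None if the name doesn't match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), int(match.group(3))
```

Filtering for a single ACL then amounts to comparing the first element of the tuple with that ACL's database ID.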

In theory we could do a bulk retrieval API, but there we’re hitting a bit of a REST issue. I can’t make a /1.0/network-acls/log endpoint as it would clash with an ACL called log. We don’t have that problem with the per-ACL route though.

Indeed, let’s make it action for consistency. Verdict is the OVN terminology in the raw log.

In fact, I’d align them all with the LXD terminology:

  • timestamp
  • source
  • destination
  • protocol
  • source_port
  • destination_port
  • action
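One possible way to render an entry with those field names, in that order, is sketched below. The `key=value` layout is purely an assumption (the final plain-text representation is not settled here), and `render_entry` is a hypothetical name.

```python
# Field names and order follow the list above.
FIELD_ORDER = (
    "timestamp",
    "source",
    "destination",
    "protocol",
    "source_port",
    "destination_port",
    "action",
)


def render_entry(entry):
    """Render one parsed entry as a single space-separated key=value line."""
    # Skip fields that are missing, e.g. ports for protocols without them.
    return " ".join(
        f"{key}={entry[key]}" for key in FIELD_ORDER if entry.get(key) is not None
    )
```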

Is the 00303 in there a hit counter? If yes, I’d expose it in the translated/relayed logs.

No, feels like a global counter of some kind, that rule only got hit 5-6 times during my test, so if it’s a counter, it’s not a correct one :wink:

Sort of in line with this, does any/all of this belong in /metrics?

Another API makes it something like 2 or 3 APIs that devs have to hit to get a complete picture, requiring more specialist software for each scenario (is MAAS or Juju or whatever in line with these goals?)

The result will be a plaintext log so the exact representation is still a bit TBD here.
I may mostly keep the same structure as the OVN log but indeed align with the LXD terminology when possible and remove all the fields that don’t make sense to expose to the user (or which could leak data).

/1.0/metrics exposes counters in the Prometheus format.

If OVN kept global hit counters, then we could expose those through /1.0/metrics, but that’s not the case here, nor can we build that data ourselves as we have no idea when a given log file may get rotated.

Logs (access, audit, …) are pretty often treated separately from metrics, using separate systems to aggregate and search them. In this case the focus is really on giving a potentially unprivileged user of a LXD cluster access to the specific logging data for the objects under their control.

So it’s pretty similar to lxc info --show-log or lxc console --show-log in that regard.

When monitoring a LXD deployment, you wouldn’t use that API, you’d directly capture the entire OVN log into your centralized logging system and access it through there since as the administrator for the deployment, it’s fine for you to get to see all of it.

Ah okay I see, isn’t it a pointless API if it’s parsed at call time with no prior events logged (i.e. hit the endpoint at 00:01 but the logs rotated at 00:00, potentially showing no violations)? What scenario can safely rely on this if you aren’t ingesting the data? What’s the default rotation period? Why would I waste time looking at this API vs “directly capture the entire OVN log”?

You’d use that API if you are not a privileged user with access to the servers.

That’s pretty common with LXD clusters run for teams or companies where each individual or team gets a LXD project where they can create instances, profiles, images, networks, network acls, … all on their own.

Those users can still configure ACLs to log but short of having this API, they have no way of getting the data out.

And yeah, the logs can definitely be (and most likely will be) rotated at some point. We don’t know when and can’t tell whether they were, so there’s not a whole lot LXD can do about it other than report what’s available at the time.

The most common use case for this is when you just put some new ACLs in place and either are trying to figure out why something isn’t working or want to look at traffic you may have forgotten to allow.

OVN logging itself is pretty much unsuitable for auditing purposes as there are no guarantees that every hit will be logged. If OVN gets hammered, log entries will be dropped as it’s a low priority thing.

With this explained, it makes more sense. Can I suggest a “real world use case” in the “rationale” section? I see “logging” and think “yum, data for long term storage” but it’s more like “audit events in last N minutes”.

Well that is just modern software in a nutshell

Expanded the rationale a bit.