[LXD] Network ACL logging

stgraber · January 25, 2022, 8:32pm


Project	LXD
Status	Implemented
Author(s)	@stgraber
Approver(s)	@tomp @sdeziel
Release	LXD 4.23
Internal ID	LX014

Abstract

Implement an API and CLI to retrieve the log entries from network ACLs.

Rationale

When using LXD with network ACLs combined with OVN, it’s possible to set the state of an ACL rule to logged, which then causes a log entry to be made whenever the rule is hit.

As it stands, there is no way for the user to access that log, so they end up having to ask an administrator to look at the OVN log on every system (if clustered).

This is far from ideal and we should have LXD handle the log parsing and aggregation, providing a user-readable log over the API.

Since LXD clusters are becoming more and more common, as is providing unprivileged access to LXD, it makes sense to expose such ACL log information to a user that is allowed to add ACLs but isn’t otherwise allowed access to the servers that run OVN (or to its logs).

Specification

Design

This one is reasonably straightforward. We need an API endpoint on each ACL which will cause /var/log/ovn-controller.log to be parsed on every system in the cluster. The log will be scanned for any hit for the particular ACL based on its database ID, and the matching records will be returned.

This parsing step will also re-format the log entry to something more readable and standardize the timestamps (should they be in different timezones).

The server handling the user request will aggregate the data and sort it based on timestamp before returning it as plain-text data to the user.
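As a rough illustration of the aggregation step, here is a minimal Python sketch. It assumes each cluster member has already returned its matching entries as (timestamp, message) pairs. The function names and the input shape are made up for the example and are not LXD's actual code.

```python
from datetime import datetime, timezone


def to_utc(stamp):
    """Parse an ISO 8601 timestamp and normalize it to UTC."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    parsed = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc)


def aggregate(per_server):
    """Merge entries from every server and sort them by timestamp.

    per_server maps a (hypothetical) server name to a list of
    (timestamp string, message) tuples.
    """
    merged = []
    for entries in per_server.values():
        for stamp, message in entries:
            merged.append((to_utc(stamp), message))

    # Sort on the normalized timestamp only.
    merged.sort(key=lambda item: item[0])

    return [
        f"{ts.isoformat(timespec='milliseconds')} {message}"
        for ts, message in merged
    ]
```

Normalizing to UTC before sorting is what keeps entries from servers in different timezones in the right order.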

API changes

For this one, we should just need one extra API route.

GET /1.0/network-acls/NAME/log

Accessing that endpoint will trigger the log aggregation and return the resulting data to the user as plain text.
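For illustration, a client could hit that route over the local unix socket roughly like this. This is only a sketch: the helper names are invented, and the default socket path is an assumption that depends on how LXD is installed (snap installs use a different location).

```python
import http.client
import socket
from urllib.parse import quote


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a unix socket instead of TCP."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def acl_log_path(acl_name):
    """Build the per-ACL log route proposed in this spec."""
    return "/1.0/network-acls/" + quote(acl_name, safe="") + "/log"


def fetch_acl_log(acl_name, socket_path="/var/lib/lxd/unix.socket"):
    """Return the plain-text log for one ACL (socket path is an assumption)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", acl_log_path(acl_name))
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()
```

Calling fetch_acl_log("web") would request /1.0/network-acls/web/log and return the body as text.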

CLI changes

I think this deserves its own sub-command, so I’d introduce a:

lxc network acl show-log <ACL>

Database changes

None required, we’ll parse things on demand.

Upgrade handling

Not applicable. Worth noting though that this will work on historical data as LXD isn’t the one doing the log collection, OVN is.

Further information

None at this time.

stgraber · January 25, 2022, 8:35pm

Example data from OVN:

2022-01-25T20:19:54.840Z|00303|acl_log(ovn_pinctrl0)|INFO|name="lxd_acl7-ingress-12", verdict=reject, severity=info: tcp,vlan_tci=0x0000,dl_src=00:16:3e:38:dd:28,dl_dst=00:16:3e:81:f8:d6,nw_src=168.138.93.66,nw_dst=45.45.148.3,nw_tos=72,nw_ecn=0,nw_ttl=55,tp_src=48374,tp_dst=22,tcp_flags=syn
2022-01-25T20:20:24.487Z|00304|acl_log(ovn_pinctrl0)|INFO|name="lxd_acl7-ingress-12", verdict=reject, severity=info: tcp6,vlan_tci=0x0000,dl_src=00:16:3e:38:dd:28,dl_dst=00:16:3e:81:f8:d6,ipv6_src=2603:c023:4002:3801::1000,ipv6_dst=2602:fc62:a:1::3,ipv6_label=0x99424,nw_tos=72,nw_ecn=0,nw_ttl=56,tp_src=50860,tp_dst=22,tcp_flags=syn

Our initial implementation would only retain:

Timestamp
IPv4/IPv6 source address
IPv4/IPv6 destination address
Protocol
Source port
Destination port
Action

Most likely re-ordering things to be in that order too.
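To make the field selection concrete, here is a small Python sketch that pulls those fields out of a raw OVN line like the two above. The function name and the output key names are illustrative only, and the parsing is based purely on the two sample lines, so other OVN log shapes may need more care.

```python
def parse_ovn_acl_line(line):
    """Extract the retained fields from one ovn-controller ACL log line.

    Returns None for lines that are not ACL log entries.
    """
    parts = line.split("|", 4)
    if len(parts) != 5 or not parts[2].startswith("acl_log"):
        return None
    stamp, _seq, _module, _level, message = parts

    # 'name="...", verdict=..., severity=...: proto,key=value,...'
    header, _, flow = message.partition(": ")
    meta = {}
    for item in header.split(", "):
        key, _, value = item.partition("=")
        meta[key] = value.strip('"')

    protocol, *pairs = flow.split(",")
    fields = dict(pair.split("=", 1) for pair in pairs if "=" in pair)

    return {
        "timestamp": stamp,
        # IPv4 entries use nw_*, IPv6 entries use ipv6_*.
        "source": fields.get("nw_src") or fields.get("ipv6_src"),
        "destination": fields.get("nw_dst") or fields.get("ipv6_dst"),
        "protocol": protocol,
        "source_port": fields.get("tp_src"),
        "destination_port": fields.get("tp_dst"),
        "action": meta.get("verdict"),
    }
```

Everything else in the line (MAC addresses, TTL, TCP flags and so on) is simply dropped.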

stgraber · January 25, 2022, 8:35pm

@tomp @sdeziel ready for review

sdeziel · January 25, 2022, 9:48pm

Would it be feasible to have a tail-like behavior? Maybe through a lxc network acl monitor-log <ACL>?

sdeziel · January 25, 2022, 9:52pm

Being able to see all log events (i.e: lxc network acl monitor-logs) would be nice IMHO.

stgraber · January 25, 2022, 10:12pm

It’s something we could maybe add later but it would effectively cause every server to start monitoring their log files, parse and stream everything to the client which can be somewhat resource intensive.

The current suggested approach makes it easier to throttle and/or cache data should we end up with load issues on this API.

sdeziel · January 25, 2022, 10:14pm

ACK, thanks.

sdeziel · January 25, 2022, 10:15pm

In the config, the verdict is defined by the action so maybe it would be better to use action for consistency?

stgraber · January 25, 2022, 10:18pm

My initial hope was to have a way to do this at the network level, so you could do lxc network show-log default and get the log entries for all instances using that network, regardless of ACLs.

What I discovered however is that it’s not how things work in OVN.
The log entries are attached to the specific ACL with no reference to the network, all I get is the lxd_aclX-ingress-Y where X is the ACL ID and Y is the rule number in the ACL.
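A quick sketch of how that name could be picked apart to filter entries by ACL ID. The regular expression and function name are just for illustration, and the "egress" alternative is an assumption by analogy, since only the ingress form shows up in the sample log.

```python
import re

# lxd_acl<ACL ID>-<direction>-<rule number>
RULE_NAME = re.compile(r"^lxd_acl(\d+)-(ingress|egress)-(\d+)$")


def parse_rule_name(name):
    """Return (acl_id, direction, rule_number), or None if it does not match."""
    match = RULE_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), int(match.group(3))
```

With the sample data, parse_rule_name("lxd_acl7-ingress-12") gives ACL ID 7 and rule number 12, which is all there is to filter on.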

So without a way to filter by network, the only API that really makes sense is to expose it per ACL.

In theory we could do a bulk retrieval API, but there we’re hitting a bit of a REST issue. I can’t make a /1.0/network-acls/log endpoint as it would clash with an ACL called log. We don’t have that problem with the per-ACL route though.
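The clash is easy to see with a toy route matcher. This is a simplified stand-in written for the example, not LXD's actual router, and the "{name}" placeholder syntax is made up.

```python
def match(pattern, path):
    """Match a path against a pattern with {placeholder} segments.

    Returns the captured parameters, or None if the path does not match.
    """
    pattern_parts = pattern.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(pattern_parts) != len(path_parts):
        return None

    params = {}
    for expected, actual in zip(pattern_parts, path_parts):
        if expected.startswith("{") and expected.endswith("}"):
            params[expected[1:-1]] = actual
        elif expected != actual:
            return None
    return params


# A bulk /1.0/network-acls/log URL is indistinguishable from an ACL named "log".
bulk_clash = match("/1.0/network-acls/{name}", "/1.0/network-acls/log")

# The per-ACL route has an extra segment, so it is unambiguous.
per_acl = match("/1.0/network-acls/{name}/log", "/1.0/network-acls/web/log")
```

Here bulk_clash captures "log" as the ACL name, while per_acl captures "web" with no ambiguity.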

stgraber · January 25, 2022, 10:19pm

Indeed, let’s make it action for consistency. Verdict is the OVN terminology in the raw log.

sdeziel · January 25, 2022, 10:24pm

In fact, I’d align them all with the LXD terminology:

timestamp
source
destination
protocol
source_port
destination_port
action

sdeziel · January 25, 2022, 10:25pm

Is the 00303 in there a hit counter? If yes, I’d expose it in the translated/relayed logs.

stgraber · January 25, 2022, 10:31pm

No, it feels like a global counter of some kind. That rule only got hit 5-6 times during my test, so if it’s a counter, it’s not a correct one.

turtle0x1 · January 25, 2022, 10:32pm

Sort of in line with this, does any/all of this belong in /metrics?

Another API means devs have to hit 2 or 3 APIs to get a complete picture, requiring more specialist software for each scenario (is MAAS or Juju or whatever in line with these goals?)

stgraber · January 25, 2022, 10:32pm

The result will be a plaintext log so the exact representation is still a bit TBD here.
I may mostly keep the same structure as the OVN log but indeed align with the LXD terminology when possible and remove all the fields that don’t make sense to expose to the user (or which could leak data).

stgraber · January 25, 2022, 10:36pm

/1.0/metrics exposes counters in the Prometheus format.

If OVN would keep global hit counters, then we could expose those through /1.0/metrics but it’s not the case here nor can we build that data as we have no idea when a given log file may get rotated.

Logs (access, audit, …) are pretty often treated separately from metrics, using separate systems to aggregate and search them. In this case the focus is really on giving a potentially unprivileged user of a LXD cluster access to the specific logging data for the objects under their control.

So it’s pretty similar to lxc info --show-log or lxc console --show-log in that regard.

When monitoring a LXD deployment, you wouldn’t use that API, you’d directly capture the entire OVN log into your centralized logging system and access it through there since as the administrator for the deployment, it’s fine for you to get to see all of it.

turtle0x1 · January 25, 2022, 10:47pm

Ah okay I see. Isn’t it a pointless API if it’s parsed at call time with no prior events logged (i.e. hit the endpoint at 00:01 but the logs rotated at 00:00, potentially showing no violations)? What scenario can safely rely on this if you aren’t ingesting the data? What’s the default rotation period? Why would I waste time looking at this API vs “directly capture the entire OVN log”?

stgraber · January 25, 2022, 10:55pm

You’d use that API if you are not a privileged user with access to the servers.

That’s pretty common with LXD clusters run for teams or companies where each individual or team gets a LXD project where they can create instances, profiles, images, networks, network acls, … all on their own.

Those users can still configure ACLs to log but short of having this API, they have no way of getting the data out.

And yeah, the logs can definitely be (and most likely will be) rotated at some point. We don’t know when and can’t tell if they were, so there’s not a whole lot LXD can do about it other than report what’s available at the time.

The most common use case for this is when you just put some new ACLs in place and either are trying to figure out why something isn’t working or want to look at traffic you may have forgotten to allow.

The way OVN logging works also makes it pretty much unsuitable for auditing purposes, as there are no guarantees that every hit will be logged. If OVN gets hammered, log entries will be dropped as it’s a low priority thing.

turtle0x1 · January 25, 2022, 11:10pm

Having this explained, it makes more sense. Can I suggest adding a “real world use case” to the “rationale” section? I see “logging” and think “yum, data for long term storage” but it’s more like “audit events in last N minutes”.

Well that is just modern software in a nutshell

stgraber · January 25, 2022, 11:12pm

Expanded the rationale a bit.