LXC Monitoring via SNMP

Monitoring LXC containers via SNMP shows nearly all assigned cores at 100% even though the container is not doing anything.
We set up a bunch of LXC containers, and all of them are in alert state with 100% CPU while the physical host itself has very low CPU utilization.
Any idea what needs to be configured to get correct values?
It seems to happen only on LXC - OpenVZ containers and KVM VMs always show correct values for their assigned CPUs.

Thanks,
cody

Hi!

SNMP is the transport and the collection is performed by an SNMP agent. Does the SNMP agent you are using know about containers?

It is a common issue with monitoring software that has not been updated to understand Linux containers. For example, there are many top utilities but none (AFAIK) understands Linux containers yet.

A package that understands Linux containers is netdata, at https://www.netdata.cloud/
You would use it to investigate an issue that arose after some alert. You install this on the host.

You will probably need to find an SNMP agent that understands Linux containers. You install such an agent on the host.

We are running Debian as OpenVZ and LXC containers. Running snmpwalk against an OpenVZ container gives correct values - running the same against an LXC container gives wrong values.
We evaluated LXC about three years ago and it was not usable at all, but we now finally have to move away from OpenVZ 6 and are just testing with LXC again.
OpenVZ guests are also containers, therefore I thought it’s maybe a configuration thing to get LXC reporting the correct CPU values - and it’s only the CPU values that we’ve found to be wrong so far. I don’t know how containers work in depth, but it seems they differ on this.

Thanks for mentioning netdata. I installed it in a VM to take a quick look at it, but it seems to be only a local host monitoring tool rather than a monitoring solution.

There are things which run inside the container and need to be monitored there. I don’t know how a monitoring solution would combine things monitored outside on the host and inside a container and graph that properly.

When CPU resources of LXC containers can only be monitored on the host it seems the only option is to move to KVM then as proper resource monitoring is a must.

Thanks,
cody

I’m not sure I am getting your meaning. Is it ‘as CPU resources…’? If yes, it’s wrong: nothing is stopping you from setting up snmpd inside a container. The whole idea behind LXD is to see containers as ‘small computers’, that is, system containers. The values you get will be more or less accurate depending on the LXC (or more aptly LXCFS) version that you are using, since the proper reporting of used resources is an ongoing project. I may be wrong because I’m not much interested in it, but I think that proper CPU resource accounting is only possible with a very recent (as in less than 3 months old) version.

As for monitoring Linux containers with SNMP on the host, I don’t think it’s possible. If you take a solution like PRTG for example, they don’t monitor Docker containers with SNMP, they have developed specific sensors using the Docker API. Taking a look at the netdata source code, it’s the same. Unfortunately SNMP is a legacy technology now.
But I’m pretty confident that anyone could interface a reasonably powerful monitoring solution with the LXD API. It seems easy with netdata or PRTG.

My view on LXC and LXD container monitoring is that it is more efficient to have a service on the host to grab generic info like CPU load, because it would be a single service that loops over the hundreds of containers and does not need to enter each one of them. That is how I think it is implemented in netdata.
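To illustrate the host-side idea, here is a minimal Python sketch of how such a service could turn a container’s cgroup CPU counter into a utilization figure. It assumes cgroup v2, where each cgroup has a cpu.stat file with a cumulative usage_usec line. The function names are made up for this example, and where a container’s cgroup lives under /sys/fs/cgroup depends on the LXC/LXD version, so the path is left as a parameter. This is not how netdata is actually implemented, just the general technique.

```python
import time


def parse_usage_usec(cpu_stat_text):
    """Extract the cumulative CPU time (microseconds) from cgroup v2 cpu.stat contents."""
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "usage_usec":
            return int(value)
    raise ValueError("no usage_usec line found")


def cgroup_cpu_percent(usage_before_usec, usage_after_usec, interval_seconds, cpu_count=1):
    """CPU utilization in percent of `cpu_count` cores over the sampling interval."""
    if interval_seconds <= 0 or cpu_count <= 0:
        raise ValueError("interval_seconds and cpu_count must be positive")
    used_usec = usage_after_usec - usage_before_usec
    return 100.0 * used_usec / (interval_seconds * 1_000_000 * cpu_count)


def sample_container(cpu_stat_path, interval_seconds=1.0, cpu_count=1):
    """Sample one container from the host.

    cpu_stat_path is an assumption the caller must supply, i.e. the cpu.stat
    file of the container's cgroup somewhere under /sys/fs/cgroup.
    """
    with open(cpu_stat_path) as f:
        before = parse_usage_usec(f.read())
    time.sleep(interval_seconds)
    with open(cpu_stat_path) as f:
        after = parse_usage_usec(f.read())
    return cgroup_cpu_percent(before, after, interval_seconds, cpu_count)
```

A host service would call sample_container() once per container in a loop, without ever entering the containers.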

Netdata gives you high visibility into a system by collecting a few thousand datapoints. The purpose is to use netdata to diagnose a system on demand (starts working as soon as you load the page), and not keep it running 24/7.

SNMP is indeed legacy. If you need to use it, I suggest looking for an agent that can work on the host to grab the generic data that are visible (and easily extractable) from the host.

That’s exactly what was never working - monitoring the CPU via SNMP inside the container returned wrong values. If that got fixed now then we might see it when it’s available in Debian.

Correct, getting that via SNMP on the host will never work as long as this information is not available to the SNMP agent. Simos mentioned above installing netdata on the host, and that it would ‘understand’ Linux containers. So if that’s the case, then netdata gets that information somehow via cgroups, API, …
But netdata seems to be a host-only monitoring tool and not a monitoring solution at that point.

Might be true for Windows, Docker, LXC, apps, … but almost every router, switch, printer, UPS, … has SNMP implemented for monitoring. SNMP might really be deprecated in the future, but I think that’s still far away…

It depends on your environment and what your requirements are to say that’s the right monitoring solution, but netdata seems to be far away from that.

That’s the issue - how would you combine the resource monitoring from the host with what you need to monitor inside the container?

SNMP might be legacy but actually the best option - we will see what will come in the future but our current requirements cannot be fulfilled with such options.

How would you identify issues in the past and prevent them from happening again when you are not collecting monitoring history 24/7?

Netdata does provide a way to store long-term data: https://docs.netdata.cloud/backends/ And you can’t predict the future, so at some point, unless you are deeply interested in hardware stacks and big data, it’s pointless to store too much data (I’d agree there is a certain retention period).

I think @simos is suggesting you augment your stack. Netdata can in fact collect data from SNMP devices: https://docs.netdata.cloud/collectors/node.d.plugin/snmp/#snmp-data-collector

Instead of deploying more SNMP on your network, you could use a more advanced technique which will give better insight into your services & devices.

I think that if you are willing to use snap, you can get it right now (I assume that the Debian kernel supports the features that LXCFS uses).

Well, no one is going to implement Redfish for legacy hardware, so SNMP will stay as it is for existing devices. OTOH more and more system designers will not bother with implementing SNMP for new devices if it’s too complex. That has been my experience in the past week trying to monitor RAID on servers only a few years old. There was no SNMP for this particular use, so I had to cook up a Redfish interface for PRTG (they don’t support Redfish).

I don’t get why you think monitoring history is pointless and produces too much data.
Maybe your definition of monitoring is different from ours…

In which way should it help us? It would collect the data exactly the way we do now and therefore have exactly the same issue.

And yes, we could use a more advanced technique if devices supported it and a solution existed for a large-scale environment - but that’s actually not the case.
Sorry, it’s nice to talk about future monitoring techniques but that’s going off-topic for our current requirements - it’s currently out of scope to run multiple monitoring tools to monitor specific devices or even parts of them. It may work in a small environment but not for us.

Never used snap, but it’s worth taking a look on a test system to see if it’s working - thanks for that hint.

That’s the point - we aren’t going to develop our own monitoring solution because it’s not our business.
If a monitoring solution does that in a proper way we are interested, but the actual state is that legacy SNMP is the most widely supported way of monitoring - currently.

If you want SNMP that much, I’m sure there are consulting firms that would be willing to give you a quote to develop an SNMP interface for LXD or Docker or whatever container solution you pick.
In the end that’s why SNMP is dying, nobody is willing to pay for it.

I’d like to add to this. lxcfs is what enables a container to have its own real values for the CPU load, load average, etc. There are several versions of lxcfs, and each newer version can properly virtualize more kernel metrics.

The latest version of lxcfs is 3.1.2 (see the announcement “LXCFS 3.1.2 has been released”). Among the new features is “Add support for per-container cpu usage in /proc/stat”, and more.

Therefore, the question of monitoring support in a container turns into the question “Which version of lxcfs am I using in LXC or LXD, so that I get properly virtualized values for my monitoring tool?”
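To make it concrete why this matters for an agent running inside the container: most Linux monitoring tools derive CPU utilization from two samples of the aggregate cpu line in /proc/stat. If lxcfs does not virtualize that file, the counters seen inside the container are not the container’s own, and the result is misleading. Below is a minimal Python sketch of that calculation. The function names are made up for this example, and I am not claiming this is the exact code path of any particular SNMP agent.

```python
import time


def parse_cpu_times(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def busy_percent(before, after):
    """Percentage of non-idle CPU time between two samples.

    Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    guest and guest_nice are already counted in user and nice, so only the
    first eight fields are summed for the total.
    """
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas[:8])
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)  # idle + iowait
    if total <= 0:
        return 0.0
    return 100.0 * (total - idle) / total


def sample_busy_percent(path="/proc/stat", interval_seconds=1.0):
    """Take two samples of `path` and return the busy percentage in between."""
    with open(path) as f:
        before = parse_cpu_times(f.read())
    time.sleep(interval_seconds)
    with open(path) as f:
        after = parse_cpu_times(f.read())
    return busy_percent(before, after)
```

Running sample_busy_percent() inside an idle container and on the host at the same time is a quick way to see whether the container gets its own values.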

I am running the snap package of LXD, channel stable. I got the latest version of lxcfs.

$ /snap/lxd/current/bin/lxcfs --version
3.1.2

Therefore, when testing with Debian, make sure you know which version of lxcfs you are actually using. And consult the documentation (release notes) on what is covered in that version, if it is not the latest.
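If you want to script that check, here is a small Python sketch. It assumes the output of lxcfs --version is a plain dotted number as shown above, and it uses 3.1.2 as the threshold only because that is the release mentioned here as adding per-container CPU usage in /proc/stat. The function names are made up for this example.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.1.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_per_container_proc_stat(version_text, minimum=(3, 1, 2)):
    """True if the given lxcfs version is at least `minimum`.

    The default threshold is an assumption based on the 3.1.2 release notes
    mentioned in this thread.
    """
    return parse_version(version_text) >= minimum
```

For example, has_per_container_proc_stat("3.1.2") is true while has_per_container_proc_stat("3.0.3") is false.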

To figure out which metrics are virtualized in practical terms, you can set up netdata in a container for testing purposes. Here is how to do it in LXD: https://blog.simos.info/how-to-setup-netdata-in-a-lxd-container-for-real-time-monitoring/ Note that the metrics that are not virtualized in your case will give you easily visible bogus values.
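A quicker, rougher check is to look at which files lxcfs has mounted over inside the container. The Python sketch below parses the contents of /proc/mounts and lists the mount points whose filesystem type is fuse.lxcfs. That type name is what I see on my systems, so treat it as an assumption, and the function name is made up for this example. Note that a file being mounted by lxcfs does not by itself tell you how well that version virtualizes it.

```python
def lxcfs_virtualized_paths(mounts_text):
    """Return the sorted mount points that are provided by lxcfs.

    Each line of /proc/mounts has the form:
    <device> <mount point> <fs type> <options> <dump> <pass>
    """
    paths = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "fuse.lxcfs":
            paths.append(fields[1])
    return sorted(paths)


def local_lxcfs_paths(mounts_path="/proc/mounts"):
    """Run the check against the mounts table of the current system."""
    with open(mounts_path) as f:
        return lxcfs_virtualized_paths(f.read())
```

If /proc/stat is missing from the result of local_lxcfs_paths() inside the container, CPU values read there are coming straight from the host kernel.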