Looking to see if any headway has been made to provide per-container stats using LXD. Given the new 3.0 clustering option, we can now pool lots of physical servers into a large cluster potentially running hundreds/thousands of containers. As a result, tracking usage for so many containers is essential to managing large clusters.
We still don’t have anything more than what’s available in lxc info NAME. Recording historical values is tricky because the kernel APIs to retrieve them are very expensive; recording the data can cause more load than the container itself.
There is a third-party tool called Sysdig that can collect aggregate data and understands containers.
Sysdig installs a kernel module (through DKMS) and uses it to get visibility into the running processes.
Thanks for the sysdig info. I will check it out.
As you guys can imagine, operational maintenance and life-cycle management of large-scale clusters will become a hot topic in the future. This is why tools that can gather container stats to identify misbehaving containers (an “lxd-top”) are so important.
As we know, it is easy to spin up a few LXD nodes and start running workloads. The problems start when you run lots of workloads and users start complaining of performance issues.
Maybe these tools already exist and I just have not seen them yet?
My understanding is that sysdig is meant to do exactly that and they have a kernel module to grab the needed information without all the overhead that LXD would otherwise have when interacting with a clean mainline kernel.
You can create a plugin for Glances using the LXD Python API, for example by extending the existing Docker monitoring code. For details, take a look at http://glances.readthedocs.io/en/stable/aoa/docker.html
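To illustrate the idea, here is a minimal sketch of what such a collector could look like. The flattening helper works on the state document that LXD’s REST API returns (GET /1.0/containers/NAME/state, the same data lxc info shows); the pylxd calls in collect_all (Client, containers.all(), state()) are assumptions based on pylxd’s documented interface, and running it requires a live LXD daemon.

```python
def summarize_state(name, state):
    """Reduce an LXD container state mapping to a flat metrics dict.

    `state` mirrors the nested structure of the REST API's state
    document: a "memory" dict with a "usage" byte counter, a "cpu"
    dict with a cumulative "usage" in nanoseconds, and a process count.
    """
    return {
        "name": name,
        "memory_bytes": state.get("memory", {}).get("usage", 0),
        "cpu_ns": state.get("cpu", {}).get("usage", 0),
        "processes": state.get("processes", 0),
    }


def collect_all():
    # Requires a running LXD daemon and the third-party pylxd package;
    # the attribute names below mirror the REST API's state document
    # and are assumptions here, not verified against every pylxd version.
    from pylxd import Client
    client = Client()
    rows = []
    for c in client.containers.all():
        st = c.state()
        rows.append(summarize_state(c.name, {
            "memory": st.memory,
            "cpu": st.cpu,
            "processes": st.processes,
        }))
    return rows
```

A Glances plugin would then only need to call something like collect_all() periodically and feed the resulting dicts into its display loop, the same way the Docker plugin does.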
No link to your own blog post about sysdig? It’s a nice short explanation focused on LXD containers.
I am longing for the moment to see someone else post any of my blog posts :-).
On request, here is a blog post about using sysdig to monitor containers… or actually troubleshoot them: HOWTO: Install and use Sysdig/Falco (troubleshooting and monitoring)
In the meantime I am trying to figure out which cgroup data from /sys/fs/cgroup/… and beyond can be used to get the interesting data. It should be in there somewhere, right? Then it would just be a matter of reading those files, which should be even less intrusive, I guess.
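Reading those files directly could look roughly like the sketch below. Note the paths are an assumption based on a typical cgroup v1 layout where LXD places containers under lxc/NAME in each controller hierarchy; the exact location varies by distro, LXD version, and cgroup v1 vs. v2, so the root is a parameter.

```python
import os

# Assumed default mount point for cgroup v1 controller hierarchies.
CGROUP_ROOT = "/sys/fs/cgroup"


def read_counter(path):
    """Read a single integer counter from a cgroup file, or None if
    the file is missing or unreadable (container gone, wrong layout)."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def container_cgroup_stats(name, root=CGROUP_ROOT):
    # cgroup v1: one hierarchy per controller; the lxc/<name> scope
    # is an assumption about where LXD parks its containers.
    return {
        "memory_bytes": read_counter(os.path.join(
            root, "memory", "lxc", name, "memory.usage_in_bytes")),
        "cpu_ns": read_counter(os.path.join(
            root, "cpuacct", "lxc", name, "cpuacct.usage")),
    }
```

Since this is just file reads, polling it for hundreds of containers should indeed be much cheaper than walking the per-process kernel APIs.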