Per-container performance metrics

Does anybody know how we can get per-container performance metrics? I have a large number of containers and need metrics like:

  • Top container users (CPU, RAM, memory)
  • Sorted list of all containers based on the criteria above
  • Container usage as a relative percentage of overall activity
  • etc

This is on LXD 2.12 (Ubuntu 16.04 and 17.04).

Our API exposes a number of performance metrics, specifically:

  • Total used CPU time
  • Disk usage (for root device)
  • Memory usage (current and peak)
  • Swap usage (current and peak)
  • Network usage (bytes/packets sent/received)

That’s at the /1.0/containers/NAME/state endpoint, and the CLI exposes it through “lxc info NAME”.

We can add more to this API as needed, so long as it’s something we can reasonably extract from cgroups or similar. Note that fetching this data is unfortunately very expensive: fetching it every 5 minutes or so is fine, but fetching it every second (like top would do) would put so much load on your system that the results themselves would be seriously skewed.
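As a rough illustration of working with that state document, here is a minimal Python sketch that flattens one response into simple name/value pairs. The "memory" and "cpu" fields follow the output quoted later in this thread; the "disk" and "network" layouts, the sample numbers for eth0, and the helper name flatten_state are assumptions made for the example, so check them against what your LXD version actually returns.

```python
import json

# Trimmed example of a state document. "memory" and "cpu" match the output
# quoted in this thread; the "network" layout and its numbers are assumptions.
SAMPLE_STATE = json.loads("""
{
  "disk": {},
  "memory": {"usage": 512438272, "usage_peak": 671420416,
             "swap_usage": 0, "swap_usage_peak": 0},
  "cpu": {"usage": 229143956967},
  "network": {"eth0": {"counters": {"bytes_received": 1000,
                                    "bytes_sent": 2000}}}
}
""")


def flatten_state(state):
    """Return a flat {metric_name: number} dict from one state document."""
    flat = {"cpu_usage": (state.get("cpu") or {}).get("usage", 0)}
    for key, value in (state.get("memory") or {}).items():
        flat["memory_" + key] = value
    # Assumed layout: {"disk": {DEVICE: {"usage": N}}}; empty when unreported.
    for device, info in (state.get("disk") or {}).items():
        flat["disk_" + device + "_usage"] = info.get("usage", 0)
    # Assumed layout: {"network": {NIC: {"counters": {NAME: N}}}}.
    for nic, info in (state.get("network") or {}).items():
        for counter, value in (info.get("counters") or {}).items():
            flat["net_" + nic + "_" + counter] = value
    return flat


metrics = flatten_state(SAMPLE_STATE)
```

A flat dictionary like this is easy to hand to whatever graphing system you use; how you fetch the JSON (the REST API over the local socket, or parsing CLI output) is left out on purpose.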

Thanks Stephane,

I understand the issue with fetching data every second. However, is it possible to set up a mechanism (like RRDtool) that collects container stats periodically (e.g. every 3 or 5 minutes) so they can be viewed offline?

Sure, you can take the “lxc info” output or the raw data from our API, poll it every few minutes, and feed it to munin, prometheus, or whatever other system you want to use for graphing.
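Since the CPU figure is a running total, a poller would typically compare two consecutive samples to see who was busy in between. The Python sketch below ranks containers by CPU time consumed between two polls and reports each one's share of the total, which covers the "top users" and "relative percentage" requests from the original question. The function name cpu_shares and the container names are hypothetical, and treating a decreasing counter as a restart is an assumption of this sketch rather than documented LXD behaviour.

```python
def cpu_shares(previous, current):
    """Rank containers by CPU time consumed between two polls.

    previous and current map container name -> cumulative CPU counter.
    Returns [(name, delta, percent_of_total)], busiest first.
    """
    deltas = {}
    for name, value in current.items():
        if name not in previous:
            # No baseline yet for this container; skip it until the next poll.
            continue
        delta = value - previous[name]
        if delta < 0:
            # Assumption: a counter that went down means the container was
            # restarted, so everything it shows was used since the restart.
            delta = value
        deltas[name] = delta
    total = sum(deltas.values())
    ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return [(name, delta, 100.0 * delta / total if total else 0.0)
            for name, delta in ranked]


# Hypothetical samples taken a few minutes apart.
earlier = {"web": 100, "db": 500}
later = {"web": 400, "db": 600, "new": 50}
ranking = cpu_shares(earlier, later)
```

The same delta-and-sort approach works for the network counters; memory usage is already a point-in-time value, so it can be sorted directly.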

I’m interested in this as well. Unfortunately I’m not getting everything from the API call to /containers/NAME/state that I’m hoping for:

"disk": {},
"memory": {
	"usage": 512438272,
	"usage_peak": 671420416,
	"swap_usage": 0,
	"swap_usage_peak": 0
},
"cpu": {
	"usage": 229143956967
}

My containers are using a non-default storage pool in ZFS; is that why this is blank? Also, what does CPU usage represent: total processing time in seconds? Is there a way to get load, such as what top might show?

@aaronvegh can you show the output of “lxc config show --expanded NAME” and “lxc storage list”?

Sure thing!

$ lxc config show --expanded aaron-project-manager
architecture: x86_64
config:
  limits.cpu: "2"
  limits.memory: 256MB
  limits.memory.enforce: soft
  user.token: 26547235-6aff-4e06-9f02-fbf7d555f63d
  volatile.base_image: 00fda1c45adafff66d5cdd3e8917b29083e17a18a1d0befb8ae8beb82cd44734
  volatile.eth0.hwaddr: 00:16:3e:bc:2d:7a
  volatile.idmap.base: "0"
  volatile.idmap.next: '[{"Isuid":true,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

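$ lxc storage list
+---------+-------------+--------+--------------------------------------------+---------+
|  NAME   | DESCRIPTION | DRIVER |                   SOURCE                   | USED BY |
+---------+-------------+--------+--------------------------------------------+---------+
| beta    |             | zfs    | BetaPool                                   | 2       |
+---------+-------------+--------+--------------------------------------------+---------+
| default |             | btrfs  | /var/snap/lxd/common/lxd/disks/default.img | 10      |
+---------+-------------+--------+--------------------------------------------+---------+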

Ok, so looking at the code, disk usage reporting on btrfs requires a quota group, and since those can be expensive, LXD only sets one up when you set a size on your root device.

So if you set a size on your root device, usage reporting should start working.

This is btrfs-specific: zfs doesn’t have that problem, and all other backend types simply lack usage reporting.

Right as usual! I thought I was running my containers on the ZFS storage pool, but they were in the btrfs one. After moving to the ZFS pool, the disk usage shows up. Thanks again!