CentOS 7.x containers on a CentOS 7.6 LXC host providing incorrect CPU utilization

vanfawx · May 22, 2019, 5:59pm

I can only reproduce this on CentOS 7.x containers running on a CentOS 7.6 host, running elrepo kernel 4.20.13, and using lxcfs 3.0.3, lxd 3.11.0 and lxc 3.1.0.

When running top, processes will randomly show 200% CPU, 400% CPU. I only have this issue with CentOS 7.x containers. CentOS 6.x, 5.x and Debian containers show CPU usage properly.

Also I have this issue regardless of whether I spin up a new CentOS 7.6 container via the LXD provided images, or migrate a CentOS 7.x container from openvz or a VM.

Please let me know any extra information needed to help track this down.

Thanks in advance!

vanfawx · May 28, 2019, 4:20pm

Sorry for the delay, but here’s some extra details.

The issue is that processes are reporting might higher CPU than they’re actually using. Here’s output from graphite. app-7 and app-8 are CentOS 7.4 64bit containers running on a lxd 3.11 host, with lxcfs 3.0.3 pulled from master about 1 month ago.

Even when there’s ZERO load on the boxes, processes are reporting that they’re using more CPU then they actually are.

Here’s the output from top on app-7: https://pastebin.com/Wh5x85JQ - Even though the summary line is reporting 100% idle, java is reporting 200% used and top 100% used. Even though there’s no actual load on the box.

Here’s the output from top on a container spun up from the lxd images repo: https://pastebin.com/Whh6T7Ha - same issue, with a vanilla centos 7 container spun up from images:centos/7/amd64 - top process reporting 100% cpu, even though it’s not using it.

Here’s the output from htop on app-7: https://pastebin.com/JPhCv9dn - It does the same thing, reporting cores are fully loaded when the process is actually idle, or only using a few % of CPU, not hundreds of %.

I’ve tried building lxcfs from master today, same issue. I’ve tried building 2.0.8, and while I can trigger the same issue, it seems to happen less frequently.

I’m not sure how to debug this further. But so far, the issue only exists on CentOS 7 containers. I cannot reproduce it on CentOS 5, 6 or Debian containers.

Any help would be appreciated. Thanks in advance!

vanfawx · May 29, 2019, 3:54pm

I’ve build lxcfs 3.0.3 from fc30, and still having the same issue of CPU utilization being reported incorrectly inside the container.

http://rpmfind.net/linux/RPM/fedora/30/x86_64/l/lxcfs-3.0.3-3.fc30.x86_64.html

Still the same issue of CPU utilization inside the container being incorrectly calculated, showing processes using much more CPU than in reality. This makes it difficult to monitor containers for CPU utilization.

https://pastebin.com/ZpBucvWk

Top process is showing using 100% CPU, but the cores show 100% idle.

Thanks in advance to anyone willing to take a look at this.

vanfawx · May 29, 2019, 5:01pm

I can also reproduce this using the RPMs built here - https://copr-be.cloud.fedoraproject.org/results/ganto/lxc3/epel-7-x86_64/ - This doesn’t seem to be an issue with my packaging, as it’s happening on other rpms produced by other people.

tomp · June 4, 2019, 2:26pm

Does this still occur if you run lxcfs without the -l flag?

vanfawx · June 4, 2019, 2:32pm

It does. It also occurs with lxcfs 2.0.8 which doesn’t have the -l flag, though less frequently.

tomp · June 4, 2019, 5:30pm

As discussed over E-mail, I have re-produced this issue on both CentOS 7 and Ubuntu 18.04 hosts, so it seems like a general issue with lxcfs rather than a specific problem with CentOS 7 hosts.

The trick to re-producing it, as you pointed out, is to press space bar whilst inside top every couple of seconds and you will see the top process shows as 100% usage. This does not happen if lxcfs isn’t running on the host.

vanfawx · June 4, 2019, 11:09pm

And if there’s actual workload going on in the container, you’ll get X00% where X is usually the number of active threads inside the process. Logstash will usually show 400% alternating between its actual usage. And as you say, not running lxcfs causes the process usage to be reported correctly.

Thanks for confirming this Tom! I had already logged a github ticket as well at https://github.com/lxc/lxcfs/issues/284