Find network usage for a container

I’ve got a grafana dashboard up for monitoring 4 hosts with LXD.

My network is an “unmanaged” bridge.

One or a few containers are using up a lot of bandwidth, but I can’t tell which ones.

Is there a dashboard that would display this information so that I can try to apply some rate limits on those, as they are now bogging down performance on my network?

The current dashboard looks like this, but has no “top network users” panel:

Most urgent: I need some help tracking down the most network-hungry containers.


Hi Erik,
It’s not elegant, but you can see which containers use the most bandwidth with a simple bash script:

for k in $(lxc list -f csv -c n); do echo "----- $k -----"; lxc info "$k" | grep -i eth0 -A 7 | grep -i Bytes -A 2; done
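To rank containers by usage rather than eyeballing each block, the byte counts could be parsed and sorted. A minimal sketch (my addition, not from the thread), using a canned here-doc in place of looping over real lxc info output; note that actual lxc info may print human-readable units (kB, MB), which would need extra handling:

```shell
# Hypothetical helper: read "Name:" and "Bytes received:" lines and rank by RX.
# The here-doc below stands in for the output of `lxc info "$k"` per container.
rank_by_rx() {
  awk '/^Name:/ { n = $2 }
       /Bytes received:/ { sub(/B$/, "", $3); print $3, n }' |
  sort -rn |
  awk '{ printf "%s\t%s bytes RX\n", $2, $1 }'
}

rank_by_rx <<'EOF'
Name: web1
  Bytes received: 123456B
Name: db1
  Bytes received: 987654B
Name: cache1
  Bytes received: 5555B
EOF
```

With the sample above, db1 (987654 bytes) is printed first, making the heaviest receiver obvious at a glance.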


Maybe @sdeziel can help with this?

Thanks a lot, but I would need to graph this over time somehow, so a Grafana + Prometheus solution would be the only manageable one…

I also have the same situation for “packets”: I currently have a lot of TX drops, and I need to track down which containers are affected.

The current LXD dashboard has been helpful so far, but not all the way. Below you can see the packet issue @sdeziel


The top project overview panel becomes a little crammed, but I’m happy to add more graphs. At the moment, we show top CPU, RAM and rootfs usage for a project. Adding pie charts for network and disk bandwidth would make sense.

Each of those has 2 metrics (TX/RX and read/write), and I was wondering whether they should be added together to show the combined traffic (in bps), or whether to go with one pie chart for TX and another for RX. I’m leaning toward the combined chart but I’m open to suggestions.
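For the combined-traffic option, a possible query sketch (my assumption of the shape, using LXD’s standard network counter names and a 5-minute rate window; the * 8 converts bytes/s to bits/s):

```promql
sum by (name) (
    rate(lxd_network_receive_bytes_total{device!="lo"}[5m])
  + rate(lxd_network_transmit_bytes_total{device!="lo"}[5m])
) * 8
```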

1 Like

Absolutely. It would also be great if the dashboard could group by LXD host, or offer a way to narrow the visualization scope to host-level issues, for example when storage, CPU or network activity grows on a single host. When running hundreds or thousands of containers across multiple hosts this becomes challenging, and using the dashboard for the analysis is really helpful.

I’m not sure, but network errors/drops would also be good to have, as they are an early indication of major and sometimes intermittent issues that are hard to track down.

I understand your point regarding multiple LXD hosts, but I think the LXD dashboard overview should remain a per-project view. The rationale being that those running a monitoring stack would typically have other tools providing per-host details (e.g. node exporter).

For network errors/drops, I think the per-project graph should be enough to identify whether there is any such error condition, in which case the operator would be expected to go to the Prometheus host and run some queries to identify the precise source.

When we get back from vacation, I’ll chat with Stéphane about the TX/RX and read/write graphs as those feel like good additions, thanks for suggesting!

1 Like


I’m in a bit of distress about packet drops that I’m unable to track down.

We have a single project “default” and as you can see, we seem to have some packet drops… somewhere.

We suspect this is affecting our HAProxy, which periodically (every 2 minutes) introduces high latency. I’ve been chasing this issue for days without success.

Would you know how to track the packet loss?

This Prometheus query should help you find which container’s device has non-zero TX errors:

sum by (name, device) (rate(lxd_network_transmit_errs_total{device!="lo",job="lxd-c2d",project="default"}[5m])) > 0
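For the TX drops mentioned earlier, an analogous query could be used (a sketch keeping the same label values; I’m assuming lxd_network_transmit_drop_total, the drop counterpart of the errors counter, is exposed on your LXD version):

```promql
sum by (name, device) (rate(lxd_network_transmit_drop_total{device!="lo",job="lxd-c2d",project="default"}[5m])) > 0
```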

Thanks, I modified it somewhat and it showed me something I can work with.

Seems the mp/s is a low number…

Sounds like “millipacket” per second to me.
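That reading is plausible (my note, not from the thread): rate() averages counter increments per second over the range window, so a single dropped packet in a 5-minute window shows up as 1/300 ≈ 0.0033 packets/s, i.e. a few “millipackets” per second:

```shell
# One dropped packet averaged over a 5-minute rate() window, in millipackets/s.
mpps=$(awk 'BEGIN { printf "%.2f", 1 / (5 * 60) * 1000 }')
echo "$mpps mp/s"   # prints "3.33 mp/s"
```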

1 Like

After discussing with Stéphane, we concluded that adding the project overview graphs for disk wouldn’t be that useful because few storage backends expose those metrics.

Block I/O schedulers are being dropped to reclaim performance. The drawback is that we lose visibility along the way. We’ll evaluate whether to keep the other broken disk metrics in the long run.

The pull request “grafana: add top network usage graphs” by simondeziel (lxc/lxd #11258 on GitHub) adds 4 pie charts for network usage.

1 Like

I’ve tested the new LXD dashboard and the network information is a great addition to the insights we get from the hosts. Fantastic, @stgraber!


Your simple bash script worked fine, thanks for sharing.
Though it only lists instances in the current project.

Since lxc project list doesn’t offer options for output format and field selection, it’s a little trickier, but this will do:

for i in $(lxc project ls | grep -v NAME | grep -v + | cut -d " " -f 2); do
  for k in $(lxc list -f csv -c n --project ${i}); do
    echo "---$i-- $k -----"; lxc info $k --project $i | grep -i eth0 -A 7 | grep -i Bytes -A 2;
  done
done

Thanks @kamzar1, looks fine.