I have an LXD host with 128 GB RAM monitored by Prometheus + Grafana.
Grafana threw an alarm that memory usage is high (94%).
When I query all running containers on the host, they report a total of 40G used.
When I query the host however, it reports 115G used.
While I understand this would be a tricky situation to sort out, what metric or tool should I use as a way to monitor my host for potential memory capacity problems?
- Grafana reports 94%
- Host reports 115 GB / 128 GB ≈ 90%
- Sum of all containers reports 40 GB / 128 GB ≈ 31%
Here is a picture that describes the situation a bit better.
Could be the ZFS ARC cache, which doesn't show up in the cache field of free.
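If it is the ARC, its current size can be checked directly on the host. A sketch, assuming ZFS on Linux, which exposes its stats under /proc/spl:

```shell
# The "size" row of the ZFS kstats is the current ARC size in bytes.
awk '$1 == "size" {printf "ARC size: %.1f GiB\n", $3 / 2^30}' \
    /proc/spl/kstat/zfs/arcstats

# arc_summary (ships with zfsutils-linux) prints a human-readable report:
arc_summary
```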
See The LXD Server with cluster mode is running out of memory - #4 by tomp
That's great advice to look into. I commented on the topic The LXD Server with cluster mode is running out of memory - #6 by erik_lonroth
So, the server reports as follows:
ARC size (current): 89.1 % 56.0 GiB
Does this mean that the ARC cache uses up 56 GiB for ZFS? (The server has 128 GB total.)
If so, what can we do to reduce this to a far lower level, since we need the RAM for processes rather than disk caching?
Are there recommendations here or how should we think about tuning this?
For example, would we cap this to, let's say, 8 GB or even less, and how?
There are some tips on how to tune the ARC cache:
Although it doesn’t present suggestions as to specific values.
I suppose it depends on your specific workloads.
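For example, a cap can be applied live via the module parameter. The 8 GiB figure below is purely illustrative, not a recommendation:

```shell
# Cap the ARC at 8 GiB at runtime (8 * 2^30 = 8589934592 bytes).
# Illustrative value only -- the right cap depends on your workload.
echo 8589934592 | sudo tee /sys/module/zfs/parameters/zfs_arc_max

# Verify the new limit took effect:
cat /sys/module/zfs/parameters/zfs_arc_max
```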
Thanks a lot! I came across this article earlier.
The article mentions rebuilding the initramfs as:
sudo update-initramfs -u -k all
Which seems a bit scary, since the kernel parameters can also be set without a reboot.
Would the guide in the article correctly describe the process of setting these variables on Ubuntu 20.04?
From what I can work out, the live kernel setting change only takes effect if the new zfs_arc_max is higher than zfs_arc_min, so you may need to update the kernel boot parameters to lower zfs_arc_min as well.
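For reference, a sketch of making the limits persistent on Ubuntu 20.04 via a modprobe option (the values are illustrative, with zfs_arc_min kept below zfs_arc_max so that later live writes to zfs_arc_max are accepted):

```shell
# Set both limits at module load time: 1 GiB floor, 8 GiB cap.
echo "options zfs zfs_arc_min=1073741824 zfs_arc_max=8589934592" | \
    sudo tee /etc/modprobe.d/zfs.conf

# On Ubuntu the zfs module can be loaded from the initramfs,
# so rebuild it for the options to apply at boot:
sudo update-initramfs -u -k all
```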
Right, so we would need to test this first then, I guess.
I'm thinking about setting the limit to about 2 GB, as that seems to be what others have done. It feels a bit arbitrary at the moment, since we don't really know.
What's your gut feeling here, given that our hosts are supposed to be dedicated LXD hosts?
Apart from the monitoring alert, was there any actual problem?
Or is it just a reporting issue? Ideally you do want to use as much unused memory as possible for disk caching, which is normally what happens with other filesystems.
The alert, only that. We had no issues. But swap also started to fill up.
In theory, the ARC's size is dynamically shrunk if there is memory pressure. If there is no memory pressure, having a big ARC translates to faster I/O. That back-off behavior should be easy to test by creating some big instance and watching the ARC shrink.
Contrary to popular belief, some swapping is good, as it lets you make better use of your RAM. It only becomes a problem when there is a lot of swap in/out activity. Here's an interesting article about swap:
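To tell idle swap usage apart from active swapping, you can sample the kernel's swap counters. A sketch reading /proc/vmstat, where pswpin/pswpout count pages swapped in and out since boot:

```shell
# Sum pages swapped in and out since boot, sample twice, and report
# the delta; a steadily growing delta indicates real memory pressure,
# unlike a large but idle amount of used swap.
swapped() { awk '/^pswp(in|out)/ {s += $2} END {print s + 0}' /proc/vmstat; }
a=$(swapped)
sleep 5
b=$(swapped)
echo "swap pages moved in 5s: $((b - a))"
```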
Totally, we thought this was the case but we haven’t tested it.
The problem is how we monitor our hosts to catch a pending OOM situation when the hosts are constantly at 95% RAM. I mean, how can we tell whether the server is about to dive due to a RAM shortage, as opposed to just normal ZFS caching?
We also need to understand where new workload can be placed, or whether it should run on a different host. When all hosts are at 95% RAM usage, how can we determine where new load should go?
Thanks for the pointers. Appreciated.
What metric is your "RAM Used" gauge plotting?
I was wondering if it includes the buffers/cache fields from free (which are, for non-ZFS filesystems, the equivalent of the ARC). If the metric excludes those, then you could potentially exclude the ARC cache from it too.
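One approach along those lines is to subtract the ARC from the naive "used" figure, the same way page cache is usually excluded. A sketch, with the caveat that on some kernel/ZFS combinations MemAvailable may already account for reclaimable ARC, so treat this as a heuristic rather than an exact number:

```shell
# "Used" memory with the ZFS ARC treated like ordinary page cache.
total=$(awk '/^MemTotal:/ {print $2 * 1024}' /proc/meminfo)
avail=$(awk '/^MemAvailable:/ {print $2 * 1024}' /proc/meminfo)
arc=$(awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats 2>/dev/null)
used=$((total - avail - ${arc:-0}))
echo "adjusted usage: $((100 * used / total))%"

# Rough PromQL equivalent, if node_exporter's zfs collector is enabled:
#   node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
#     - node_zfs_arc_size
```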
I have done some testing on how well the dynamic functionality works, to decide if we should decrease the max limit or not.
To fill the ARC cache I used fio in a few containers. Example command:
fio --direct=0 --name=myfile --rw=read --ioengine=sync --bs=4k --numjobs=1 --size=16G --runtime=2400
And when the ARC cache was filled, I tried to claim more memory than was available, in another container:
cat <( </dev/zero head -c 15000m) <(sleep 120) | tail
In the first tests I was kicked out of the containers with errors like:
Error: websocket: close 1006 (abnormal closure): unexpected EOF
Error: write unix @->/var/snap/lxd/common/lxd/unix.socket: i/o timeout
But then I started the different processes inside screen sessions instead, and they were actually running just fine.
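While the memory hog runs, the back-off is also visible directly in the kstats. A sketch; the size row should fall toward the c_min floor under pressure:

```shell
# Watch the ARC give memory back: "size" should drop toward c_min
# while the memory-hungry process is running.
watch -n 5 "grep -E '^(size|c_min|c_max)' /proc/spl/kstat/zfs/arcstats"
```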
So my conclusion is that the dynamic ARC shrinking seems to work well.
But as @erik_lonroth said, we need to find a good way to monitor this.
So, as @Joakim_Nyman said, we have now started monitoring based on a combination of ARC cache used and available RAM.
The question remains how we should deal with placement of new containers based on load parameters. Since I believe the LXD team is exploring some kind of automatic load balancing, I would be interested in how you think about this situation.
“On what performance metrics do we decide on container placement?”