Performance issue for a single container using ZFS on NVMe

I’m running the following setup:

  • Host:
    • Ubuntu 20.04.3 LTS (Fully patched as of today)
    • AMD EPYC 7302P 16-Core
    • 128 GB DDR4
    • 1 Gbit network
    • 1 x 2 TB NVMe (SAMSUNG MZWLJ1T9HBJR-00007)
  • lxd: 4.0.7 (snap)
  • zfs:
    • zfs-0.8.3-1ubuntu12.12
    • zfs-kmod-0.8.3-1ubuntu12.12

The LXD host has been set up following this guide in every respect: Production setup | LXD

The ZFS pool is the default created by lxd init and uses the single NVMe disk on its own partition. (We are about to expand this with more disk devices soon to mitigate future problems.)

The problem description

One container runs a websocket service (ws://) that normally accepts about 250 TCP connections.

The problem manifests itself as the websocket service gradually “choking” as connections ramp up. It starts as early as about 30-40 connections, and eventually the service only responds sporadically.

What about the service?

I would have suspected the service itself to be the culprit, if it weren’t for the fact that the “choking” behavior seems to propagate to the container itself at times. For example, keyboard input, commands and output freeze completely for a few seconds now and then, and then resume.

Observations and actions

  • None of the other services or containers have this problem.
  • The disk is at about 60% usage and 35% fragmentation.
  • No iowait can be observed (see the check commands sketched right after this list).
  • CPU utilization is about 7 of 16 cores, so that shouldn’t be an issue.
  • We have replaced the container (re-installed it), but the problems come back.
  • We get the same problem on a different, identical host.
  • Only one of the 16 CPUs is running at 100%, so the blocking behavior shouldn’t be caused by CPU exhaustion.
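
For reference, the kind of commands behind the iowait and per-CPU observations above; mpstat and vmstat are my own picks here, not the only way to do it (mpstat comes from the sysstat package, vmstat from procps):

mpstat -P ALL 2    # per-CPU utilization, including %iowait, refreshed every 2 s
vmstat 2           # run queue, blocked tasks (b column) and overall iowait (wa column)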

Similar issue before

I’ve seen this behavior before (in this thread), and at that point I thought the ZFS component was the issue. Back then I managed to get out of the situation by removing the service itself. But now that I have ended up in this situation again, on a completely different system, I need to understand how to pursue this issue fundamentally.

My next steps and my questions

I will try to run the service directly on the host to see if LXD is the issue here, but I have some questions:

  1. Are there any default capacity constraints on an LXD container which I might be hitting on this node?
  2. Are there any recommended settings for ZFS pools/volumes intended to be used with LXD containers on SSD/NVMe, such as the ones below? (A command for checking the current values is sketched right after this list.)
    zfs set primarycache=metadata lxdhosts/containers/juju-46be60-0
    zfs set secondarycache=metadata lxdhosts/containers/juju-46be60-0
  3. Is there any professional help available for LXD/ZFS to assist me in finding the problem at hand?
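
For reference, I assume the current values on that dataset can be checked with something like this (the dataset name is the one from question 2):

zfs get primarycache,secondarycache,recordsize,compression lxdhosts/containers/juju-46be60-0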

I’m not seeing anything above which really suggests that ZFS is the issue here; your service doesn’t seem to be I/O dependent, so it seems more likely that you’re hitting some kind of max_connection/max_open limit.

Anything odd looking in dmesg? How many open fds does your daemon have?
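
Something along these lines should give a quick picture (PID is a placeholder for your daemon’s pid):

sudo dmesg -T | tail -n 50            # anything odd logged recently (dmesg_restrict is set, hence sudo)
ls /proc/PID/fd | wc -l               # number of open fds for the daemon
ss -tn state established | wc -l      # established TCP connections, to compare against any connection limit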

I’m on the same track as you here, @stgraber, but since I’ve seen this before I want a good strategy to rule it out.

The scary thing is that the service at times runs fine for days or even weeks. This is indeed an indication that it’s the service code causing this, but… yeah.

Also, I would love to know what tools I should use to figure this out. I’m using combinations of “dstat”, “top”, “htop”, “iotop” and “systemd-cgtop” to query and look for patterns. They don’t give the same view, so I would love to know how the pros determine load, for example how to look at specific disks and pools in order to rule out the disk.
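
To look at a specific pool or disk, I assume something like this would do (the pool name “default” is just a placeholder for whatever lxd init created; iostat and pidstat come from the sysstat package):

zpool iostat -v default 2    # per-vdev read/write ops and bandwidth for the pool
iostat -x 2                  # per-device utilization and latency
pidstat -d 2                 # per-process disk I/O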

The daemon has about 800+ open fds, which is not particularly high. What we do notice is a huge difference in time when running “lsof” in the container vs on the LXD host, as pictured below. This is interesting for us as it now turns our eyes towards LXD rather than ZFS. Thanks a lot for this pointer.

Below is the timing of the same process inside vs outside of the container. Notice the huge difference.

Worth mentioning is that we applied “sysctl -p /etc/sysctl.d/99-lxd.conf” on the host without a reboot. Perhaps we need to reboot the node to get the settings below to apply fully?

cat /etc/sysctl.d/99-lxd.conf
fs.aio-max-nr = 524288
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576
kernel.dmesg_restrict = 1
kernel.keys.maxbytes = 2000000
kernel.keys.maxkeys = 2000
net.core.bpf_jit_limit = 3000000000
net.ipv4.neigh.default.gc_thresh3 = 8192
net.ipv6.neigh.default.gc_thresh3 = 8192
vm.max_map_count = 262144
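
To check whether the values above are actually in effect without a reboot, comparing the running values should be enough, e.g.:

sysctl fs.aio-max-nr fs.inotify.max_user_instances    # print the values currently in effect
sudo sysctl --system                                  # re-apply every file under /etc/sysctl.d/ at runtime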

In the container:
sysctl -n fs.file-max
9223372036854775807

Also, the limits for the process (daemon) seem not to be the issue, as you can see below.

We will try to run the process outside of the LXD host tomorrow, to try to isolate the issue from LXD. Hopefully we can come back with even more info then.

This is a nasty performance bug for us… Thanks for all the help so far.

Can you show the same limits taken from the daemon running on the host?

Also, is ls /proc/PID/fd/ | wc -l similarly slow?

Can you be a bit more explicit about what you need…

dwellir1 = lxd-host
juju-ce0707-4 = the container

Which daemon is it that you need?

On the LXD host, the process I’m having issues with produces this:

time ls -l /proc/2390455/fd | wc -l
773

real	0m0.012s
user	0m0.006s
sys	0m0.008s

I.e. sub-second execution.

I want time ls -l /proc/PID/fd/ | wc -l and cat /proc/PID/limits for your daemon, both running on the host and running in the container.

Right, I’ll hunt it down for you tomorrow. Been fighting this for days now…

As you requested, and again thanks for assisting.

Limits on container (for the running daemon):


Limits on host (for the same running daemon):

Time on ls - on container.

Time on ls - on host.

The only remarkable difference is that “lsof” takes considerably longer in the container than on the host.

On the container (1m+)

On the host (0.2s)
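
For reference, the comparison was roughly along these lines (PID is a placeholder; the exact lsof invocation may have differed, and the pid seen inside the container’s pid namespace is not the same as on the host):

time lsof -p PID                                      # on the host: ~0.2 s
lxc exec juju-ce0707-4 -- bash -c 'time lsof -p PID'  # in the container: 1 min+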

However, we have now tested running the service outside of LXD, with the same result.

The only thing that comes to mind now is that we have not applied the “txqueuelen” setting on the server, and we didn’t reboot after applying the kernel parameters. This is also something we will do, but what are your thoughts on that?
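
If we do apply it, I assume it would be something like this on the host interfaces (the interface names are placeholders; lxdbr0 is just the default LXD bridge name, and 10000 is the value I recall from the production setup guide):

ip link set eth0 txqueuelen 10000      # host NIC
ip link set lxdbr0 txqueuelen 10000    # LXD bridge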

Another question on LXD: would kernel settings applied with sysctl -w or set in /etc/sysctl.conf also apply to containers, or would this only matter to the host?
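
One way I plan to check that, assuming the sysctl tool (procps) is available in the container:

sysctl net.ipv4.neigh.default.gc_thresh3                              # value on the host
lxc exec juju-ce0707-4 -- sysctl net.ipv4.neigh.default.gc_thresh3    # value seen inside the container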