Slow new connections to LXC VMs and opinion on MTU size consistency between host and LXC bridge

I have a physical server with an external interface (MTU 1500), internal interface (MTU 1450), and lxdbr0 set to use MTU 1450.

In LXC there are 2 Ubuntu 22.04 guest VMs (single interface, MTU 1450) that receive service queries from the external network (MTU 1500) and get some data from the back-end via the internal network (MTU 1450). So far this has worked well.

$ lxc version
Client version: 5.0.1
Server version: 5.0.1

$ snap list --all lxd
Name  Version        Rev    Tracking    Publisher   Notes
lxd   5.0.0-b0287c1  22923  5.0/stable  canonical✓  disabled
lxd   5.0.1-9dcf35b  23541  5.0/stable  canonical✓  -

$ cat /etc/lsb-release 

$ sudo lxc network show lxdbr0
config:
  bridge.mtu: "1450"
  ipv4.nat: "true"
  ipv6.address: none
description: ""
name: lxdbr0
type: bridge
used_by:
- /1.0/instances/node1
- /1.0/instances/node2
- /1.0/profiles/default
managed: true
status: Created
locations:
- none

This problem started several days ago. I had 5.0.0-b0287c1 until recently, but it got updated together with OS packages, so I can’t tell what specifically caused the change.

Now inbound service clients take a long time (2-6 seconds) to connect to the services in the LXC VMs. Once a client connects, everything works as fast as it used to, which makes me think of MTU and maybe DNS issues rather than bandwidth or packet loss.
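For example, on a client I can split DNS lookup time from TCP connect time with curl’s standard `-w` timing variables (the service URL below is a placeholder):

```shell
# Timing format: DNS lookup vs. TCP connect vs. total transfer time.
fmt='dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n'
echo "curl write-out format: $fmt"
# Run from a symptomatic client (YOUR_SERVICE_HOST:PORT is a placeholder):
# curl -s -o /dev/null -w "$fmt" "http://YOUR_SERVICE_HOST:PORT/"
```

A large `connect` value with a small `dns` value would point away from DNS and toward the TCP handshake itself.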

I haven’t changed any firewall or LXC settings in recent days. The host’s network latency seems unchanged (0.4 ms to the host, < 40 ms from my WAN client).
I’ve looked at various OS logs, NIC stats, and MTU/MSS values, but can’t find anything unusual. DNS queries resolve quickly both in the LXC VMs and on the host.

I plan to use tcpdump to gather data from inbound connections from several symptomatic clients, but if anyone has an idea whether it’d be worth investigating LXC here, please let me know. So far I have no indication that anything is wrong with LXC, and I’d rather focus on other areas. But if anyone has experienced increased latency in establishing new connections to LXC VMs, please share if you think it’s related.

Is it worth changing the external interface’s MTU from 1500 to 1450 (to match the VMs and avoid fragmentation)? The current configuration (MTU 1500 on the external NIC, 1450 on lxdbr0) has worked well for me for several months, so I’m not sure I should mess with it. Does anyone have experience with, or an opinion on, different vs. consistent MTUs on the host and the LXC bridge? I don’t think that’s related to my problem, but I’m curious about it in terms of best practices. The host is lightly loaded, so any benefit from less packet fragmentation would probably be negligible, but there may be other reasons why a consistent MTU would be better.
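For context on the fragmentation question, the per-segment TCP payload each MTU allows can be worked out directly (assuming IPv4 with no IP or TCP options):

```shell
# MSS = MTU - 20 (IPv4 header) - 20 (TCP header)
ext_mtu=1500
br_mtu=1450
echo "external MSS: $((ext_mtu - 40))"
echo "bridge MSS:   $((br_mtu - 40))"
```

Inbound segments sized for the larger external MSS only fit through the 1450-byte bridge if the MSS negotiated in the SYN is small enough or PMTU discovery works, which is why ICMP filtering along the path can matter.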

Can you show the reproducer commands (such as telnet?) you’re using to demonstrate the problem?

Good idea!

I’ve created a new VM, enabled Nginx, created an LXC proxy device for it, allowed the port externally with ufw, and fetched the home page with wget from my client… It was fast, which is great.

The problem is, the affected services also work fine for most clients, but about 30% experience this slowness (with a non-Nginx service running in the other VMs). I should use Wireshark to capture flows from some of those problematic clients to see if there’s anything unusual (assuming I know how to find it).

I’ll post back if I make any useful discovery that I can share with the community.

Did some digging today:

  • Filtered slow connections from app logs
  • Removed ones that are probably slow simply because their Internet is slow
  • Focused on a handful of slow clients from developed countries
  • Checked ping, tracepath, used tcpdump to capture some data and looked at it
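The capture step looked roughly like this (the interface name, client address, and port below are placeholders):

```shell
# Build a capture filter for one symptomatic client (all values are placeholders):
iface=eth0
client_ip="CLIENT_IP"
port=443
filter="host ${client_ip} and tcp port ${port}"
echo "capture filter: $filter"
# Write a pcap for later analysis in Wireshark (run as root on the host):
# tcpdump -ni "$iface" -c 200 -w client.pcap "$filter"
```

Repeated SYNs from the same client during the handshake would be the classic signature of slow connection setup, and the client’s MSS is visible in the SYN options.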

I didn’t find anything revealing:

  • Most of the slow clients from developed countries weren’t pingable; the same ICMP filtering that blocks ping may also be interfering with PMTU discovery
  • tcpdump didn’t show much: there are connection resets and out-of-order packets, but I can’t say whether there are more (or fewer) of them than for other clients; to do that I’d have to capture and analyze a lot more data
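One check that still works when a client’s path drops plain pings in one direction is a don’t-fragment probe from the host at the bridge MTU (CLIENT_IP is a placeholder):

```shell
# Send a don't-fragment ping sized to the bridge MTU; a "message too long"
# error would mean some hop on the path has a smaller MTU.
mtu=1450
payload=$((mtu - 28))   # 28 = 20-byte IPv4 header + 8-byte ICMP header
echo "probing with ${payload}-byte ICMP payload"
# ping -c 3 -M do -s "$payload" CLIENT_IP
```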

I’ve tried adjusting MTU-related settings on the host (in sysctl.conf), but couldn’t find any combination that worked better.
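For reference, these are the sorts of knobs I was experimenting with in /etc/sysctl.conf (the values shown are examples, not a recommendation):

```ini
# Probe the path MTU at the TCP layer (RFC 4821), which still works when
# ICMP "fragmentation needed" messages are filtered along the path:
net.ipv4.tcp_mtu_probing = 1
# Starting MSS used when probing kicks in:
net.ipv4.tcp_base_mss = 1024
```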

Today the clients were less slow than yesterday, so if I had to guess, I’d say slowdowns in international Internet routing exacerbate problems for poorly configured clients.

I’ll leave the system as is for a few more days and then try setting the same MTU on the external interface and lxdbr0. I don’t think the mismatch is the problem, but it’s probably better that way.

Edit (2-3 days after making this comment): performance is normal again without any significant changes. At this point I’m pretty sure the slowdown (as much as 15%!) was related to periodic variations in the quality of intercontinental links.
