Hi, as my network is now starting to look moderately stable, I’ve been looking at the network performance, in particular the uplink between my site and the cloud edge. I have some numbers, but I’ve no frame of reference as to whether they’re OK or could be better.
What I have
Instance => Incus OVN => IC => (tinc VPN trunk over 100M link) => IC => Incus OVN => Instance
Ping latency on the uplink is ~ 25ms.
Does this look to be good / bad / ugly … ?
(can I do better, or is this just expected overhead?)
Confusingly, if I run it node-to-node over the VPN link, where I would expect a figure somewhere between the two, it actually comes in ~20% slower than the instance-to-instance speed (!)
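For reference, the two tests are along these lines (addresses here are placeholders; iperf3 runs as a server on the far end and I point a client at it in each case):

```
# far end (the remote instance, or the remote node, respectively):
iperf3 -s

# instance-to-instance: run inside the local container against the far instance
iperf3 -c <far-instance-address> -t 30

# node-to-node over the VPN: run on the local host against the far node's tunnel address
iperf3 -c <far-node-vpn-address> -t 30
```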
Hi, I’m only using containers … if I run locally via the VPN link I can get up to 250 Mbits/sec, so I’m thinking it’s not a VPN/CPU issue … although I’m about to switch to AES to see if I can get a hardware boost. One end has access to 4 CPUs, the other 2. Both ends are ARM64.
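In case it helps anyone following along, the cipher change I’m about to try is roughly this in tinc.conf (just a sketch, assuming tinc 1.0-style options, where the defaults are blowfish/sha1):

```
# /etc/tinc/<netname>/tinc.conf
# AES should benefit from the ARM64 crypto extensions, unlike the default blowfish
Cipher = aes-256-cbc
Digest = sha256
```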
Ok, as I don’t know how your network is configured, I’m not entirely sure how to interpret those results … on the one hand you seem to be limited to 1G per channel, but on the other you’re getting 2.5G over 4 channels …
I’ve tried mine with -P4 and I do get more throughput, indeed much closer to the speed I get with no encapsulation or VPN in the way.
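For the record, that’s just adding parallel streams to the otherwise identical client command, something like:

```
# four parallel TCP streams instead of one
iperf3 -c <far-end-address> -t 30 -P 4
```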
In my case, it’s two clusters with a 3Gbps residential internet connection between the two (wireguard used between both sites). It’s not unusual for a single stream over my internet connection to cap at around gigabit speed; multiple streams usually get me past this and close to the real internet speed.
Mmm, on the one hand tinc works very well as a mesh and is easy to deploy, but looking at the way it’s working under load I’m beginning to wonder whether it’s the right solution. I’ve noticed that when iperf3 runs it loads “tincd” on all three cluster nodes, despite traffic only flowing through the one node where the process is running. Not sure whether this is a function of the mesh or whether I’ve something set wrong.
It’s just occurred to me that when I said the VPN is running on the node at each end, the far-end node is actually a cloud server instance… Many thanks for those numbers, I’ll do a little more digging. What I have can obviously be improved.
Mmm … I’m confusing myself a little here. I reverted to testing the VPN at node level and got worse performance, until I dropped the MTU to 1300, at which point I got twice the throughput. As I increase the MTU, the performance gets progressively worse.
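In case it’s useful, one way to clamp the tunnel MTU is in the tinc-up script on each node; a minimal sketch (the tunnel address is a placeholder, and $INTERFACE is set by tincd):

```
#!/bin/sh
# /etc/tinc/<netname>/tinc-up
ip link set dev "$INTERFACE" mtu 1300
ip addr add <tunnel-address>/24 dev "$INTERFACE"
ip link set dev "$INTERFACE" up
```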
Right, setting the correct MTU value is important to get the best performance.
I faced a similar issue while configuring my WireGuard VPN. During my research I came across the following gist, Wireguard Optimal MTU, which goes into quite some detail on how to find the correct value.
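If I remember the gist correctly, the approach boils down to probing the path MTU with the DF bit set and then subtracting the WireGuard overhead (60 bytes over IPv4, 80 over IPv6), roughly:

```
# Find the largest ICMP payload that gets through unfragmented; on-the-wire size is
# payload + 8 (ICMP) + 20 (IPv4), so -s 1472 tests a 1500-byte path:
ping -M do -s 1472 <far-end-public-address>

# If 1472 passes, the path MTU is 1500, so the WireGuard interface MTU would be
# 1500 - 80 = 1420 (IPv6 endpoints / the safe default) or 1500 - 60 = 1440 (IPv4 only).
```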
Hi, many thanks. I think I’m going to try switching to WG, as that seems like the next thing to try. I had thought I’d got the MTU right for tinc, but something still doesn’t seem to sit right. I’ll try that gist and see what happens … either way it’ll be interesting to see if my provisioner can cope with the switch without completely borking the running cluster …
Which seems a lot better; moreover, the CPU usage has dropped from “very significant” to “undetectable”. Altogether it seems like a no-brainer choosing wg over tinc … although I found I had to manually mesh all the nodes to make Geneve happy.
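For the record, the manual mesh is just a wg-quick config on each node with one [Peer] block per other node; a trimmed-down sketch of the shape of it (keys, addresses and the exact MTU are placeholders for whatever suits your path):

```
# /etc/wireguard/wg0.conf on node 1
[Interface]
Address = 10.99.0.1/24
ListenPort = 51820
PrivateKey = <node-1-private-key>
# leave headroom below the path MTU for the Geneve overlay on top
MTU = 1420

# node 2
[Peer]
PublicKey = <node-2-public-key>
Endpoint = <node-2-address>:51820
AllowedIPs = 10.99.0.2/32
PersistentKeepalive = 25

# node 3
[Peer]
PublicKey = <node-3-public-key>
Endpoint = <node-3-address>:51820
AllowedIPs = 10.99.0.3/32
PersistentKeepalive = 25
```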
Many thanks for the help … I almost seem to have run out of things to fix …