I've had a problem for some time, but now that I have a cluster and a working migration, it's becoming a bit of a headache. When I copy instances around (incus copy) I consistently see speeds of around 100Mb/sec. However, when I use the migration option via the UI, although it starts at around 100Mb/sec, it drops steadily, so by the end of a 1G copy it's down to 20Mb/sec. For larger instances it can drop below 10.
I “seem” to have had better results in the past when copying instances without any attached snapshots … but snapshots attached to an instance shouldn’t make “that” much of a difference (?) … does anyone know what might be happening here / any way to mitigate this speed drop-off?
Ok, found the problem. Apparently for the first copy I take, incus copy makes use of ZFS send and receive, which is what I'd expect to happen every time snapshots are copied between machines … because this is, well, the most efficient mechanism (??)
On the next and subsequent passes, however, it uses rsync. My mind is blown; it feels like putting a rabbit in for the first lap, then replacing it with a snail for the next 99. There must be something I'm missing here, something I don't understand, or something I'm doing wrong: send/recv has an incremental feature for exactly this purpose, which is why syncoid was highlighting this problem in the first place.
Does anyone have any idea why this is, or how to make it always use send/recv? On the one hand it feels like I need to implement my own incus copy --refresh, but on the other hand it feels like I must be missing something??
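For reference, this is the sort of incremental flow I was expecting under the hood (dataset, snapshot and host names are purely illustrative, not what incus actually uses):

# first pass: full send of a snapshot
zfs send tank/web1@snap0 | ssh backup zfs receive tank/web1
# every pass after that: only the delta since the last common snapshot
zfs send -i tank/web1@snap0 tank/web1@snap1 | ssh backup zfs receive -F tank/web1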
Can you show specifically how to reproduce the issue? What exact version of incus are you on, how are the nodes set up, what exactly do you do, and perhaps the logs showing which API calls are being made?
You said you do this via the web UI, but is there a CLI command which exhibits the same behaviour? That would make it easier to reproduce.
On the one hand it feels like I need to implement my own incus copy --refresh
Are you saying that incus copy --refresh from the command line isn’t working properly? Or is it just that the web UI is not using the --refresh functionality?
I do agree that it would be nice not to have to specify --refresh but I expect there is a good reason why it’s not tried by default.
Ok, thanks for the reply - I've just found the problem. Because it was slow, I initially did the copy with --instance-only, which appears to be the issue.
There appears to be some logic in incus copy that decides whether it can use send/recv or whether it has to fall back to rsync to cope with some inconsistency it can't otherwise resolve. It would appear that if it can see all the snapshots it will use send/recv, but if it just has the instance it will fall back to rsync on the second and subsequent passes … which means that to use --refresh effectively you must also copy all the snapshots, and hence need to avoid --instance-only like the plague.
(so my incremental copy has now dropped from minutes of high load to under 2s)
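In other words, this is the combination that works for me versus the one that quietly falls back to rsync (instance and node names are just examples from my setup):

# copies the instance WITH its snapshots; the first run is a full zfs send,
# subsequent runs are fast incremental sends
incus copy web1 web1 --target=rad --target-project=standby --refresh

# what I was doing before: no snapshots on the target, so every --refresh
# after the first copy fell back to rsync and crawled
incus copy web1 web1 --target=rad --target-project=standby --refresh --instance-only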
My guess is that this is the same for the CLI and the UI, given that the logic and the copy both happen in the background and are managed by the server.
For completeness, what I wanted were hot-standby instances on a backup machine such that in the event of a node failure, I could recover (if necessary) immediately on a spare node.
This is what I'm doing:
incus 6.9 (stable)
zfs 2.2.3
Raspberry Pi 5 (aarch64) / Debian / kernel 6.6.74
Cluster with 5 nodes
#!/usr/bin/env bash
# Refresh standby copies of every instance in the cluster onto this node.
self=rad            # this node's cluster member name
project=standby     # project holding the standby copies

# List every instance as "name,location" (CSV, no header).
for i in $(incus list -c nL -f csv,noheader)
do
    inst=$(echo "$i" | cut -d"," -f1)
    host=$(echo "$i" | cut -d"," -f2)

    # Skip instances that already live on this node.
    if [ "$host" == "$self" ]
    then
        echo "Skip: ${inst}"
        continue
    fi

    # Incremental copy (including snapshots) into the standby project on this node.
    incus copy "$inst" "$inst" --target-project="${project}" --refresh --target="${self}"
done
this is my edited-down version of what's currently in flight, so I've not checked this specific script yet, but hopefully you get the idea.
At the moment it "looks" like it's going to work … should have finished the first pass again in 20 mins or so …
root@rad:~# time standby.sh # first pass, fullcopy ~ 100GB
real 15m55.160s
user 0m1.302s
sys 0m0.831s
root@rad:~# time standby.sh # second pass, incremental
real 0m38.366s
user 0m0.504s
sys 0m0.227s
Just ran it again with dstat and I'm getting wire speed on everything that is transferred, which is cool. It also took 38s for 14 instances, so maybe a couple of seconds per instance; I could definitely run this every hour.
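If it stays this quick I'll probably just drop it into cron, something along these lines (the script path and log file are whatever you choose, untested as written):

# /etc/cron.d/standby - refresh the standby copies at the top of every hour
0 * * * *  root  /usr/local/bin/standby.sh >> /var/log/standby.log 2>&1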
Ah right. I guess that even if it made a temporary snapshot and did zfs send/recv, it would then delete the temporary snapshot, so would have nothing to base a future incremental copy on.
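One way to sanity-check that theory would be to compare the snapshot lists on the source instance and on the copy; if they have no snapshot in common, there is nothing an incremental send could be based on (instance and project names below are just examples):

# snapshots of the running instance (default project)
incus snapshot list web1

# snapshots of the standby copy in the standby project
incus snapshot list web1 --project standby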
Maybe there should be a way to copy just the latest snapshot, as opposed to all snapshots. Something similar exists as --refresh-exclude-older for refreshes.
Erm, yeah, I did see that option, but wasn't totally sure I understood what it was going to do from the description, so didn't try it. What I really want are two options in the UI:
An incremental tickbox with a time selector, target project and target server(s)
A migrate that does a full copy with the instance running, then shuts it down, does an incremental pass, then restarts it on the new server (given that live migration doesn't seem to be an option); a rough manual sketch of what I mean is below.
Currently, migrate seems to be: shut down, full copy, watch and wait, start on target … again, unless I'm missing something, this seems to be quite a labor-intensive process that involves non-trivial downtime.
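Roughly what I'd like the UI to do, done by hand (names are just examples and I haven't actually tested this exact sequence):

incus copy web1 web1-new --target=rad --refresh    # pass 1: bulk copy while web1 keeps running
incus stop web1                                    # downtime window starts here
incus copy web1 web1-new --target=rad --refresh    # pass 2: small incremental catch-up
incus delete web1                                  # drop the original
incus rename web1-new web1                         # take over the original name
incus start web1                                   # running again on the new node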
I'm toying with the idea of building my own GUI. Incus (the backend) seems to have all the bells and whistles, and although the front-end works OK it seems to be missing some polish, like migration, hot backups, and even the ability to move instances between projects. I've been trying to find out whether anyone else is working on something or whether the current version is "it", but I haven't seen anything …