I've had a problem for some time, but now that I have a cluster and a working migration, it's becoming a bit of a headache. When I copy instances around (incus copy) I consistently see speeds of around 100Mb/sec. However, when I use the migration option via the UI, although it starts at around 100Mb/sec, it drops steadily, so by the end of a 1G copy it's down to 20Mb/sec. For larger instances it can drop below 10.
I “seem” to have had better results in the past when copying instances without any attached snapshots … but snapshots attached to an instance shouldn’t make “that” much of a difference (?) … does anyone know what might be happening here / any way to mitigate this speed drop-off?
Ok, found the problem. Apparently for the first copy I take, incus copy makes use of ZFS send and receive, which is what I'd expect to happen every time snapshots are copied between machines … because this is, well, the most efficient mechanism (??)
On the next and subsequent passes, however, it uses rsync. My mind is blown; it feels like putting a rabbit in for the first lap, then replacing it with a snail for the next 99. There must be something I'm missing here, something I don't understand, or something I'm doing wrong: send/recv has an incremental feature for exactly this purpose, which is why syncoid was highlighting this problem in the first place.
Does anyone have any idea why this is, or how to make it always use send/recv? On the one hand it feels like I need to implement my own incus copy --refresh, but on the other hand it feels like I must be missing something??
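For reference, this is the sort of incremental flow I was expecting under the hood (dataset, snapshot and host names are purely illustrative, not what incus actually uses):

# first pass: full send of a snapshot
zfs send tank/web1@snap0 | ssh backup zfs receive tank/web1
# every pass after that: only the delta since the last common snapshot
zfs send -i tank/web1@snap0 tank/web1@snap1 | ssh backup zfs receive -F tank/web1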
Can you show specifically how to reproduce the issue? What exact version of incus are you on, how are the nodes set up, what exactly do you do, and perhaps the logs showing which API calls are being made?
You said you do this via the web UI, but is there a CLI command which exhibits the same behaviour? That would make it easier to reproduce.
On the one hand it feels like I need to implement my own incus copy --refresh
Are you saying that incus copy --refresh from the command line isn’t working properly? Or is it just that the web UI is not using the --refresh functionality?
I do agree that it would be nice not to have to specify --refresh but I expect there is a good reason why it’s not tried by default.
Ok, thanks for the reply - I've just found the problem. Because it was slow, I initially did the copy with --instance-only, which appears to be the issue.
There appears to be some logic in incus copy that decides whether it can use send/recv or whether it has to fall back to rsync to cope with some inconsistency it can't otherwise resolve. It would appear that if it can see all the snapshots it will use send/recv, but if it just has the instance it will fall back to rsync on the second and subsequent passes … which means that to use --refresh effectively you must also copy all the snapshots, and hence need to avoid --instance-only like the plague.
(so my incremental copy has now dropped from minutes of high load to under 2s)
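In other words, this is the combination that works for me versus the one that quietly falls back to rsync (instance and node names are just examples from my setup):

# copies the instance WITH its snapshots; the first run is a full zfs send,
# subsequent runs are fast incremental sends
incus copy web1 web1 --target=rad --target-project=standby --refresh

# what I was doing before: no snapshots on the target, so every --refresh
# after the first copy fell back to rsync and crawled
incus copy web1 web1 --target=rad --target-project=standby --refresh --instance-only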
My guess is that this is the same for the CLI and the UI, given that the logic and the copy both happen in the background and are managed by the server.
For completeness, what I wanted were hot-standby instances on a backup machine such that in the event of a node failure, I could recover (if necessary) immediately on a spare node.
This is what I'm doing:
incus 6.9 (stable)
zfs 2.2.3
Raspberry Pi 5 (aarch64) / Debian / kernel 6.6.74
Cluster with 5 nodes
#!/usr/bin/env bash
# Refresh standby copies of every instance in the cluster onto this node.
self=rad            # this node's cluster member name
project=standby     # project holding the standby copies

# List every instance as "name,location" (CSV, no header).
for i in $(incus list -c nL -f csv,noheader)
do
    inst=$(echo "$i" | cut -d"," -f1)
    host=$(echo "$i" | cut -d"," -f2)

    # Skip instances that already live on this node.
    if [ "$host" == "$self" ]
    then
        echo "Skip: ${inst}"
        continue
    fi

    # Incremental copy (including snapshots) into the standby project on this node.
    incus copy "$inst" "$inst" --target-project="${project}" --refresh --target="${self}"
done
this is my edited-down version of what's currently in flight, so I've not checked this specific script yet, but hopefully you get the idea.
At the moment it "looks" like it's going to work … should have finished the first pass again in 20 mins or so …
root@rad:~# time standby.sh # first pass, fullcopy ~ 100GB
real 15m55.160s
user 0m1.302s
sys 0m0.831s
root@rad:~# time standby.sh # second pass, incremental
real 0m38.366s
user 0m0.504s
sys 0m0.227s
Just ran it again with dstat and I'm getting wire speed on everything that is transferred, which is cool. It also took 38s for 14 instances, so maybe a couple of seconds per instance; I could definitely run this every hour.
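If it stays this quick I'll probably just drop it into cron, something along these lines (the script path and log file are whatever you choose, untested as written):

# /etc/cron.d/standby - refresh the standby copies at the top of every hour
0 * * * *  root  /usr/local/bin/standby.sh >> /var/log/standby.log 2>&1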
Ah right. I guess that even if it made a temporary snapshot and did zfs send/recv, it would then delete the temporary snapshot, so would have nothing to base a future incremental copy on.
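One way to sanity-check that theory would be to compare the snapshot lists on the source instance and on the copy; if they have no snapshot in common, there is nothing an incremental send could be based on (instance and project names below are just examples):

# snapshots of the running instance (default project)
incus snapshot list web1

# snapshots of the standby copy in the standby project
incus snapshot list web1 --project standby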
Maybe there should be a way to copy just the latest snapshot, as opposed to all snapshots. Something similar exists as --refresh-exclude-older for refreshes.
Erm, yeah, I did see that option, but wasn't totally sure I understood what it was going to do from the description, so didn't try it. What I really want are two options in the UI:
An incremental tickbox with a time selector, target project and target server(s)
A migrate that does a full copy with the instance running, then shuts it down, does an incremental pass, then restarts it on the new server (given that live migration doesn't seem to be an option); a rough manual sketch of what I mean is below.
Currently, migrate seems to be: shut down, full copy, watch and wait, start on target … again, unless I'm missing something, this seems to be quite a labor-intensive process that involves non-trivial downtime.
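Roughly what I'd like the UI to do, done by hand (names are just examples and I haven't actually tested this exact sequence):

incus copy web1 web1-new --target=rad --refresh    # pass 1: bulk copy while web1 keeps running
incus stop web1                                    # downtime window starts here
incus copy web1 web1-new --target=rad --refresh    # pass 2: small incremental catch-up
incus delete web1                                  # drop the original
incus rename web1-new web1                         # take over the original name
incus start web1                                   # running again on the new node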
I'm toying with the idea of building my own GUI. Incus (the backend) seems to have all the bells and whistles, and although the front-end works OK it seems to be missing some polish, like migration, hot backups, and even the ability to move instances between projects. I've been trying to find out whether anyone else is working on something or whether the current version is "it", but I haven't seen anything …