Remote copy speed very slow on 2.21

When copying containers between two hosts (DO droplets) over a private network, the transfer is very slow for some reason, even though the network should be more than capable bandwidth-wise (iperf3 output):

[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   255 MBytes  2.14 Gbits/sec  4048    660 KBytes
[  4]   1.00-2.00   sec   221 MBytes  1.86 Gbits/sec    26    754 KBytes
[  4]   2.00-3.00   sec   192 MBytes  1.61 Gbits/sec   146    612 KBytes
[  4]   3.00-4.00   sec   235 MBytes  1.97 Gbits/sec    18    771 KBytes
[  4]   4.00-5.00   sec   218 MBytes  1.82 Gbits/sec   492    498 KBytes
[  4]   5.00-6.00   sec   181 MBytes  1.52 Gbits/sec    44    570 KBytes
[  4]   6.00-7.00   sec   172 MBytes  1.45 Gbits/sec   577    426 KBytes
[  4]   7.00-8.00   sec   175 MBytes  1.47 Gbits/sec    70    512 KBytes
[  4]   8.00-9.00   sec   178 MBytes  1.49 Gbits/sec    65    539 KBytes
[  4]   9.00-10.00  sec   192 MBytes  1.61 Gbits/sec   375    327 KBytes

[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.97 GBytes  1.69 Gbits/sec  5861  sender
[  4]   0.00-10.00  sec  1.96 GBytes  1.69 Gbits/sec        receiver

The copying of LXD containers, however, is drastically slower (5-50 Mbit/s).

Another interesting thing: I also have two systems with LXD 2.0.11, one located in Finland and the other on DO FRA1, and copying containers between these hosts is MUCH faster (400-500 Mbit/s). This is of course surprising, as the network bandwidth between these hosts is smaller:

[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  58.7 MBytes   493 Mbits/sec     8   6.03 MBytes
[  4]   1.00-2.00   sec  85.0 MBytes   713 Mbits/sec     0   6.03 MBytes
[  4]   2.00-3.00   sec  88.8 MBytes   745 Mbits/sec     0   6.03 MBytes
[  4]   3.00-4.00   sec  87.5 MBytes   734 Mbits/sec     0   6.03 MBytes
[  4]   4.00-5.00   sec  92.5 MBytes   776 Mbits/sec     0   6.03 MBytes
[  4]   5.00-6.00   sec  92.5 MBytes   776 Mbits/sec     0   6.03 MBytes
[  4]   6.00-7.00   sec  91.2 MBytes   765 Mbits/sec     0   6.03 MBytes
[  4]   7.00-8.00   sec  93.8 MBytes   786 Mbits/sec     0   6.03 MBytes
[  4]   8.00-9.00   sec  92.5 MBytes   776 Mbits/sec     0   6.03 MBytes
[  4]   9.00-10.00  sec  92.5 MBytes   776 Mbits/sec     0   6.03 MBytes

[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   875 MBytes   734 Mbits/sec     8  sender
[  4]   0.00-10.00  sec   874 MBytes   733 Mbits/sec        receiver

All the systems mentioned are Ubuntu 16.04, and LXD is using ZFS as the storage backend.

So the question is: why is copying between 2.21 hosts on a faster network so slow?

That’s pretty slow indeed. I usually see better speeds with zfs-to-zfs here, about 300 Mbit/s, which is pretty reasonable considering I’m doing that over wifi.

One thing you could play with is whether which side initiates the transfer makes a difference, so measuring for:

  • --mode=push
  • --mode=pull
  • --mode=relay
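Something like this would cover all three (container and remote names here are just placeholders, adjust to your setup):

```shell
# Placeholders: replace "c1" with your container and "target" with your remote.
container=c1
remote=target

for mode in push pull relay; do
    echo "mode=$mode"
    time lxc copy "$container" "$remote:$container-$mode" --mode="$mode"
done
```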

Hello,

I did some testing with the different modes and there is no significant difference; push mode is maybe a tiny bit faster, but still very slow.

Also, transferring a large file over SFTP achieves “normal” speeds.

Not sure if this is easily doable for you, but could you create a non-ZFS storage pool on the source server, create a container using it, and then attempt to transfer that to the target? That would cause LXD to use rsync rather than zfs send/receive for the transfer, so it could be useful to confirm whether that’s the problem.

stgraber@castiana:~$ lxc storage create test dir
Storage pool test created
stgraber@castiana:~$ lxc init ubuntu:16.04 transfer-test -s test
Creating transfer-test
stgraber@castiana:~$ lxc copy transfer-test s-sateda:
stgraber@castiana:~$ lxc delete -f transfer-test            
stgraber@castiana:~$ lxc storage delete test
Storage pool test deleted
stgraber@castiana:~$ lxc delete s-sateda:transfer-test
stgraber@castiana:~$ 

OK, transfer speed when copying the dir-backed container was 2-3 times better… So it seems the bottleneck is ZFS on the source server. Any ideas what can be done about that?

It can be a few different things. It could be the zfs send chunking being particularly bad in this scenario; that has caused performance issues when piping zfs send over the network before.

It could also just be some zpool performance problem.

One thing you could try to do is:

root@castiana:~# zfs snapshot castiana/lxd/containers/c1@test
root@castiana:~# time zfs send castiana/lxd/containers/c1@test > /dev/null

real	0m2.073s
user	0m0.004s
sys	0m0.911s
root@castiana:~# zfs destroy castiana/lxd/containers/c1@test
root@castiana:~# 

That would get you your best case scenario transfer speed from the source.
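If you have pv installed, you can also watch the throughput of the stream live instead of timing it (the dataset path below is just an example, use your own):

```shell
# pv shows the current transfer rate of the zfs send stream in real time
zfs send castiana/lxd/containers/c1@test | pv > /dev/null
```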

Ok, I tried with a snapshot of a 6.6 GB container… and it took forever:

root@lxd1:~# time zfs send lxd/containers/netto@test > /dev/null

real 27m23.755s
user 0m0.000s
sys 6m45.448s

In comparison, on my workstation with a ~7 GB container:

root@wz620:~# time zfs send lxd/containers/neos-dev@test > /dev/null

real 0m38.501s
user 0m0.008s
sys 0m20.264s
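Working those timings out as throughput (treating the reported sizes as GiB, which is an approximation):

```shell
# Rough effective throughput of the two zfs send runs above
awk 'BEGIN { printf "lxd1:  %.0f Mbit/s\n", 6.6*1024*8/(27*60+23.755) }'
awk 'BEGIN { printf "wz620: %.0f Mbit/s\n", 7*1024*8/38.501 }'
```

So roughly 33 Mbit/s on the slow server versus ~1.5 Gbit/s on the workstation, which lines up with the 5-50 Mbit/s copy speeds I’m seeing.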

Ok, so that would indicate a ZFS problem rather than an LXD one then.

Seems so. I’ll have to investigate further why ZFS on this server is acting up.