Since yesterday I have a problem copying containers from lxd-01 to lxd-02.
lxc remote list
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
|      NAME       |                   URL                    |   PROTOCOL    |  AUTH TYPE  | PUBLIC | STATIC | GLOBAL |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| lxd-02          | https://192.168.1.2:8443                 | lxd           | tls         | NO     | NO     | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| images          | https://images.linuxcontainers.org       | simplestreams | none        | YES    | NO     | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| local (current) | unix://                                  | lxd           | file access | NO     | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| ubuntu          | Ubuntu Cloud Images                      | simplestreams | none        | YES    | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| ubuntu-daily    | Ubuntu Cloud Images                      | simplestreams | none        | YES    | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
It is trying to use the public ip from lxd-01:
Error: Failed instance creation: Error transferring instance data: Unable to connect to: 81.x.x.x:8443
Command:
lxc copy CT lxd-02:CT-backup
This has worked well for 1.5 years. Out of nowhere, LXD now tries to copy the container via lxd-01's public address instead of the internal address that is reachable from lxd-02.
Network lxd-01: (3: = migration IP, 2: = public ip)
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether x:x:x:x brd ff:ff:ff:ff:ff:ff
inet 81.x.x.x/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
3: vlan4@enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default qlen 1000
link/ether x:x:x:x brd ff:ff:ff:ff:ff:ff
inet 192.168.1.1/24 brd 192.168.1.255 scope global vlan4
valid_lft forever preferred_lft forever
I checked the firewall and ICMP; all ok. I re-added 192.168.1.2 with lxc remote add and everything seems fine. I'm sure it's not a port or network issue. Please check this.
Nothing special in the debug log. Just cancelling the job.
The way lxc copy works is that it instructs the target server (lxd-02) to connect to the source (local) to fetch the instance.
To do that, LXD fetches the addresses of the source (visible in lxc info local:) and feeds those to the remote server. When multiple addresses are present, we iterate through them.
I know that @tomp made a recent change to avoid needless retries in some cases, but this shouldn’t apply to cases where an address isn’t reachable.
So I'd recommend running lxc info local: and then looking at what's listed under addresses in the environment section, to check that those addresses are correct and that at least one is properly reachable from the target server.
If not, then that’s the issue, but if you see both the public address (the one that’s failing) and a private address which should have worked, then it may be a regression in @tomp’s change.
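To illustrate, the relevant part of the output can be filtered like this (the YAML below is a made-up sample mirroring this report, not real output from either server):

```shell
# Made-up sample of the relevant section of `lxc info local:` output;
# real output will list your server's actual addresses.
cat > /tmp/lxc-info-sample.yaml <<'EOF'
environment:
  addresses:
  - 81.x.x.x:8443
  - 192.168.1.1:8443
EOF

# Pull out the candidate addresses that get handed to the target:
sed -n 's/^  - //p' /tmp/lxc-info-sample.yaml
```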
A temporary workaround may be to flip the direction using --mode=push or --mode=relay, which will always work at the potential cost of some added CPU/bandwidth usage.
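With the command from above, the workaround would look like this (assuming the same container and remote names):

```shell
# Push from the source instead of having the target pull:
lxc copy CT lxd-02:CT-backup --mode=push

# Or relay the data through the client machine:
lxc copy CT lxd-02:CT-backup --mode=relay
```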
Ok, that sounds like a regression in @tomp’s change.
The updated logic does not try the other addresses if an address results in a "late" error. In this case it would definitely be expected to keep trying, but the fact that the error only mentions a single address shows that it's not doing that.
I’m off today but I can look into this, possibly reverting the change we made.
Until this is done, using --mode=relay should make things behave.
The change attempts to create a remote operation and, if that fails, moves on to the next IP; but if it succeeds, it starts the actual transfer using the same address.
But it sounds like I don't understand the intricacies of the remote operation's various flavours.
Sounds like that needs to be reverted, and then we should expand our test suite to cover the normal retry scenario as well as the scenario the PR fixed: the connection itself succeeds but the operation fails and shouldn't be retried.