Since yesterday I have a problem copying containers from lxd-01 to lxd-02.
lxc remote list
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
|      NAME       |                   URL                    |   PROTOCOL    |  AUTH TYPE  | PUBLIC | STATIC | GLOBAL |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| lxd-02          | https://192.168.1.2:8443                 | lxd           | tls         | NO     | NO     | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| images          | https://images.linuxcontainers.org       | simplestreams | none        | YES    | NO     | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| local (current) | unix://                                  | lxd           | file access | NO     | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| ubuntu          | Ubuntu Cloud Images                      | simplestreams | none        | YES    | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| ubuntu-daily    | Ubuntu Cloud Images                      | simplestreams | none        | YES    | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
It is trying to use the public ip from lxd-01:
Error: Failed instance creation: Error transferring instance data: Unable to connect to: 81.x.x.x:8443
Command:
lxc copy CT lxd-02:CT-backup
This has worked well for 1.5 years. Out of nowhere, LXD now tries to copy the container via lxd-01's public address instead of the internal address that is reachable from lxd-02.
Network lxd-01: (3: = migration IP, 2: = public ip)
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether x:x:x:x brd ff:ff:ff:ff:ff:ff
inet 81.x.x.x/32 scope global enp0s31f6
valid_lft forever preferred_lft forever
3: vlan4@enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default qlen 1000
link/ether x:x:x:x brd ff:ff:ff:ff:ff:ff
inet 192.168.1.1/24 brd 192.168.1.255 scope global vlan4
valid_lft forever preferred_lft forever
I checked the firewall and ICMP; all ok. I re-added 192.168.1.2 with lxc remote add and everything seems fine. I'm sure it's not a port or network issue. Please check this.
Nothing special in the debug log. Just cancelling the job.
The way lxc copy works is that it instructs the target server (lxd-02) to connect to the source (local) to fetch the instance.
To do that, LXD fetches the addresses of the source (visible in lxc info local:) and feeds those to the remote server. When multiple addresses are present, we iterate through them.
I know that @tomp made a recent change to avoid needless retries in some cases, but this shouldn’t apply to cases where an address isn’t reachable.
So I'd recommend running lxc info local: and then looking at what's listed under addresses in the environment section, to check that those addresses are correct and that at least one is properly reachable from the target server.
If not, then that’s the issue, but if you see both the public address (the one that’s failing) and a private address which should have worked, then it may be a regression in @tomp’s change.
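To illustrate, the relevant part of the output can be filtered like this (the YAML below is a made-up sample mirroring this report, not real output from either server):

```shell
# Made-up sample of the relevant section of `lxc info local:` output;
# real output will list your server's actual addresses.
cat > /tmp/lxc-info-sample.yaml <<'EOF'
environment:
  addresses:
  - 81.x.x.x:8443
  - 192.168.1.1:8443
EOF

# Pull out the candidate addresses that get handed to the target:
sed -n 's/^  - //p' /tmp/lxc-info-sample.yaml
```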
A temporary workaround may be to flip the direction using --mode=push or --mode=relay, which will always work at the potential cost of some added CPU/bandwidth usage.
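With the command from above, the workaround would look like this (assuming the same container and remote names):

```shell
# Push from the source instead of having the target pull:
lxc copy CT lxd-02:CT-backup --mode=push

# Or relay the data through the client machine:
lxc copy CT lxd-02:CT-backup --mode=relay
```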
Ok, that sounds like a regression in @tomp’s change.
The updated logic does not try the other addresses if an address results in a "late" error. In this case it would definitely be expected to keep trying, but the fact that the error only mentions a single address shows that it's not doing that.
I’m off today but I can look into this, possibly reverting the change we made.
Until this is done, using --mode=relay should make things behave.
The change attempts to create a remote operation and, if that fails, moves on to the next IP; but if it succeeds, it starts the actual transfer using the same address.
But it sounds like I don't understand the intricacies of the remote operation's various flavours.
Sounds like that needs to be reverted, and then we should expand our test suite to cover the normal retry scenario as well as the scenario the PR fixed: the connection itself succeeds but the operation fails and shouldn't be retried.