Lxc move error websocket: bad handshake

OK so we can see the error from the destination member trying to connect back to the source:

t=2021-12-08T19:56:05+0100 lvl=dbug msg="Failure for task operation: 303cd2b1-9752-4cb1-bd84-ed537732c78f: Error transferring instance data: websocket: bad handshake"

I can also see above:

t=2021-12-08T19:56:05+0100 lvl=dbug msg="Failure for task operation: ba3fa1e7-51b5-484c-b8f5-855035d9ebc0: Error transferring instance data: Unable to connect to: [2a01:....]:8443"

Can you show the output of sudo ss -tlpn on both the source and destination servers, please?
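As a quick sketch, you can narrow that ss output to just the LXD port with a socket filter (the port 8443 here is the one from the errors above; plain ss -tlpn shows everything):

```shell
# List listening TCP sockets on port 8443 only, with owning process info.
# The 'sport = :8443' filter is standard ss filter syntax and is optional.
sudo ss -tlpn 'sport = :8443'
```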

Also, can you get the last part of the IPv6 address used on that error line?

I don’t use that IPv6 address, but only 192.168.1.x. Why is it trying to connect over IPv6?

I’m not sure, but please could you provide the output of the commands I asked for.

Please can you also supply output of lxc config show for both hosts?

A working theory I have at the moment is that LXD is trying to connect to all of the available IPs sequentially (which it does in some places, and although I’m not certain this is one of them, it’s a strong possibility given what we are seeing). This can take time when some IPs are not listening (perhaps a firewall is blocking the request, causing a timeout rather than a connection refused).

Previously it would have kept trying until it found the right one, but with the listener timeouts added in 4.20, the timeout is hit before the correct IP is found, and by the time the destination member does try to connect on the correct IP, the websocket on the source has already been closed.
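The difference the theory relies on is observable from the shell: a reachable-but-closed port fails instantly with “connection refused”, while a firewalled (dropped) one hangs until a timeout. A rough probe sketch, assuming bash (for /dev/tcp) and placeholder candidate addresses:

```shell
#!/bin/bash
# Probe each candidate address on port 8443 with a short timeout.
# A REJECTed/closed port returns "Connection refused" almost instantly;
# a DROPped packet makes the probe hang until 'timeout' kills it.
for addr in 192.168.1.1 192.168.1.2; do   # placeholder candidate IPs
    if timeout 3 bash -c "echo > /dev/tcp/$addr/8443" 2>/dev/null; then
        echo "$addr: open"
    else
        echo "$addr: closed or filtered"
    fi
done
```

If one address prints only after the full 3 seconds, that address is being dropped rather than rejected.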

One thing you could do to try and prove this theory is to make sure that inbound requests to that IPv6 address are rejected rather than dropped (if they are currently being dropped), so that the LXD destination will immediately move on to the next IP. If that works, it’ll prove my theory.
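If plain ip6tables is in use, a rule along these lines would turn the drop into an immediate reject (a sketch only; narrow it to the relevant address or interface on your host):

```shell
# Insert (not append) a REJECT rule so inbound IPv6 connections to port 8443
# fail immediately with a TCP RST instead of silently timing out.
ip6tables -I INPUT 1 -p tcp --dport 8443 -j REJECT --reject-with tcp-reset
```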

root@atl1:~# lxc config show
config:
  cluster.https_address: 172.31.255.10:8443
  core.bgp_asn: "65000"
  core.https_address: 172.31.255.10:8443
  core.trust_password: true
  storage.backups_volume: local/backups
  storage.images_volume: local/images

root@nazwa1:~# lxc config show
config:
  cluster.https_address: 172.31.255.6:8443
  core.bgp_asn: "65000"
  core.https_address: 172.31.255.6:8443
  core.trust_password: true
  storage.backups_volume: local/backups
  storage.images_volume: local/images

Are you able to try this?

But I am actively using ipv6

You’ve not shown me the output of ss that I asked for, but based on the config you have shown me I’m assuming that LXD isn’t listening on the IPv6 address.

So based on my comments here and here, I’m thinking that perhaps you have a firewall that is blocking inbound connections to port 8443 on the IPv6 address (you’ve not confirmed this). One way we could check whether my theory is correct is for you to add a rule to your firewall (if you have one on that host) so that connections to the IPv6 address on port 8443 are rejected rather than dropped, allowing LXD to quickly fail over to the IPv4 address.

I’m not proposing you stop using IPv6.

node-1 and node-2. (node-2 reinstalled today)

config:
  core.https_address: '[::]:8443'
  core.trust_password: true

Output ss -tlpn:
https://pastebin.com/BNVfPgxK

Firewall ports are open between the servers.

OK, so LXD will likely use any of the IPs bound on that host, and you’ve tested that they are all open.

Do you see errors like Error transferring instance data: Unable to connect to when the problem occurs?

Looking at the code where that error comes from, are you using hostnames for your remote?

The error is:

  • https://publicip:8443: Error transferring instance data: Unable to connect to: publicip:8443
  • https://[ipv6]:8443: Error transferring instance data: Unable to connect to: [ipv6]:8443
  • https://192.168.1.1:8443: Error transferring instance data: websocket: bad handshake
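That pattern is consistent with a name resolving to several addresses, each tried in turn. You can see everything a name resolves to, in resolution order, with getent (the hostname here is a placeholder; substitute your remote’s name):

```shell
# Show every address a hostname resolves to, in resolution order.
# A client trying them sequentially would hit the public and IPv6
# addresses before the internal IPv4 one, matching the errors above.
getent ahosts localhost   # substitute your remote's hostname
```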

Yes, I’m using the hostname.

+---------+--------------------------+----------+-----------+--------+--------+--------+
|  NAME   |           URL            | PROTOCOL | AUTH TYPE | PUBLIC | STATIC | GLOBAL |
+---------+--------------------------+----------+-----------+--------+--------+--------+
| host-02 | https://192.168.1.2:8443 | lxd      | tls       | NO     | NO     | NO     |
+---------+--------------------------+----------+-----------+--------+--------+--------+

lxc copy container host-02:container for example.

Reinstalled host-02 today and no difference after the installation. But I’m sure nothing has changed on my side.

Are these IPs all bound on the source host and reachable from the destination?

I’m trying to understand why it can’t connect on the first two IPs it tried.

The following rule is enabled (with ufw):

Port 8443 from 192.168.1.1 to 192.168.1.2 and from 192.168.1.2 to 192.168.1.1
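For reference, ufw rules matching that description would look something like this (a sketch; run on the 192.168.1.1 host, and mirror it on 192.168.1.2 with the addresses swapped):

```shell
# Allow inbound TCP 8443 (LXD) from the other node's internal address.
ufw allow proto tcp from 192.168.1.2 to 192.168.1.1 port 8443
```

Note that ufw’s default for non-matching inbound traffic is to drop, not reject, which is relevant to the timeout theory above.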

with this config:

config:
  core.https_address: '[::]:8443'
  core.trust_password: true

(re)added the nodes to the remote list with:

lxc remote add hostname ip

Nothing special. I added the vlan for this network in /etc/netplan/netcfg (host-2 example)

vlans:
  vlan:
    id: 4
    mtu: 1400
    link: enp2s0
    addresses: [ "192.168.1.2/24" ]

I’m using the name from the remote list to copy containers (not DNS).

I do not allow Public IPv4/6 communication between the servers. It’s not defined in the lxc remote list.

Ah, so that somewhat changes your earlier statement that “firewall ports are open between the servers”, from when we were discussing whether the IPv6 address was allowed and, if not, whether you could make your firewall reject rather than drop.

Can you do this?

I do not use or allow public IPv6 communication between the servers. The servers are standalone with an internal switch in between, with only IPv4 configured.

A public IPv4 and IPv6 address is configured on the nodes. I do not use this for copying containers and communication should not go through the public addresses. This is by default not allowed.

The VLAN (internal switch) is dedicated for containers backups with lxc copy.

@stgraber Do you have any idea? I have no idea what has been changed in the past few weeks.

I added the following rule to the top of my firewall on the destination node:
ip6tables -A INPUT -j REJECT
and I still got

Error: Copy instance operation failed: Failed instance creation: Error transferring instance data: websocket: bad handshake

I’ve already explained my working theory as to the problem and the change here

And here

If you have IPv6 addresses bound to the host, LXD will try and use them. If the firewall is then dropping rather than rejecting the request, this could take longer than 10s and cause the websocket listeners to time out.

I was asking if you could set up some IPv6 reject rules to test that theory before we start thinking about a fix in LXD. Does that make sense?