Lxc move error websocket: bad handshake

@TomvB @kriszos which version of LXD were you upgrading from and to when this started?

Can you also confirm that the system time is correct on all servers?

The last change to authenticate was:

But this was in LXD 4.19.

Any ideas @stgraber ?

Time is correct on all servers. The upgrade was done via snap from 4.19.

OK, can you enable debug on the servers affected using:

sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon

And then capture the output when running the affected command (on both source and target servers) from: /var/snap/lxd/common/lxd/logs/lxd.log
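Something like the following may help pull just the failure entries out of the log so the two sides are easier to compare. It is only a rough sketch: it assumes the relevant debug entries contain the text "Failure for", and `lxd_failures` is a made-up helper name, not an LXD command.

```shell
# Print only the failed-operation entries from an LXD log file.
# Assumption: the entries of interest contain the text "Failure for".
# lxd_failures is a hypothetical helper name, not part of LXD.
lxd_failures() {
    # Default to the snap log path mentioned above; a copy of the log
    # can be passed as the first argument instead.
    log="${1:-/var/snap/lxd/common/lxd/logs/lxd.log}"
    grep -F 'Failure for' "$log"
}
```

Run it from a root shell on each server, or pass it a copy of the log, for example `lxd_failures ./lxd.log`.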

As I have no idea what could be causing this.

Not sure, using auto updates. Time is correct on both servers.

@TomvB @kriszos can you check whether any of the other --mode options for lxc copy work? It defaults to “pull”, so try “push” or “relay”.
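A rough way to try each mode in turn is sketched below. The instance name and remote in the usage line are placeholders, `try_copy_modes` is just an illustrative helper, and each attempt copies to a separate destination name so that one successful copy does not block the next attempt.

```shell
# Try "lxc copy" with each transfer mode and report which ones succeed.
# try_copy_modes is a hypothetical helper; arguments are the instance
# name and the remote prefix (including the trailing colon).
try_copy_modes() {
    instance="$1"
    remote="$2"
    for mode in pull push relay; do
        # Copy to <remote><instance>-<mode> so the attempts do not collide.
        if lxc copy "$instance" "${remote}${instance}-${mode}" --mode="$mode"; then
            echo "$mode: ok"
        else
            echo "$mode: failed"
        fi
    done
}
```

For example, `try_copy_modes c1 host-2:` would attempt three copies of a placeholder instance `c1` to a placeholder remote `host-2`. Any copies that succeed need deleting afterwards.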

@stgraber and I were wondering if this might be a network issue (perhaps MTU) that is interfering with the websocket upgrade, as the errors suggest that TLS certificate negotiation has succeeded but that it’s failing afterwards, during the websocket upgrade.

Also, can you confirm whether you are using the same version of the client and server (i.e. 4.20)?

--mode=push works for me.
The MTU size is 1400 on the internal nics.

I am using the LXD snap on the host to manage the containers. (4.20)

OK, so this could be an MTU mismatch somewhere in your network that only shows up when the direction of the transmission is reversed.
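One way to check the path in each direction is to send pings with fragmentation prohibited, sized to exactly fill the MTU. This is a sketch that assumes the Linux iputils `ping` (for its `-M do` option); `mtu_payload` and `probe_mtu` are illustrative helper names, and the address in the usage line is a placeholder.

```shell
# Largest ICMP payload that fits in a single packet for a given MTU.
# Overhead: 20 bytes (IPv4) or 40 bytes (IPv6) of IP header, plus 8 bytes of ICMP.
mtu_payload() {
    if [ "${2:-4}" = 6 ]; then
        echo $(( $1 - 48 ))
    else
        echo $(( $1 - 28 ))
    fi
}

# Send 3 probes to a peer with fragmentation prohibited.
# Arguments: peer address, MTU to test, IP version (4 or 6, default 4).
# Assumes iputils ping on Linux; probe_mtu is a hypothetical helper.
probe_mtu() {
    ipver="${3:-4}"
    ping "-$ipver" -M do -c 3 -s "$(mtu_payload "$2" "$ipver")" "$1"
}
```

For example, `probe_mtu 192.0.2.10 1400 4` run from each server towards the other. If the probes pass one way but fail the other, that would point at a path MTU problem in one direction.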

I don’t think so. The MTU size is 1400 on both servers. I haven’t changed anything on the LXD servers in the past year.

Hrm, I’m not sure then. Something is apparently interfering with the websocket handshake in one direction.

Can you do the exercise that @kriszos did here: Lxc move error websocket: bad handshake - #16 by tomp

I also don’t think this is an MTU issue. On my network the LXD nodes listen on dummy interfaces with the MTU set to 1406, and connect to the other nodes over IPsec-secured GRE tunnels, also with MTU 1406. Routing is done via OSPF using the BIRD daemon. To test, I also changed the OSPF costs so that traffic would take a different route; no change.
The nodes are able to communicate via their public addresses without fragmentation, with the standard MTU of 1500.

Regarding the --mode parameter: I have a cluster, so I get this error message:

Error: The --mode flag can’t be used with --target

OK, so the error here is:

t=2021-12-08T18:08:21+0100 lvl=dbug msg="Failure for websocket operation: 44ffa23a-3c38-4144-827c-5c2fbb45f4bd: Timed out waiting for connections"

That check was introduced in LXD 4.20.

Did you also record the log for the sender/receiver? We’re missing at least part of it.

It may be that this error is a side effect of the websocket negotiation failure you’re seeing, as we would then expect the other side to time out while waiting for the connection.

Destination node (host-2). Ignore the user and server names.

OK, so we can see the error from the destination member trying to connect back to the source:

t=2021-12-08T19:56:05+0100 lvl=dbug msg="Failure for task operation: 303cd2b1-9752-4cb1-bd84-ed537732c78f: Error transferring instance data: websocket: bad handshake"

I can also see above:

t=2021-12-08T19:56:05+0100 lvl=dbug msg="Failure for task operation: ba3fa1e7-51b5-484c-b8f5-855035d9ebc0: Error transferring instance data: Unable to connect to: [2a01:....]:8443"

Can you show the output of sudo ss -tlpn on both the source and destination servers, please?
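If the full output is long, a filter along these lines would narrow it to the sockets listening on port 8443, the port in the error above. It is a sketch that assumes the usual `ss -tlpn` column layout, with the local address and port in the fourth column; `lxd_listeners` is a made-up helper name.

```shell
# Keep the header line plus any socket whose local port is 8443.
# Reads "ss -tlpn" output on stdin. Assumes the local address:port
# is the 4th whitespace-separated column.
# lxd_listeners is a hypothetical helper name.
lxd_listeners() {
    awk 'NR == 1 || $4 ~ /:8443$/'
}
```

Usage would be `sudo ss -tlpn | lxd_listeners`. The thing to look for is whether LXD is listening on the IPv6 address that appears in the error, or only on a different address.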

Also, can you get the last part of the IPv6 address used on that error line?