Lxc move error websocket: bad handshake

--mode=push works for me.
The MTU size is 1400 on the internal NICs.

I am using the LXD snap (4.20) on the host to manage the containers.

OK, so this could be an MTU mismatch somewhere in your network when the direction of the transmission is reversed.
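One way to test for that is to ping in each direction with the don't-fragment bit set. With a 1400-byte MTU, the largest ICMP payload that should get through unfragmented is 1372 bytes (1400 minus 20 bytes of IPv4 header and 8 bytes of ICMP header); the address below is a placeholder for the other server:

  # Largest payload that fits in a 1400-byte MTU; repeat from the other side
  ping -M do -s 1372 -c 4 192.168.1.2
  # -s 1373 should then fail with "Message too long" if the path MTU really is 1400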

I don't think so. The MTU size is 1400 on both servers. I haven't changed anything on the LXD servers in the past year.

Hrm, I'm not sure then; something is apparently interfering with the websocket handshake in one direction.

Can you do the exercise that @kriszos did in Lxc move error websocket: bad handshake - #16 by tomp?

I also don't think this is an MTU issue. On my network the LXD nodes listen on dummy interfaces with the MTU set to 1406, and connect to the other nodes over IPsec-secured GRE tunnels, also with MTU 1406. Routing is done via OSPF with the BIRD daemon. As a test I also changed the OSPF costs so that communication would go by a different route; no change.
The nodes are able to communicate via their public addresses without fragmentation with the standard MTU of 1500.
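For what it's worth, tracepath reports the discovered path MTU hop by hop, so it's a quick way to double-check this (one of my node addresses shown as an example):

  tracepath -n 172.31.255.6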

Regarding the --mode parameter, I have a cluster, so I get an error message:

Error: The --mode flag can't be used with --target
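For context: --mode selects how the transfer runs between two standalone remotes, while a move within a cluster is requested with --target, and LXD does not accept the two together. Roughly (instance, remote, and member names below are placeholders):

  # Between standalone remotes: push mode is allowed
  lxc move local:c1 remote2:c1 --mode=push

  # Within a cluster: move to another member, no --mode accepted
  lxc move c1 --target host-2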

OK so the error here is:

t=2021-12-08T18:08:21+0100 lvl=dbug msg="Failure for websocket operation: 44ffa23a-3c38-4144-827c-5c2fbb45f4bd: Timed out waiting for connections"

Which was introduced in LXD 4.20

Did you also record the log for the sender/receiver? (We're missing at least part of it.)

It may be that this error is a side effect of the websocket negotiation failure you're seeing, as we would then expect the other side to time out while waiting for the connection.

Destination node (host-2). Ignore the user and server names.

OK so we can see the error from the destination member trying to connect back to the source:

t=2021-12-08T19:56:05+0100 lvl=dbug msg="Failure for task operation: 303cd2b1-9752-4cb1-bd84-ed537732c78f: Error transferring instance data: websocket: bad handshake"

I can also see above:

t=2021-12-08T19:56:05+0100 lvl=dbug msg="Failure for task operation: ba3fa1e7-51b5-484c-b8f5-855035d9ebc0: Error transferring instance data: Unable to connect to: [2a01:....]:8443"

Can you show the output of sudo ss -tlpn on both the source and destination servers please.
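For reference, the lines of interest are the lxd listeners on port 8443; the output below is only illustrative of the shape (a * local address means it is listening on all IPv4 and IPv6 addresses, and the PID will differ):

  sudo ss -tlpn | grep 8443
  LISTEN 0 4096 *:8443 *:* users:(("lxd",pid=1234,fd=15))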

Also, can you get the last part of the IPv6 address used on that error line?
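If you don't recognise it, something like this will show whether any local interface carries an address in that range:

  ip -6 addr show | grep 2a01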

I don’t use that IPv6 address, but only 192.168.1.x. Why is it trying to connect over IPv6?

I'm not sure, but could you please provide the output of the commands I asked for?

Can you please also supply the output of lxc config show for both hosts?

A working theory I have at the moment is that LXD tries to connect to all of the available IPs sequentially (it does this in some places; I'm not certain this is one of them, but it's a strong possibility given what we are seeing). This can take time when some IPs are not listening, for example if a firewall is silently dropping the requests, causing a timeout rather than an immediate connection refused.

Previously it would have kept trying until it found the right one, but with the listener timeouts added in 4.20, the timeout is hit before the correct IP is found, and by the time the destination member does try to connect to it, the websocket on the source has already been closed.

One thing you could do to test this theory is to make sure that inbound requests to that IPv6 address are rejected rather than dropped (if they are being dropped now), so that the LXD destination will immediately move on to the next IP. If that works, it will confirm the theory.
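With ip6tables, for example, a rule along these lines would make those connection attempts fail immediately with a TCP reset instead of timing out (a sketch only; adapt it to whatever firewall you are actually running):

  sudo ip6tables -I INPUT -p tcp --dport 8443 -j REJECT --reject-with tcp-reset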

root@atl1:~# lxc config show
config:
  cluster.https_address: 172.31.255.10:8443
  core.bgp_asn: "65000"
  core.https_address: 172.31.255.10:8443
  core.trust_password: true
  storage.backups_volume: local/backups
  storage.images_volume: local/images

root@nazwa1:~# lxc config show
config:
  cluster.https_address: 172.31.255.6:8443
  core.bgp_asn: "65000"
  core.https_address: 172.31.255.6:8443
  core.trust_password: true
  storage.backups_volume: local/backups
  storage.images_volume: local/images

Are you able to try this?

But I am actively using IPv6.

You've not shown me the output of ss that I asked for, but based on the config you have shared, I'm assuming that LXD isn't listening on the IPv6 address.

So, based on my comments above, I'm thinking that perhaps you have a firewall that is blocking inbound connections to port 8443 on the IPv6 address (you've not confirmed this). One way to check whether this theory is correct would be to add a rule to your firewall (if you have one on that host) so that connections to the IPv6 address on port 8443 are rejected rather than dropped, letting LXD quickly fail over to the IPv4 address.
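A quick way to tell which of the two cases you are in (the address below is a placeholder for the node's real IPv6 address): an immediate "connection refused" means the packets are being rejected, while a long hang before a timeout means they are being silently dropped:

  nc -6 -zv 2001:db8::1 8443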

I’m not proposing you stop using IPv6.

node-1 and node-2 (node-2 was reinstalled today).

config:
  core.https_address: '[::]:8443'
  core.trust_password: true

Output of ss -tlpn:
https://pastebin.com/BNVfPgxK

Firewall ports are open between the servers.
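Worth noting from that config: '[::]:8443' is a wildcard, so LXD listens on every IPv4 and IPv6 address on the node, which would explain the destination attempting the IPv6 address. If you only want it reachable on the 192.168.1.x network, the listener can be pinned to a single address (the IP below is a placeholder for the node's own address):

  lxc config set core.https_address 192.168.1.1:8443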