Lxc move error websocket: bad handshake

OK so LXD will likely use any of the IPs bound on that host, and you’ve tested they are all open.

Do you see errors like Error transferring instance data: Unable to connect to when the problem occurs?

Looking at the code where that error comes from, are you using host names for your remote?

The error is:

  • https://publicip:8443: Error transferring instance data: Unable to connect to: publicip:8443
  • https://[ipv6]:8443: Error transferring instance data: Unable to connect to: [ipv6]:8443
  • https://192.168.1.1:8443: Error transferring instance data: websocket: bad handshake

Yes, I’m using the hostname.

+---------+--------------------------+----------+-----------+--------+--------+--------+
|  NAME   |           URL            | PROTOCOL | AUTH TYPE | PUBLIC | STATIC | GLOBAL |
+---------+--------------------------+----------+-----------+--------+--------+--------+
| host-02 | https://192.168.1.2:8443 | lxd      | tls       | NO     | NO     | NO     |
+---------+--------------------------+----------+-----------+--------+--------+--------+

lxc copy container host-02:container for example.

I reinstalled host-02 today and saw no difference after the installation. But I’m sure nothing has changed on my side.

Are these IPs all bound on the source host and reachable from the destination?

I’m trying to understand why it can’t connect on the first 2 IPs it tried.

The following rule is enabled (with ufw):

Port 8443 from 192.168.1.1 to 192.168.1.2 and from 192.168.1.2 to 192.168.1.1
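
Expressed as ufw commands, the rules described above might look like this (a sketch, assuming standard ufw syntax; run each rule on the host it protects):

```shell
# On 192.168.1.2: allow LXD traffic (port 8443) from the peer
sudo ufw allow proto tcp from 192.168.1.1 to 192.168.1.2 port 8443

# On 192.168.1.1: the mirror rule for the other direction
sudo ufw allow proto tcp from 192.168.1.2 to 192.168.1.1 port 8443
```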

with this config:

config:
  core.https_address: '[::]:8443'
  core.trust_password: true

(re)added the nodes to the remote list with:

lxc remote add hostname ip

Nothing special. I added the VLAN for this network in /etc/netplan/netcfg (host-02 example):

vlans:
  vlan:
    id: 4
    mtu: 1400
    link: enp2s0
    addresses: [ "192.168.1.2/24" ]
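
After editing the netplan file, the change can be validated and applied with the standard netplan commands (shown as a sketch; `netplan try` rolls back automatically if connectivity breaks and you don't confirm):

```shell
sudo netplan try    # test the new config with automatic rollback on timeout
sudo netplan apply  # apply it permanently
```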

I’m using the name from the remote list to copy containers (not DNS).

I do not allow public IPv4/IPv6 communication between the servers. It’s not defined in the lxc remote list.

Ah, so that somewhat changes your earlier statement that “firewall ports are open between the servers.” We were discussing whether the IPv6 address was allowed and, if not, whether you could make your firewall reject rather than drop.

Can you do this?

I do not use or allow public IPv6 communication between the servers. The servers are standalone, with an internal switch in between that has only IPv4 configured.

Public IPv4 and IPv6 addresses are configured on the nodes, but I do not use them for copying containers, and communication should not go through the public addresses. That is not allowed by default.

The VLAN (internal switch) is dedicated to container backups with lxc copy.

@stgraber Do you have any idea? I have no idea what has been changed in the past few weeks.

I added the following rule to the top of my firewall on the destination node:

ip6tables -A INPUT -j REJECT

and I still got:

Error: Copy instance operation failed: Failed instance creation: Error transferring instance data: websocket: bad handshake

I’ve already explained my working theory as to the problem and the change here

And here

If you have IPv6 addresses bound to the host, LXD will try to use them. If the firewall then drops rather than rejects the request, this could take longer than 10s and cause the websocket listeners to time out.

I was asking if you could set up some IPv6 reject rules to test that theory before we start thinking about a fix in LXD. Does that make sense?

It needs to be on the source server though, as it’s in pull mode.
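
The drop-vs-reject difference is easy to observe from the client side: a REJECT rule answers immediately (connection refused), while a DROP rule stays silent until the connect attempt times out. A minimal Python sketch of that distinction (the `probe` helper and its timeout are illustrative, not LXD's actual code):

```python
import socket

def probe(host, port, timeout=10.0):
    """Classify how host:port responds to a TCP connect attempt.

    Returns "open" (something accepted the connection), "rejected"
    (an RST or ICMP unreachable came back immediately -- the fast
    failure a REJECT rule produces), or "dropped" (no answer at all
    until the timeout expired -- the slow case a DROP rule produces).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "rejected"   # fails in milliseconds
    except OSError:
        return "dropped"    # burned the whole timeout (or unreachable)

# A local port with no listener behaves like a REJECT rule:
print(probe("127.0.0.1", 1, timeout=2.0))  # prints "rejected" if nothing listens on port 1
```

A dropped address stalls the caller for the full timeout, which is exactly the delay that can push the websocket listener past its 10s window.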

I understand. But it is hard for me to configure this test config right now.
Reminds me a bit of this: Lxc copy to internal IP instead of public IP

OK hopefully @kriszos can try it.

It’s a similar error, but a different cause, to that earlier issue.
The reason you experience these issues is that LXD tries to discover a working set of communication channels by trying all IPs from the source on the destination.

Even before this change (introducing the timeout), you would likely have experienced a 10-20s delay before the transfer started, due to having IPs bound that are not reachable and drop (rather than reject) connections.
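
That discovery behaviour can be sketched as follows (the function name, candidate list, and per-address timeout are made up for illustration; this is not LXD's actual implementation):

```python
import socket

def first_reachable(candidates, per_addr_timeout=10.0):
    """Try candidate (host, port) pairs in order and return the first
    one that accepts a TCP connection.

    Each candidate that silently drops traffic costs the full
    per-address timeout, so a couple of unreachable public or IPv6
    addresses ahead of the right one adds 10-20s before the transfer
    can even start. Rejected candidates fail in milliseconds.
    """
    for host, port in candidates:
        try:
            with socket.create_connection((host, port), timeout=per_addr_timeout):
                return (host, port)
        except OSError:
            continue  # refused or timed out: move on to the next candidate
    return None
```

With REJECT rules in place, the dead candidates fail fast instead of consuming the timeout, which is why the firewall behaviour matters here.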


So, just to cover all possibilities, I now have the following rules on both the source and destination nodes:

ip6tables -A INPUT -j REJECT
ip6tables -A FORWARD -j REJECT
ip6tables -A OUTPUT -j REJECT

I still got the bad handshake error.


Can you look to see if you get similar errors in the debug log as @TomvB gets? It might tell you which IP it’s failing on.

Source node “nazwa1” has IP 172.31.255.6/32.
Destination node “atl1” has IP 172.31.255.10/32.
They are loopback/dummy interfaces, and LXD should use only them to communicate, as specified in my “lxc config show” posted above; it seems LXD is indeed using them.
The only IPv6 addresses that I found are in the YAML that, I believe, contains the profile of the moved container.

The odd lines I found contain the addresses configured on my GRE tunnel connecting source and destination, 172.20.20.92/30:

t=2021-12-09T20:55:31+0100 lvl=dbug msg="Allowing untrusted GET" ip=172.20.20.94:53604 url="/1.0/operations/3436b227-9ad5-4edf-ab9d-1544250aab61/websocket?secret=e4db324e6313e9a6873084dce2950ec51e675ca79ccf0fb6443e0a93259586fd"
t=2021-12-09T20:55:07+0100 lvl=dbug msg="Handling API request" ip=172.20.20.93:55096 method=GET protocol=cluster url="/1.0/events?project=default&target=atl1" username=
t=2021-12-09T20:55:07+0100 lvl=dbug msg="Handling API request" ip=172.20.20.93:55098 method=POST protocol=cluster url="/1.0/instances?project=default&target=atl1" username=
t=2021-12-09T20:55:35+0100 lvl=dbug msg="Handling API request" ip=172.20.20.93:55106 method=GET protocol=cluster url="/1.0/operations/ed2cf7bb-d67d-4ad3-83ac-6b453c0a9756?project=default&target=atl1" username=

Though I am not very familiar with the syntax of these logs, so I may have missed something.

Full log: https://pastebin.com/ZaqgZYvG

OK, so it sounds like a similar issue, just with extra IPv4 addresses rather than the IPv6 addresses @TomvB has.

Making inbound connections to port 8443 on those other IPs be rejected rather than dropped (assuming LXD on the destination is hitting the source server’s firewall) should fix it temporarily, until I can figure out how to make LXD smarter in these situations.

I don’t think it is that simple. I created the following rules
on nazwa1:

iptables --append INPUT --protocol tcp --dst 172.31.255.6 --dport 8443 --jump ACCEPT
iptables --append INPUT --protocol tcp --dport 8443 --jump REJECT

on atl1:

iptables --append INPUT --protocol tcp --dst 172.31.255.10 --dport 8443 --jump ACCEPT
iptables --append INPUT --protocol tcp --dport 8443 --jump REJECT

and nothing changed, so I added another rule to make sure each node uses the correct source IP when talking to the other node.

on nazwa1:

iptables -t nat -A POSTROUTING -p tcp --dport 8443 -j SNAT --to 172.31.255.6

on atl1:

iptables -t nat -A POSTROUTING -p tcp --dport 8443 -j SNAT --to 172.31.255.10

and still nothing changed, but 172.20.20.92/30 disappeared from the debug log.
I don’t think that was the issue.

@TomvB Can you try setting for each server:

lxc config set core.https_address <ip address>:8443

To the server IP you want LXD to listen on rather than the default wildcard.


Yes, in cluster mode LXD should really only be using the configured member’s address rather than trying them all (as in @TomvB’s situation).

One thing that is suspicious in the earlier log you sent from the destination host is:

t=2021-12-06T13:40:30+0100 lvl=info msg="Creating container" ephemeral=false instance=lxd-move-of-8ab127d5-46d7-4501-95e0-706e29e41678 instanceType=container project=default
t=2021-12-06T13:40:38+0100 lvl=dbug msg="Database error: protocol.Error{Code:5, Message:\"database is locked\"}"
t=2021-12-06T13:40:38+0100 lvl=dbug msg="Retry failed db interaction (database is locked)"
t=2021-12-06T13:40:50+0100 lvl=info msg="Created container" ephemeral=false instance=lxd-move-of-8ab127d5-46d7-4501-95e0-706e29e41678 instanceType=container project=default
t=2021-12-06T13:40:51+0100 lvl=dbug msg="Database error: Failed to create operation: database is locked"
t=2021-12-06T13:40:51+0100 lvl=dbug msg="Retry failed db interaction (Failed to create operation: database is locked)"
t=2021-12-06T13:40:52+0100 lvl=dbug msg="New task Operation: 4653dcf3-3b33-4d4a-b570-a2a8687876f2"
t=2021-12-06T13:40:52+0100 lvl=dbug msg="Instance operation lock finished" action=create err=nil instance=lxd-move-of-8ab127d5-46d7-4501-95e0-706e29e41678 project=default reusable=false

As well as all the database locking (possibly suggesting slow I/O), the time it’s taking is way too long: 20s from “Creating container” to “Created container”.

This would be enough time (>10s) for the websocket listener to give up and close, causing the error you see.

And in your most recent log file:

t=2021-12-09T20:55:13+0100 lvl=info msg="Creating container" ephemeral=false instance=lxd-move-of-8ab127d5-46d7-4501-95e0-706e29e41678 instanceType=container project=default
t=2021-12-09T20:55:28+0100 lvl=info msg="Created container" ephemeral=false instance=lxd-move-of-8ab127d5-46d7-4501-95e0-706e29e41678 instanceType=container project=default

Again, over 10s, but no DB retries this time.
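
Those gaps can be checked mechanically by parsing the `t=` timestamps from the two log lines (the timestamp format is assumed from the excerpts above; the helper name is made up):

```python
from datetime import datetime

def gap_seconds(ts_start, ts_end):
    """Seconds between two LXD log timestamps like 2021-12-09T20:55:13+0100."""
    fmt = "%Y-%m-%dT%H:%M:%S%z"
    return (datetime.strptime(ts_end, fmt) - datetime.strptime(ts_start, fmt)).total_seconds()

# "Creating container" -> "Created container" in the most recent log:
print(gap_seconds("2021-12-09T20:55:13+0100", "2021-12-09T20:55:28+0100"))  # 15.0
```

Anything over 10s here would be enough for the websocket listener on the other side to give up.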

Something is making that creation process (which is just adding DB records) take a long time, and this is before the migration has even started.

See https://github.com/lxc/lxd/blob/22d06e2b23d7f653264861c271b75f46a627d5a7/lxd/instances_post.go#L315-L323

So I would first check the load situation on the cluster leader and on the target server, as well as any packet loss or rate limiting between the leader and target server that might slow down queries.