Ah, so that somewhat changes your earlier statement that “firewall ports are open between the servers.” We were talking about whether the IPv6 address was allowed and, if not, whether you could make your firewall reject rather than drop.
I do not use or allow public IPv6 communication between the servers. The servers are standalone, with an internal switch in between on which only IPv4 is configured.
Public IPv4 and IPv6 addresses are configured on the nodes. I do not use these for copying containers, and communication should not go through the public addresses; that is not allowed by default.
The VLAN (internal switch) is dedicated to container backups with lxc copy.
@stgraber Do you have any idea? I have no idea what has been changed in the past few weeks.
I’ve already explained my working theory as to the problem and the change here
And here
If you have IPv6 addresses bound to the host, LXD will try to use them. If the firewall is then dropping rather than rejecting the request, this can take longer than 10s and cause the websocket listeners to time out.
I was asking if you could set up some IPv6 reject rules to test that theory before we start thinking about a fix in LXD. Does that make sense?
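For example, something like this on the host would test it (a sketch assuming an ip6tables-based firewall; adapt it to whatever firewall you actually use):

# Reject, rather than drop, inbound IPv6 connections to LXD's port 8443,
# so the peer gets an immediate reset instead of waiting for a timeout.
ip6tables -I INPUT -p tcp --dport 8443 -j REJECT --reject-with tcp-reset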
It’s a similar error, but with a different cause than that earlier issue.
The reason you experience these issues is that LXD tries to discover a working set of communication channels by trying all IPs from the source on the destination.
Even before this change (which introduced the timeout), you would likely have experienced a 10 to 20s delay before the transfer started, due to having IPs bound that are not reachable and whose firewall drops rather than rejects.
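To illustrate the difference (hypothetical documentation addresses; -m just caps how long curl waits):

# An IP whose firewall drops the packets stalls until the timeout:
time curl -k -m 15 https://[2001:db8::1]:8443/1.0
# An IP that rejects (or accepts) the connection returns immediately:
time curl -k -m 15 https://192.0.2.1:8443/1.0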
Source node “nazwa1” has IP 172.31.255.6/32
Destination node “atl1” has IP 172.31.255.10/32
They are loopback/dummy interfaces, and LXD should use only them to communicate, as specified in my “lxc config show” posted above; it seems that LXD is using them.
The only IPv6 addresses that I found are in the YAML that, I believe, contains the profile of the moved container.
The weird lines that I found contain addresses configured on my GRE tunnel connecting source and destination: 172.20.20.92/30.
OK, so it sounds like a similar issue, just with extra IPv4 addresses rather than IPv6 addresses like @TomvB has.
Making inbound connections to port 8443 on those other IPs be rejected rather than dropped (assuming LXD on the destination is hitting the source server’s firewall) should fix it temporarily, until I can figure out how to make LXD smarter in these situations.
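As a sketch (iptables assumed; the /30 is the GRE tunnel range you mentioned):

# On the source, reject rather than drop inbound connections to 8443
# on the tunnel addresses, so the destination fails over quickly:
iptables -I INPUT -d 172.20.20.92/30 -p tcp --dport 8443 -j REJECT --reject-with tcp-reset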
One thing that is suspicious in the earlier log you sent from the destination host is:
t=2021-12-06T13:40:30+0100 lvl=info msg="Creating container" ephemeral=false instance=lxd-move-of-8ab127d5-46d7-4501-95e0-706e29e41678 instanceType=container project=default
t=2021-12-06T13:40:38+0100 lvl=dbug msg="Database error: protocol.Error{Code:5, Message:\"database is locked\"}"
t=2021-12-06T13:40:38+0100 lvl=dbug msg="Retry failed db interaction (database is locked)"
t=2021-12-06T13:40:50+0100 lvl=info msg="Created container" ephemeral=false instance=lxd-move-of-8ab127d5-46d7-4501-95e0-706e29e41678 instanceType=container project=default
t=2021-12-06T13:40:51+0100 lvl=dbug msg="Database error: Failed to create operation: database is locked"
t=2021-12-06T13:40:51+0100 lvl=dbug msg="Retry failed db interaction (Failed to create operation: database is locked)"
t=2021-12-06T13:40:52+0100 lvl=dbug msg="New task Operation: 4653dcf3-3b33-4d4a-b570-a2a8687876f2"
t=2021-12-06T13:40:52+0100 lvl=dbug msg="Instance operation lock finished" action=create err=nil instance=lxd-move-of-8ab127d5-46d7-4501-95e0-706e29e41678 project=default reusable=false
As well as all the database locking (possibly suggesting slow I/O), the time it’s taking is way too long: 20s from “Creating container” to “Created container”.
This would be enough time (>10s) for the websocket listener to give up and close, causing the error you see.
So I would first check the load situation on the cluster leader and on the target server, as well as any packet loss or rate limiting between the leader and the target server that might be slowing down queries.
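For example (a rough sketch; substitute your own node addresses):

uptime                          # load averages on the leader and the target
iostat -x 5 3                   # per-device utilisation and await times
ping -c 100 <other-node-ip>     # packet loss and latency between the nodes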
This log is from @kriszos? (Edit: yes, it is.)
Setting core.https_address works for me. This is also fine as a solution; I’ll include it in my documentation.
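For reference, pinning LXD’s listener to the internal address looks something like this (the address here is illustrative):

lxc config set core.https_address 172.31.255.6:8443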
@tomp
As it is not so easy to check which node is the current database leader (I’m grateful that this feature was introduced in 4.21), I did the following tasks:
I stopped the lxd service on all nodes besides the two that I am testing on, so that one of them would become the database leader.
Stopped all running containers.
Checked load on both nodes, which at this point was almost nonexistent.
Checked IOPS with “iostat”, which at this point were also minimal.
Checked random I/O speed on the disks with “fio”; it could be better, but it is nothing that would prevent creating a database record:
source: read 39.1 MiB/s, write 13.1 MiB/s
destination: read 118 MiB/s, write 39.4 MiB/s
Checked connection speed between the nodes using “iperf3”, which was 200 Mb/s.
Checked packet loss with “ping -f”: with a stable 29 ms RTT I lost 2 of 10,000 packets, so no significant packet loss (rough command sketches follow the error below).
Trying to move a container still ends with:
Error: Copy instance operation failed: Failed instance creation: Error transferring instance data: websocket: bad handshake
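For reference, roughly the commands behind those checks (invocations are illustrative; the exact fio parameters I used are not shown):

lxc cluster list                 # since 4.21 the ROLES column shows database-leader
fio --name=randrw --filename=/tmp/fio.test --rw=randrw --bs=4k --size=1G --ioengine=libaio --direct=1
iperf3 -s                        # on one node
iperf3 -c 172.31.255.10          # on the other
ping -f -c 10000 172.31.255.10   # flood ping; the summary reports loss and RTT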