Lxc move error websocket: bad handshake

Yeah, it works!


This log is from @kriszos? (Edit: yes, it is.)
Setting core.https_address works for me. I’m fine with this being the solution, and I’ll include it in my documentation.

FYI:

host-01

config:
  core.https_address: 192.168.1.1:8443
  core.trust_password: true

host-02

config:
  core.https_address: 192.168.1.2:8443
  core.trust_password: true


@TomvB I am glad your issues are resolved.

@tomp
As it is not easy to check which node is the current database leader (I’m grateful that feature is introduced in 4.21), I did the following:

  • Stopped the lxd service on all nodes except the 2 that I am testing on, so one of them can become the database leader.
  • Stopped all running containers.
  • Checked load on both nodes, which at this point was almost nonexistent.
  • Checked IOPS with “iostat”, which at this point was also minimal.
  • Checked random I/O speed on the disks with “fio”, which could be better but is nothing that would prevent creating a database record:
    source: read 39.1MiB/s, write 13.1MiB/s
    destination: read 118MiB/s, write 39.4MiB/s
  • Checked connection speed between the nodes using “iperf3”, which was 200Mb/s.
  • Checked packet loss with “ping -f”: with a stable 29ms RTT I lost 2/10000 packets, so no significant packet loss.

Trying to move a container still ends with:
Error: Copy instance operation failed: Failed instance creation: Error transferring instance data: websocket: bad handshake

I have no idea what else it could be.

Can you show the output of lxc config show <instance> --expanded for the container you’re trying to move?

Also, if you create a new instance on the target cluster member does it also take a long time?

Creating a new instance takes 29 seconds on the target node, is near-instantaneous (1 second at most) on the source node, and takes 9-14 seconds on the other nodes.

architecture: x86_64
config:
  cluster.evacuate: migrate
  image.architecture: amd64
  image.description: Debian bullseye amd64 (20211213_06:39)
  image.os: Debian
  image.release: bullseye
  image.serial: "20211213_06:39"
  image.type: squashfs
  image.variant: cloud
  user.network-config: |
    version: 1
    config:
    - type: physical
      name: enp5s0
      subnets:
      - type: static
        ipv4: true
        address: 10.1.2.99
        netmask: 32
        gateway: 10.1.1.1
        control: auto
      - type: static6
        ipv6: true
        accept-ra: false
        address: 2001:470:71:fdf::99
        netmask: 128
        gateway: 2001:470:65ec:1011::1
        control: auto
    - type: nameserver
      address:
      - 2001:470:71:fdf::12
      - 10.1.2.12
      search:
      - ad.kriszos.pl
  user.vendor-data: |
    #cloud-config
    #package_upgrade: true
    packages:
    - openssh-server
    - nano
    timezone: Europe/Warsaw
    disable_root: false
    #ssh_pwauth: yes
    users:
    - name: root
      ssh_authorized_keys:
      - ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEAwAIXmVfMRgo7EQS/u7kTJ0s0Mr8FZiTfbEsdNhXOUKUuwSyidAo7zuIurEf+xaHLhzW3nfPqhTHZFRbbWORKRkbJsxCzBKr4mxOo/mZ2f0AFqZrTWdxR31aun2Ql3lZPoYAfA+NrfzFVnwAPLzC4jEcnvC+4J/fF0tsRBKSiRQCYR8rBTQWTPvfy2ZipLJn0DgBwCjWXUveJ1M/DI47+W3dgNaM48aWW8Po/UMNIJlsa6+fa4TRdjB1z/HjjTPg0XHgV7W30kPfxhnv86pkO4x9wE4TZBhHGxJotaoe511reCoc6DTMyv6SXg07VlrXI2I/8W1OV/IC+KJwSRpPZUw== rsa-key-20190814
      #lock_passwd: false
      #hashed_passwd: $1$GFy7utiu$Lkmt3eXTBpST8pkzTJCqY1
      shell: /bin/bash
    runcmd:
    #bellow_for_passwords
    #- sed -i -e '/^#PermitRootLogin/s/^.*$/PermitRootLogin yes/' /etc/ssh/sshd_config
    #- systemctl reload ssh
    #
    #below_for_centos
    - ping -c 5 10.1.1.1
    - ping -c 5 2001:470:65ec:1011::1
    - systemctl start sshd
    - systemctl enable sshd
    #power_state:
    #mode: reboot
    #condition: true
  volatile.apply_template: create
  volatile.base_image: 38c4676659ae9dddd19747f2fd92a37798a90f5d09e4d2e49fa615cdb35bf6a7
  volatile.eth0.hwaddr: 00:16:3e:c8:15:cb
  volatile.idmap.base: "0"
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '
  volatile.uuid: 7e6f16ab-4319-4e51-8540-e1f6da1a79a1
devices:
  eth0:
    ipv4.routes.external: 10.1.2.99/32
    ipv6.routes.external: 2001:470:71:fdf::99/128
    name: enp5s0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: local
    type: disk
ephemeral: false
profiles:
- def
- ip-99
stateful: false
description: ""

What storage pool driver are you using?

Can you show the debug log output when you try and create an instance on the target member?

zfs

I can’t tell from the logs, but there is certainly some performance problem affecting operations on the target server.

For example, these sorts of errors indicate either contention or slow operations on the database:

t=2021-12-13T23:37:32+0100 lvl=dbug msg="Retry failed db interaction (Error adding configuration item \"volatile.eth0.hwaddr\" = \"00:16:3e:30:e5:59\" to instance 800: database is locked)" 

Also, the time between these 2 entries should be effectively zero, as they are just DB operations, but on your system it’s 3 seconds.

t=2021-12-13T23:37:26+0100 lvl=info msg="Creating container" ephemeral=false instance=test1 instanceType=container project=default
t=2021-12-13T23:37:29+0100 lvl=dbug msg="FillInstanceConfig started" driver=zfs instance=test1 pool=local project=default

For comparison, on a 3-member cluster inside VMs on my own system, I see this when creating an instance on a non-leader member:

Dec 14 09:21:36 v1 lxd.daemon[997]: t=2021-12-14T09:21:36+0000 lvl=info msg="Creating container" ephemeral=false instance=c1 instanceType=container project=default
Dec 14 09:21:36 v1 lxd.daemon[997]: t=2021-12-14T09:21:36+0000 lvl=dbug msg="FillInstanceConfig started" driver=dir instance=c1 pool=local project=default
Dec 14 09:21:36 v1 lxd.daemon[997]: t=2021-12-14T09:21:36+0000 lvl=dbug msg="FillInstanceConfig finished" driver=dir instance=c1 pool=local project=default

Given that the timeout for the target to connect back to the source to start the migration is 10s, losing 3s just on DB record creation indicates there is some performance issue.
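
As a rough sketch, the gap between two log entries like the ones from the target server can be measured with a small shell function. log_gap_seconds is a hypothetical helper name, not an LXD tool, and it assumes both t=... timestamps fall on the same day with the same UTC offset.

```shell
# Hypothetical helper (not part of LXD): print the number of seconds between
# the first two t=YYYY-MM-DDTHH:MM:SS timestamps found on stdin.
# Assumption: both entries are from the same day and the same UTC offset.
log_gap_seconds() {
  awk '
    match($0, /t=[0-9-]+T[0-9]+:[0-9]+:[0-9]+/) {
      ts = substr($0, RSTART, RLENGTH)      # e.g. t=2021-12-13T23:37:26
      split(ts, parts, "T")                 # parts[2] holds HH:MM:SS
      split(parts[2], hms, ":")
      secs[++n] = hms[1] * 3600 + hms[2] * 60 + hms[3]
    }
    END { if (n >= 2) print secs[2] - secs[1] }
  '
}

# The two entries from the target server quoted in this thread (shortened):
printf '%s\n' \
  't=2021-12-13T23:37:26+0100 lvl=info msg="Creating container"' \
  't=2021-12-13T23:37:29+0100 lvl=dbug msg="FillInstanceConfig started"' |
  log_gap_seconds   # prints 3
```

A result of 3 out of a 10s connect-back window leaves noticeably less room for the rest of the migration setup.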

My recommendation would be to upgrade to LXD 4.21 for the following reasons:

  • It reduces the amount of leader DB activity when idle - so if your systems are resource constrained this may help reduce contention.
  • It allows you to see which member is the leader via lxc cluster ls.

I’ve had the same issue since mid-November (a similar environment to the poster’s: two hosts, not in a cluster, running LXD 4.21, with the sending host having multiple network cards), and solved it by applying

lxc config set core.https_address <ip address>:8443

on the sender.

Great to hear that works, as that is the correct way to configure LXD to avoid delays and (now) timeouts when not all of the IPs that LXD is listening on are reachable from all hosts.
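
For anyone scripting this, a minimal sketch of setting the address and reading it back might look like the following. pin_lxd_address is a hypothetical helper name, the default port 8443 is simply the one used in this thread, and the function assumes an IPv4 address that the other host can actually reach.

```shell
# Hypothetical helper (not part of LXD): pin the LXD API to one address.
# $1 = IPv4 address reachable from the other host
# $2 = port (optional; defaults to 8443, the port used in this thread)
pin_lxd_address() {
  addr="$1"
  port="${2:-8443}"
  # Tell LXD to listen on this specific address and port.
  lxc config set core.https_address "${addr}:${port}"
  # Read the value back to confirm what the server now has configured.
  lxc config get core.https_address
}

# Example, to be run on the sending host (address taken from host-01 above):
# pin_lxd_address 192.168.1.1
```

The example call is left commented out so that pasting the block does not change a running server by accident.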