Lxc move error websocket: bad handshake

tomp · December 14, 2021, 9:23am

I can’t tell the exact cause from the logs, but there is certainly some performance problem affecting operations on the target server.

For example, these sorts of errors indicate either contention or slow operations on the database:

t=2021-12-13T23:37:32+0100 lvl=dbug msg="Retry failed db interaction (Error adding configuration item \"volatile.eth0.hwaddr\" = \"00:16:3e:30:e5:59\" to instance 800: database is locked)"
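
To get a rough idea of how often this happens, something like the following could be used to pull the "database is locked" retries out of a saved debug log. This is only a sketch: the helper name `locked_retries` is made up for illustration, and it assumes the log is plain text in the format shown above, one entry per line.

```python
def locked_retries(lines):
    """Return the log lines that report a retried DB interaction
    which failed because the database was locked."""
    return [
        line.rstrip("\n")
        for line in lines
        if "Retry failed db interaction" in line and "database is locked" in line
    ]


# The entry quoted above, plus an unrelated entry from the same log.
sample = [
    r't=2021-12-13T23:37:32+0100 lvl=dbug msg="Retry failed db interaction (Error adding configuration item \"volatile.eth0.hwaddr\" = \"00:16:3e:30:e5:59\" to instance 800: database is locked)"',
    't=2021-12-13T23:37:26+0100 lvl=info msg="Creating container" ephemeral=false instance=test1 instanceType=container project=default',
]

hits = locked_retries(sample)
print(len(hits), "locked-database retries found")
```

For a real log you would pass an open file object instead of the `sample` list. A handful of hits over a long period is not unusual; many hits clustered around a single operation points to contention.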

Also, the time between these two entries should be effectively zero, as they are just DB operations, but on your system it's 3 seconds:

t=2021-12-13T23:37:26+0100 lvl=info msg="Creating container" ephemeral=false instance=test1 instanceType=container project=default
t=2021-12-13T23:37:29+0100 lvl=dbug msg="FillInstanceConfig started" driver=zfs instance=test1 pool=local project=default
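
If you want to check this gap on other operations, here is a minimal sketch that computes the time between two log entries from their `t=` fields. The helper names `log_time` and `gap_seconds` are hypothetical, and it assumes the timestamp format seen in these logs (second resolution, numeric UTC offset).

```python
import re
from datetime import datetime

# Match the "t=" field at the start of the line or after whitespace, so a
# journald prefix ("Dec 14 09:21:36 v1 lxd.daemon[997]: ") is tolerated too.
TS_RE = re.compile(r"(?:^|\s)t=(\S+)")


def log_time(line):
    """Parse the t=<timestamp> field of a log line into an aware datetime."""
    match = TS_RE.search(line)
    if match is None:
        raise ValueError("no t= timestamp found in line")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S%z")


def gap_seconds(first, second):
    """Seconds elapsed between two log lines."""
    return (log_time(second) - log_time(first)).total_seconds()


created = 't=2021-12-13T23:37:26+0100 lvl=info msg="Creating container" ephemeral=false instance=test1 instanceType=container project=default'
fill_started = 't=2021-12-13T23:37:29+0100 lvl=dbug msg="FillInstanceConfig started" driver=zfs instance=test1 pool=local project=default'

print(gap_seconds(created, fill_started))
```

Because the logs only have one-second resolution, a healthy system will usually show a gap of 0 here.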

For comparison, on a 3-member cluster running inside VMs on my own system, I see this when creating an instance on a non-leader member:

Dec 14 09:21:36 v1 lxd.daemon[997]: t=2021-12-14T09:21:36+0000 lvl=info msg="Creating container" ephemeral=false instance=c1 instanceType=container project=default
Dec 14 09:21:36 v1 lxd.daemon[997]: t=2021-12-14T09:21:36+0000 lvl=dbug msg="FillInstanceConfig started" driver=dir instance=c1 pool=local project=default
Dec 14 09:21:36 v1 lxd.daemon[997]: t=2021-12-14T09:21:36+0000 lvl=dbug msg="FillInstanceConfig finished" driver=dir instance=c1 pool=local project=default

Given that the timeout for the target to connect back to the source to start the migration is 10s, losing 3s just on DB record creation indicates there is some performance issue.

My recommendation would be to upgrade to LXD 4.21 for the following reasons:

It reduces the amount of leader DB activity when idle - so if your systems are resource constrained this may help reduce contention.
It allows you to see which member is the leader via lxc cluster ls.