LXD process on the database leader is quite demanding

henning · June 8, 2022, 9:27am

Hi there,

first of all: LXD is great. Thanks for your hard work!

I’m running an LXD cluster with 15 nodes. All nodes are KVM VMs running Ubuntu 22.04 with the latest LXD installed via snap.

There are ~950 relatively small containers on the cluster, all with the same software installed (Apache, PHP, Asterisk). All containers are running but idle.

lxc list takes about 50 seconds to return a result.

In the lxc list output, containers on about half of the nodes are shown in ERROR state, although they are running and working correctly.
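To see which cluster members the ERROR entries come from, the CSV output of lxc list can be tallied per member. This is a minimal sketch, assuming the column codes L (location) and s (status) and the csv format behave as documented for lxc list. The function name count_status_by_member is hypothetical.

```shell
# Read "location,status" CSV rows on stdin and print "location,status,count".
count_status_by_member() {
    awk -F, '{ n[$1 "," $2]++ } END { for (k in n) print k "," n[k] }' | LC_ALL=C sort
}

# Usage on a cluster member (requires the lxc client):
#   lxc list --format csv -c Ls | count_status_by_member
```

A member that reports only ERROR rows while its containers are reachable would point at the status query to that member failing or timing out, rather than at the containers themselves.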

Load on the database leader is around 2. top shows lxd consuming more than 100% CPU most of the time. Load on the other nodes is around 0.5.

I noticed high outgoing network traffic on the database leader. iftop shows almost 2 GB of data sent to other nodes in the cluster within 1 minute.
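For scale, almost 2 GB in one minute is roughly 33 MB/s, or about 267 Mbit/s sustained. The sketch below does that arithmetic and samples an interface's transmit counter to get the same figure without iftop. The function names and the interface name eth0 are hypothetical. /sys/class/net/<iface>/statistics/tx_bytes is the standard Linux counter.

```shell
# Convert a byte count over an interval into whole Mbit/s (decimal units).
bytes_to_mbit() {
    bytes=$1
    seconds=$2
    echo $(( bytes * 8 / seconds / 1000000 ))
}

# Read the kernel's transmit counter for an interface twice and report
# the average outgoing rate over the interval in between.
tx_rate_mbit() {
    iface=$1
    seconds=$2
    before=$(cat "/sys/class/net/$iface/statistics/tx_bytes")
    sleep "$seconds"
    after=$(cat "/sys/class/net/$iface/statistics/tx_bytes")
    bytes_to_mbit $(( after - before )) "$seconds"
}

# Usage:
#   bytes_to_mbit 2000000000 60   # the figure from the post above
#   tx_rate_mbit eth0 60          # eth0 is an assumed interface name
```

Running tx_rate_mbit on the leader and on a non-leader member over the same interval gives a simple before/after number to compare once anything in the setup changes.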

After running systemctl reload snap.lxd.daemon on the database leader, load and network traffic go down and another node becomes the leader, which then shows the same symptoms (rising load, lxd CPU usage > 100 %, high outgoing network traffic).

Is there anything I can do to improve my installation or is that expected behavior?

tomp · June 9, 2022, 10:25am

Which version of LXD are you using? LXD 5.2 had some database optimisations to try to reduce cross-cluster DB traffic and the number of queries.

tomp · June 9, 2022, 10:25am

Also, are all your cluster members in the same network segment, with low latency between them?

tomp · June 9, 2022, 10:26am

And does lxc list --fast work OK?

henning · June 10, 2022, 3:00pm

Thanks for your reply!

lxd --version
5.2

Tested the list command again. This time lxc list took 60 seconds and reported most containers in ERROR state, while lxc list --fast took only 6 seconds and reported the correct state. Cool hack, thanks!
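The comparison is easy to repeat after each change to the cluster. This is a minimal sketch with a hypothetical helper, elapsed_seconds, that reports whole wall-clock seconds for any command. The likely reason for the gap, offered as an assumption rather than something confirmed in this thread, is that --fast restricts the output to columns that do not need live state from every instance.

```shell
# Print the whole wall-clock seconds a command takes; its output is discarded.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

# Usage on a cluster member (requires the lxc client):
#   elapsed_seconds lxc list
#   elapsed_seconds lxc list --fast
```

Because the resolution is one second, this suits the tens-of-seconds timings seen here and not sub-second comparisons.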

All nodes are on the same network. The VMs are distributed over 3 hardware nodes in the same rack, connected to each other at 10 Gbit/s.

This morning I noticed that one of the nodes is running on slow HDD storage while all other nodes are running on SSD network storage. This is a mistake, which I will fix this weekend before trying again. I will post the result here.

Have a good weekend everybody!

Cheers
Henning