Production readiness of an LXD cluster

Neev · January 30, 2020, 1:15pm

I’ve had trouble keeping a cluster up and running for more than one day.
My proof-of-concept setup is composed of three nodes connected to the same switch, running LXD 3.18 (and now 3.19), with Open vSwitch taking over their NICs to give transparent network access to every container (they receive their DHCP configuration from a separate server), and juju to deploy applications.

The problem is that the cluster is unstable: it will not survive a reboot, and sometimes will not survive its own existence.
I’ve had every kind of issue, ranging from core_address being randomly deleted from the local.db database to trouble even starting lxd.
More often than not it was unix socket related, but some odd errors showed up too (some saying they’re not errors).

What have I been doing wrong?
Have any of you met this kind of issue and overcome it (since lxd is supposed to be production ready)?

freeekanayaka · January 30, 2020, 5:31pm

Please check out the 3.20 release coming out today, it has a number of improvements around cluster reboot. If you still have problems with that version, just follow up here with the details.

Neev · February 10, 2020, 9:09am

Hi, back from the field… no dice, sorry.

The cluster survived five days (until Friday), and today when I checked it, all nodes were down: the containers are still running but the daemon doesn’t answer my calls.

Again, the --debug launch of lxd shows the same error:
EROR[02-10|09:57:37] Failed to start the daemon: Listen to cluster address: listen tcp 10.30.0.8:8443: bind: address already in use

And again the database has been tampered with by some dark magic:
INSERT INTO config VALUES(1,'cluster.https_address','10.30.0.8:8443');
INSERT INTO config VALUES(3,'core.https_address','[::]');

Can I set local.db as immutable? Or maybe you can help me find what modifies it?
Thanks in advance.

freeekanayaka · February 10, 2020, 9:52am

Here I think there’s a case we don’t handle: essentially we should detect that [::] and 10.30.0.8:8443 are in fact the same address, and we shouldn’t try to bind 10.30.0.8:8443 since it’s already “covered” by [::] (since we internally expand [::] to [::]:8443, using the default port).

I’ll get this fixed in the code. In the meantime, you can run this on each node having this issue:

lxc config set core.https_address "[::]:8443"

and you should be good.
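
As a rough illustration of the check described above, here is a minimal shell sketch. It is not LXD's actual implementation: the helper names canonical_address and is_covered are hypothetical, and it assumes a dual-stack host where a listener on [::] also accepts IPv4 connections, with 8443 as the default port.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not taken from the LXD code base.
DEFAULT_PORT=8443

# Append the default port when the address carries none.
# Limitation: a bare, unbracketed IPv6 address is not handled.
canonical_address() {
    case "$1" in
        \[*\]) printf '%s:%s\n' "$1" "$DEFAULT_PORT" ;;  # "[::]" becomes "[::]:8443"
        *:*)   printf '%s\n' "$1" ;;                     # already has a port
        *)     printf '%s:%s\n' "$1" "$DEFAULT_PORT" ;;  # bare IPv4 address or hostname
    esac
}

# Succeed when a listener on $1 already covers the address $2,
# i.e. $1 is the [::] wildcard and both use the same port.
is_covered() {
    wildcard=$(canonical_address "$1")
    specific=$(canonical_address "$2")
    wildcard_host=${wildcard%:*}    # strip the trailing ":port"
    wildcard_port=${wildcard##*:}   # keep only the port
    specific_port=${specific##*:}
    case "$wildcard_host" in
        '[::]') [ "$wildcard_port" = "$specific_port" ] ;;
        *)      return 1 ;;
    esac
}

# Example: decide whether a second bind is needed.
# if is_covered '[::]' '10.30.0.8:8443'; then echo 'skip second bind'; fi
```

With this logic, '[::]' and '10.30.0.8:8443' count as the same listener, so the second bind that produces "address already in use" would be skipped.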

freeekanayaka · February 10, 2020, 9:59am

See https://github.com/lxc/lxd/pull/6859 for the fix (will be included in 3.21).

Neev · February 10, 2020, 11:20am

Good to hear!

Tony_Anytime · February 11, 2020, 5:33am

I have had the same problem. It persists in 3.20, but here is what I do to get them running again.
Try this script on all your machines on boot and see what happens:
echo 'Resetting Networking'
systemctl restart systemd-networkd
sleep 10
echo 'Stopping Socket'
systemctl stop snap.lxd.daemon.unix.socket
sleep 10
echo 'Starting Unix Socket'
systemctl start snap.lxd.daemon.unix.socket
sleep 10
echo 'Stopping lxd'
systemctl stop snap.lxd.daemon
echo 'Starting lxd'
systemctl restart snap.lxd.daemon
sleep 10
lxc cluster list

The whole conversation is here… in case this helps.

Neev · February 12, 2020, 9:48am

Yes, thanks for your answer Tony.
I had tried restarting the unix socket service and the lxd daemon too, but with no positive results.
The thing is, the address in the database gets reset, which leaves the socket unable to connect.
2,'core.https_address','10.30.0.8:8443' turns into 3,'core.https_address','[::]' and “lxd --debug --group lxd” displays the error messages shown earlier.

The solution for me, since the daemon doesn’t respond to my calls, is to update the table using sqlite3. After that, I restart the socket and the daemon, and everything is fine.
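
For anyone needing the same workaround, a minimal sketch of that manual update follows. It rests on assumptions not stated in this thread: the snap keeps local.db at /var/snap/lxd/common/lxd/database/local.db, the config table's columns are named key and value, and the function name set_core_address is hypothetical. Back up the file and stop the daemon before touching it.

```shell
#!/bin/sh
# Hypothetical helper: set core.https_address directly in a local.db file.
# Assumes the config table has "key" and "value" columns.
# The address is interpolated into the SQL unescaped, so pass trusted input only.
set_core_address() {
    db=$1
    addr=$2
    sqlite3 "$db" "UPDATE config SET value='$addr' WHERE key='core.https_address';"
}

# Possible usage on one node (the path is an assumption for the snap package):
#   systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket
#   cp /var/snap/lxd/common/lxd/database/local.db /root/local.db.bak
#   set_core_address /var/snap/lxd/common/lxd/database/local.db '10.30.0.8:8443'
#   systemctl start snap.lxd.daemon.unix.socket
#   systemctl start snap.lxd.daemon
```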

The actual issue for me is the instability of the cluster. Knowing how to fix it isn’t an excuse to build your infrastructure on something that will crash if you look at it wrong.

freeekanayaka · February 12, 2020, 10:02am

Assuming that you have changed ‘core.https_address’ from ‘[::]’ to ‘[::]:8443’ (either with a manual SQL query or with the lxc config set command that I had recommended), are you still seeing issues? If yes, please let us know so we can fix them.

As mentioned, 3.21 will have a fix for the ‘[::]’ issue, so changing it to ‘[::]:8443’ will no longer be necessary.

Neev · February 12, 2020, 10:09am

I haven’t changed it to [::]:8443 but to 10.30.0.8:8443 (by the way, the daemon hangs, so sqlite is the only way). And the problem was solved… temporarily. The database seems to like having '[::]' as the value for the core address.

Neev · March 4, 2020, 9:00am

So!

This problem seems to be solved: the cluster survived three whole weeks, seems to have updated itself from 3.20 candidate to 3.21 stable, and was still running until I broke it (it seems nodes do not like being restarted with running containers on them).

Well done!