Suggestion on running an 8-node cluster

Hi Everyone,

I just wanted to find out what the best approach is to run an LXD cluster across multiple nodes in different locations.

Location A
vm1 192.168.1.1
vm2 192.168.1.2
vm3 192.168.1.3
vm4 192.168.1.4

The layer 2 network is stretched across to another location:

Location B
vm5 192.168.1.5
vm6 192.168.1.6
vm7 192.168.1.7
vm8 192.168.1.8

Now, I’ve joined them all to the same cluster. If I shut down the VMs in a location gracefully one by one (shut down vm1, vm2, vm3, vm4), the LXD cluster is happy. But when I power off all the VMs in Location A at the same time (vm1, vm2, vm3, vm4), the cluster becomes unusable: lxc list hangs and I cannot recover it. When I start vm1, vm2, vm3, vm4 again, the cluster happily returns and everything is hunky-dory.

I did the following:
lxc config set cluster.max_voters 7
lxc config set cluster.max_standby 5

I played with different variations but had no luck; the cluster becomes really unhappy when I perform the above. The idea is that Location A will have all of our containers running and Location B will be our DR site, so when A has a big issue/outage we can fire up containers in Location B.
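For reference, before powering a site off it can help to confirm how those settings and the database roles actually landed; a sketch (the DATABASE column in the listing shows which members hold a database role):

```shell
# Check the effective role limits set earlier.
lxc config get cluster.max_voters
lxc config get cluster.max_standby

# List cluster members; the DATABASE column shows who holds a database role.
lxc cluster list
```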

Would the only option be to run “lxd cluster recover-from-quorum-loss” on the 4 VMs in Location B? We just do not want an unusable cluster in Location B if Location A becomes unavailable.

Any advice/suggestions really would be appreciated.

Thank you in advance.

Kind Regards,
Jonathan

So there are a few things to consider:

  1. Dqlite (and by extension LXD clustering) isn’t really designed for operating over high-latency links, so what is the latency across the sites?

  2. You may also consider using Failure Domains so that the cluster roles are distributed across the two sites (see the Clustering | LXD documentation and the LXD 4.4 release announcement for example usage).

With this, you can tell the LXD database which systems are likely to go offline at the same time so it can make better decisions when electing a leader or promoting cluster members to different database roles.

  3. Take a look at Clustering | LXD for advice on configuring the max_voters and max_standby settings. It could be that all of your voters are in site A, and when you shut it down you lose the majority of your cluster (which prevents it from working). This is where Failure Domains can come in useful.
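As a sketch of the Failure Domains approach (assuming LXD 4.4 or later, where each cluster member’s YAML gains a failure_domain field; the member name and domain name here are illustrative):

```shell
# Open the member's YAML and change its failure_domain, e.g. from
#   failure_domain: default
# to
#   failure_domain: siteA
lxc cluster edit vm1

# The same edit piped non-interactively (sketch):
lxc cluster show vm1 | sed 's/failure_domain: default/failure_domain: siteA/' | lxc cluster edit vm1

# Verify via the FAILURE DOMAIN column:
lxc cluster list
```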

Also, it would be useful to see logs from the remaining nodes when you shut down the first site so we can see what is going on, as at this point I haven’t got much info to look at.

@mbordere do you have any suggestions? Thanks

Hi,

It does indeed look like you have 4 voters in location A and 3 in location B. When you take location A down, you no longer have a majority to promote the spare node in location B to voter, and your cluster is left without a leader. You will have the same problem with every 2-location deployment, I think: you will only be able to take down the location that doesn’t have a majority located within itself. If you want to be able to take down a whole location of nodes, I would suggest using 3 locations. Making use of Failure Domains as @tomp suggests, you could make sure that there are 2 locations with 2 voters and 1 location with 3 voters. That way, when you take any one location down, you will still have at least 4 voters left, a majority (still based on a maximum of 7 voters), and your cluster will remain operational.
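The arithmetic behind the 2/2/3 suggestion can be checked with a few lines of plain shell (no LXD involved): with a maximum of 7 voters, a majority needs at least 4, and losing any one site of a 2/2/3 layout leaves 5, 5 or 4 voters:

```shell
#!/bin/sh
# Quorum arithmetic for a 2/2/3 voter layout across three sites.
total=7                      # cluster.max_voters
needed=$(( total / 2 + 1 ))  # majority of 7 voters = 4
for down in 2 2 3; do        # voters lost when each site goes offline
  left=$(( total - down ))
  if [ "$left" -ge "$needed" ]; then
    echo "lose $down voters -> $left left: quorum OK"
  else
    echo "lose $down voters -> $left left: quorum LOST"
  fi
done
```

With the 4/3 split across two sites, by contrast, taking down the 4-voter site leaves only 3 of 7, below the majority.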


Great, thanks @mbordere. Yes indeed: because of the possibility of a network breakage between the two sites (resulting in a partition between them), it would always be necessary to have quorum at a single site (so that one site can continue if the other is unreachable, and the other site knows it doesn’t have quorum). That in turn introduces the possibility that if you turn off the site that has quorum, the other site doesn’t know whether that is due to a cluster partition or an outage, and it cannot continue to operate.

So as @mbordere says, when doing multi-site deployments, you then effectively need to treat each site as a node and ensure you can get quorum between sites by using an odd number of them (3 at minimum).

I’m interested in how you were planning to utilise your failover site, given that the instances themselves from site A are not replicated to site B. Assuming the cluster DB itself was available (which, given the discussion above, it won’t be), what use would an LXD site be without the instances from site A?

Thank you Mathieu and Thomas for your replies; it’s much appreciated.

My initial thought was to have 2 LXD clusters, one in site A and one in site B. We would deploy containers in site A and get them up and running, then deploy a “dummy” container with the same configuration on site B. We would then use syncoid to send incremental ZFS snapshots to site B:

Location A - LXD Cluster A - containerx
Location B - LXD Cluster B - containerx

In Location A we would run:
lxc launch ubuntu:20.04 containerx --target testvm1.test-03.example.com

In Location B we would run:
lxc launch ubuntu:20.04 containerx --target testvm1.test-04.example.com
zfs destroy ZPOOL/lxc/containers/containerx

Location A will then use syncoid to send containerx from testvm1.test-03.example.com to testvm1.test-04.example.com

From testvm1.test-03.example.com
syncoid ZPOOL/lxc/containers/containerx root@testvm1.test-04.example.com:ZPOOL/lxc/containers/containerx
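To keep site B’s copy current, the syncoid run above would typically be repeated on a schedule; a sketch (the 15-minute interval and the syncoid path are illustrative, and the dataset and host names are the hypothetical ones used above):

```shell
# Crontab entry on testvm1.test-03.example.com: resend snapshots of
# containerx to site B every 15 minutes (sketch). syncoid only
# transfers the delta since the last common snapshot.
*/15 * * * * /usr/sbin/syncoid ZPOOL/lxc/containers/containerx root@testvm1.test-04.example.com:ZPOOL/lxc/containers/containerx
```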

There was a request to see whether we could instead have 1 cluster across both sites, which I was not opposed to, but it was not my initial thought, which was offsite replication to site B.

My takeaway from your feedback is that we should not have a single LXD cluster across both sites?

I was also looking at and testing failure domains, which yielded the same result: cluster breakage. I am wondering: should I be testing with the stable LTS branch (currently using lxd 4.14 20450), or should I be upgrading to 4.4 and testing that?

+-----------------------------+---------------------------+----------+--------------+----------------+-------------+--------+-------------------+
|            NAME             |            URL            | DATABASE | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+-----------------------------+---------------------------+----------+--------------+----------------+-------------+--------+-------------------+
| testvm1.test-03.example.com | https://172.16.48.20:8443 | YES      | x86_64       | france         |             | ONLINE | Fully operational |
+-----------------------------+---------------------------+----------+--------------+----------------+-------------+--------+-------------------+
| testvm2.test-03.example.com | https://172.16.48.30:8443 | YES      | x86_64       | france         |             | ONLINE | Fully operational |
+-----------------------------+---------------------------+----------+--------------+----------------+-------------+--------+-------------------+
| testvm3.test-03.example.com | https://172.16.48.31:8443 | YES      | x86_64       | france         |             | ONLINE | Fully operational |
+-----------------------------+---------------------------+----------+--------------+----------------+-------------+--------+-------------------+
| testvm4.test-03.example.com | https://172.16.48.32:8443 | NO       | x86_64       | france         |             | ONLINE | Fully operational |
+-----------------------------+---------------------------+----------+--------------+----------------+-------------+--------+-------------------+
| testvm5.test-03.example.com | https://172.16.48.33:8443 | YES      | x86_64       | germany        |             | ONLINE | Fully operational |
+-----------------------------+---------------------------+----------+--------------+----------------+-------------+--------+-------------------+
| testvm6.test-03.example.com | https://172.16.48.34:8443 | YES      | x86_64       | germany        |             | ONLINE | Fully operational |
+-----------------------------+---------------------------+----------+--------------+----------------+-------------+--------+-------------------+
| testvm7.test-03.example.com | https://172.16.48.35:8443 | YES      | x86_64       | germany        |             | ONLINE | Fully operational |
+-----------------------------+---------------------------+----------+--------------+----------------+-------------+--------+-------------------+
| testvm8.test-03.example.com | https://172.16.48.36:8443 | YES      | x86_64       | germany        |             | ONLINE | Fully operational |
+-----------------------------+---------------------------+----------+--------------+----------------+-------------+--------+-------------------+

Latency across the sites is about 10ms.

I am not sure if it’s worth pursuing this idea, or just going with 2 separate clusters.

Thank you again for your feedback/suggestions/information.

Kind Regards,
Jonathan

Sorry, meant upgrade to 4.15…

You’ll need to have 3 distinct sites, with the cluster DB roles spread over all of them (using failure domains), in order to be able to lose an entire site and keep the cluster running.