Suggestion on running 8 node cluster

There are a few things to consider:

  1. Dqlite (and by extension LXD clustering) isn’t really designed for operating over high-latency links, so what is the latency between the sites?

  2. You may also consider using Failure Domains so that the cluster roles are distributed across the two sites (see "Linux Containers - LXD - Has been moved to Canonical" and "LXD 4.4 has been released" for example usage).

With this, you can tell the LXD database which systems are likely to go offline at the same time so it can make better decisions when electing a leader or promoting cluster members to different database roles.

  3. Take a look at "Linux Containers - LXD - Has been moved to Canonical" for advice on configuring the max_voters and max_standby settings. It could be that all of your voters are in site A, so when you shut it down you lose the majority of your voters (which would prevent the cluster from working). This is where Failure Domains can come in useful.
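To make the majority point concrete, here is a rough sketch of the arithmetic. It is a simplified model I'm assuming for illustration, not LXD's or Dqlite's actual logic: it ignores standbys being promoted and any role handover on a clean shutdown. The function name and the site layouts are hypothetical.

```python
def has_quorum(voter_sites, offline_site):
    """Return True if the voters outside `offline_site` are still a
    strict majority of all voters (simplified Raft-style quorum check)."""
    total = len(voter_sites)
    remaining = sum(1 for site in voter_sites if site != offline_site)
    return remaining > total // 2


# Hypothetical layouts: each entry is the site a voter lives in.
all_in_a = ["A", "A", "A"]   # every voter in site A
split = ["A", "A", "B"]      # voters spread over both sites

print(has_quorum(all_in_a, "A"))  # False: no voters left
print(has_quorum(split, "B"))     # True: 2 of 3 voters remain
print(has_quorum(split, "A"))     # False: only 1 of 3 voters remains
```

Note the last case: with only two sites, whichever site holds more voters still takes the majority with it when it goes down, so it is worth checking where your voters actually are before shutting a site down.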

Also, it would be useful to see logs from the remaining nodes when you shut down the first site so we can see what is going on, as at this point I don’t have much info to look at.

@mbordere do you have any suggestions? Thanks