3 node cluster hangs when 2 nodes down

I just installed a 3-node cluster with the snap LXD package (v3.3) and Ceph remote storage.
I can kill one LXD node without problems, and commands like lxc list still work fine. However, when I kill a second node in the cluster, LXD becomes unresponsive and finally gives up with a 500 error:

#time lxc list
Error: Failed to fetch http://unix.socket/1.0: 500 Internal Server Error

real 1m10.078s
user 0m0.037s
sys 0m0.019s

Restarting the LXD daemon with systemctl restart snap.lxd.daemon didn't help.
Even with just 1 node of the cluster still active, I should expect it to return its own results, right?

Edit: when I bring 1 LXD node back, normal behavior is restored :)

No. The database refuses to work unless it has consensus, and it can't reach consensus unless a majority of the database nodes are running, in this case 2 of the 3.

It’s normal behavior for a distributed database to avoid a split-brain situation.
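
To make the arithmetic concrete, here is a minimal sketch of the majority rule in Python. It only illustrates the principle and is not LXD's actual code; the names quorum_size and has_quorum are made up for this example, and it assumes all 3 cluster members are database nodes.

```python
def quorum_size(voters: int) -> int:
    """Smallest strict majority of `voters` database nodes (hypothetical helper)."""
    return voters // 2 + 1


def has_quorum(voters: int, alive: int) -> bool:
    """True if the nodes still running form a strict majority (hypothetical helper)."""
    return alive >= quorum_size(voters)


if __name__ == "__main__":
    voters = 3  # assumption: the 3-node cluster from the question, all database nodes
    for alive in range(voters, -1, -1):
        state = "responsive" if has_quorum(voters, alive) else "blocked (no quorum)"
        print(f"{alive}/{voters} database nodes up: {state}")
```

With 3 database nodes the majority is 2, so losing one node is tolerated but losing two is not. The last node standing cannot tell whether the other two are really down or just unreachable, which is why it refuses to answer on its own.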


Wow, that was quick ;)
Thanks for the feedback. I was starting to think about the quorum needed for Ceph as well, so it is indeed the same principle.