LXD Clustering: issues with container failover with node failure

ncpe2001 · July 3, 2018, 8:22pm

Three-node LXD 3.01 cluster set up in an AWS environment on Ubuntu 16.04 using the fan bridge. Three containers are launched and distributed across the three nodes. If node prod-b1 is shut down, the container "first" transitions to an error state and does not spawn on either of the other two nodes. Is HA not supported in LXD clustering at this point? If it is supported, is there a reference for the config needed to enable it?

$> lxc list
+--------+---------+----------------------+------+------------+-----------+----------+
|  NAME  |  STATE  |         IPV4         | IPV6 |    TYPE    | SNAPSHOTS | LOCATION |
+--------+---------+----------------------+------+------------+-----------+----------+
| first  | RUNNING | 240.244.0.184 (eth0) |      | PERSISTENT | 1         | prod-b1  |
+--------+---------+----------------------+------+------------+-----------+----------+
| second | RUNNING | 240.101.0.188 (eth0) |      | PERSISTENT | 1         | prod-b2  |
+--------+---------+----------------------+------+------------+-----------+----------+
| third  | RUNNING | 240.204.0.234 (eth0) |      | PERSISTENT | 1         | prod-b3  |
+--------+---------+----------------------+------+------------+-----------+----------+

stgraber · July 3, 2018, 10:44pm

LXD clustering lets you get a unified view of multiple LXD nodes, effectively turning them into one big LXD host.

The database is replicated and highly available, so restarting a node will not interrupt the LXD API: you can still list containers, reconfigure them, spawn new ones, …

But anything that’s directly stored on the node that’s gone away cannot be reached until it’s back online. The remaining nodes don’t have a copy of the container’s data so can’t move and restart it.

The exception to this is if you’re using CEPH as your storage backend. In that case, since your storage is over the network and not tied to any of the nodes, you will be able to move a container from one node to another and restart it there even when the source node has gone offline.
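
A minimal sketch of that manual move, assuming the container "first" from the listing above lives on a CEPH-backed pool and prod-b1 is offline. The `run` and `failover` helpers are hypothetical names used only for illustration. By default the script only prints the commands; set APPLY=1 to actually execute them.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command and only executes it
# when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

# Relocate a container to another cluster node, then start it there.
# Assumption: the container's storage pool uses the ceph driver, so the
# target node can reach its data without the source node.
failover() {
    container="$1"
    target="$2"
    run lxc move "$container" --target "$target"
    run lxc start "$container"
}

# Example: bring "first" up on prod-b2 while prod-b1 is down.
failover first prod-b2
```

Note that this is an operator-driven step, not automatic failover: nothing in the sketch detects the node failure for you.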

ncpe2001 · July 5, 2018, 3:38am

Thanks Stephane for the response, very helpful. Are there any primers on how to set up CEPH as the backend storage? Thanks.

stgraber · July 6, 2018, 5:43am

First you need a working CEPH cluster, which can be a bit of a challenge in itself.
Then make sure all your nodes have working CEPH credentials and config in /etc/ceph.

After that, lxc storage create <pool name> ceph should work assuming the default CEPH cluster name. There are additional properties to specify alternate clusters and PG configurations.
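
As a rough sketch of what that can look like on a cluster: my understanding is that a clustered LXD expects the pool to be defined on each member with --target before the final create call, which is an assumption beyond what is stated above. The pool name "remote" and the `run` and `create_ceph_pool` helpers are hypothetical; `ceph.cluster_name` and `ceph.osd.pg_num` are the optional properties for alternate clusters and PG counts, shown here with example values. The script only prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command and only executes it
# when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

# Define a ceph pool on every listed cluster member, then create it
# cluster-wide. Usage: create_ceph_pool <pool name> <node>...
create_ceph_pool() {
    pool="$1"
    shift
    for node in "$@"; do
        # Per-member definition (assumed to be required on a cluster).
        run lxc storage create "$pool" ceph --target "$node"
    done
    # Final call without --target; the two properties are optional and the
    # values here are examples only.
    run lxc storage create "$pool" ceph ceph.cluster_name=ceph ceph.osd.pg_num=32
}

# "remote" is a made-up pool name; node names come from this thread.
create_ceph_pool remote prod-b1 prod-b2 prod-b3
```

This still assumes every node already has working credentials and config in /etc/ceph, as described above.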

ncpe2001 · July 7, 2018, 11:41pm

I looked through the storage options in the cluster init and CEPH is not listed, only BTRFS, ZFS and DIR. CEPH is only offered as an option when not setting up a LXD cluster. Is there a way around this?

stgraber · July 8, 2018, 12:50am

When setting up a cluster, lxd init will ask for local storage and then for remote storage. CEPH falls into the latter category.

ncpe2001 · July 8, 2018, 12:15pm

Oh yes, found it. I configured a 3-node cluster using remote storage on CEPH. However, when I stop node 1, where the cluster was initialized and the remote storage configured, I lose access to the storage from nodes 2 and 3. When I issue any of the lxc commands on node 2 or 3, I get Error: disk I/O error. How do I configure nodes 2 and 3 to survive on CEPH without node 1? With the cluster intact and healthy, I do have a running container on each node. Your help is very much appreciated.

Chad

ncpe2001 · July 8, 2018, 1:20pm

Never mind, found the issue. This is working great!

adosztal · July 13, 2018, 1:41pm

You might want to share the solution so others with a similar problem won’t run into a dead end.

Sinaly_Diawara · August 2, 2019, 5:47pm

Hi,
Can you please post your procedure for other users in the same situation?

maccumaccu · October 8, 2019, 12:56pm

Hi,
Can you please post your procedure for other users in the same situation?

jhoney · October 14, 2019, 4:18pm

How do I set up CEPH as the backend storage?

bruce78 · October 15, 2019, 6:29am

I think you need at least 3 servers to get HA going?