LXD Clustering: issues with container failover with node failure


(Michael Hoyle) #1

Three-node LXD 3.01 cluster set up in an AWS environment on Ubuntu 16.04 using a fan bridge. Three containers are launched and distributed across the three nodes. If node prod-b1 is shut down, the container "first" transitions to an error state and is not respawned on either of the other two nodes. Is HA not supported in LXD clustering at this point? If it is, is there a reference for the config needed to enable it?

$> lxc list
+--------+---------+----------------------+------+------------+-----------+----------+
|  NAME  |  STATE  |         IPV4         | IPV6 |    TYPE    | SNAPSHOTS | LOCATION |
+--------+---------+----------------------+------+------------+-----------+----------+
| first  | RUNNING | 240.244.0.184 (eth0) |      | PERSISTENT | 1         | prod-b1  |
+--------+---------+----------------------+------+------------+-----------+----------+
| second | RUNNING | 240.101.0.188 (eth0) |      | PERSISTENT | 1         | prod-b2  |
+--------+---------+----------------------+------+------------+-----------+----------+
| third  | RUNNING | 240.204.0.234 (eth0) |      | PERSISTENT | 1         | prod-b3  |
+--------+---------+----------------------+------+------------+-----------+----------+

(Stéphane Graber) #2

LXD clustering lets you get a unified view of multiple LXD nodes, effectively turning them into one big LXD host.

The database is replicated and highly available, so restarting a node will not interrupt the LXD API. You can still list containers, reconfigure them, spawn new ones, …

But anything that’s directly stored on the node that’s gone away cannot be reached until it’s back online. The remaining nodes don’t have a copy of the container’s data so can’t move and restart it.

The exception to this is if you’re using CEPH as your storage backend. In that case, since your storage is over the network and not tied to any of the nodes, you will be able to move a container from one node to another and restart it there even when the source node has gone offline.
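
As a rough sketch of that recovery step, the commands might look like the following. It assumes the container "first" from the listing above lives on CEPH-backed storage, that its node prod-b1 is offline, and that prod-b2 is a surviving member. The wrapper function recover_container and the DRY_RUN variable are hypothetical helpers for illustration only; by default the function just prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper (not part of LXD): relocate a container to a
# surviving cluster member and start it there.
# Prints the commands by default; set DRY_RUN=0 to execute them.
recover_container() {
    name="$1"      # container name, e.g. "first"
    target="$2"    # surviving cluster member, e.g. "prod-b2"
    for cmd in "lxc move $name --target $target" "lxc start $name"; do
        if [ "${DRY_RUN:-1}" = "0" ]; then
            # Stop at the first failing command.
            $cmd || return 1
        else
            echo "$cmd"
        fi
    done
}

recover_container first prod-b2
```

With a local storage backend (BTRFS, ZFS, DIR) this should not be expected to work, since the container's data exists only on the offline node.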


(Michael Hoyle) #3

Thanks, Stéphane, for the response, very helpful. Are there any primers on how to set up CEPH as the backend storage? Thanks.


(Stéphane Graber) #4

First you need a working CEPH cluster, which can be a bit of a challenge in itself.
Then make sure all your nodes have working CEPH credentials and config in /etc/ceph.

After that, lxc storage create <pool name> ceph should work assuming the default CEPH cluster name. There are additional properties to specify alternate clusters and PG configurations.
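
A minimal sketch of that command with two of the additional properties spelled out, under stated assumptions: the pool name "remote" is made up, and "ceph" and 32 are assumed values for the cluster name and placement group count that you should adjust to your own CEPH setup. The function create_ceph_pool and the DRY_RUN variable are hypothetical; by default the command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical helper (not part of LXD): build the storage pool
# creation command for a CEPH-backed pool.
# Prints the command by default; set DRY_RUN=0 to execute it.
create_ceph_pool() {
    pool="$1"             # LXD storage pool name (assumed: "remote")
    cluster="${2:-ceph}"  # CEPH cluster name (assumed value)
    pg_num="${3:-32}"     # placement groups for the OSD pool (assumed value)
    cmd="lxc storage create $pool ceph ceph.cluster_name=$cluster ceph.osd.pg_num=$pg_num"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        $cmd
    else
        echo "$cmd"
    fi
}

create_ceph_pool remote
```

Both properties can be left off when the CEPH cluster uses the default name and the default PG count is acceptable, which reduces the command to the short form quoted above.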


(Michael Hoyle) #5

Looking through the storage options in the cluster init, CEPH is not listed, only BTRFS, ZFS, and DIR. CEPH is only listed as an option when not setting up an LXD cluster. Is there a way around this?


(Stéphane Graber) #7

When setting up a cluster, lxd init will ask for local storage and then remote storage. CEPH falls in the latter category.


(Michael Hoyle) #8

Oh yes, found it. Configured a 3-node cluster using remote storage on CEPH. However, when I stop node 1, where the cluster was initialized and remote storage configured, I lose access to the storage from nodes 2 and 3. When I issue any of the lxc commands on node 2 or 3, I get "Error: disk I/O error". How do I configure nodes 2 and 3 to survive on CEPH without node 1? With the cluster intact and healthy, I do have a running container on each node. Your help is very much appreciated.

Chad


(Michael Hoyle) #9

Never mind, found the issue. This is working great!


(Andras Dosztal) #10

You might want to share the solution so others with a similar problem won’t run into a dead end. :wink: