How to implement "self healing" for stateful containers on LXD clusters?

I was wondering which approach you guys would take to add “self-healing” capabilities to an LXD cluster.

My initial thoughts are:

  • Define a frequent snapshot schedule for all containers.
  • Deploy two HAProxy instances with a heartbeat.
  • Check the health of containers with heartbeat checks.
  • On a failed heartbeat check: a) stop taking snapshots; b) launch the last snapshot for the container.
  • Keep track of the last snapshot used, so you don’t eternally re-load a failing snapshot.
  • Update HAProxy with the new IP of the container.
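
Roughly, the “launch the last snapshot and repoint HAProxy” step could look something like the sketch below; it just shells out to the `lxc` CLI, and the container names, snapshot name, eth0 interface and the fixed wait are all placeholders:

```python
#!/usr/bin/env python3
"""Rough sketch: create a replacement container from the failing container's
last snapshot and print its new IP so the HAProxy backend can be rewritten."""
import json
import subprocess
import time

BROKEN = "dom1-web-1"         # placeholder: the container that failed its check
REPLACEMENT = "dom1-web-1b"   # placeholder: name for the replacement
SNAPSHOT = "snap7"            # placeholder: last snapshot taken before the failure

def run(*cmd):
    subprocess.check_call(list(cmd))

# Create and start a new container from the old container's snapshot.
run("lxc", "copy", f"{BROKEN}/{SNAPSHOT}", REPLACEMENT)
run("lxc", "start", REPLACEMENT)
time.sleep(10)  # crude wait for an address; a real script would poll

# Read the replacement's IPv4 address so HAProxy can be updated.
raw = subprocess.check_output(
    ["lxc", "list", REPLACEMENT, "--format", "json"], text=True
)
info = json.loads(raw)[0]
addresses = info["state"]["network"]["eth0"]["addresses"]
ipv4 = next(a["address"] for a in addresses if a["family"] == "inet")
print(f"update HAProxy backend: {BROKEN} -> {ipv4}")
```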

I would question the usefulness of snapshot / image “self-healing”; my main problems being:

  1. How do you know which snapshot is good / bad?
  2. How do you know snapshot X doesn’t lead to the same problem in N time? (causing an infinite loop of re-create, make available, lose data again)
  3. The data loss from N minutes of change surely matters (in a stateful container)?
  4. How do you define, group & monitor redeployment of containers? (in looping situations)
  5. What if you lose a whole cluster member (can a stateful container be re-deployed to another host?)

Could the container not be re-created from a known “good” state (e.g. a base image like Ubuntu / CentOS) with cloud-init applied on top, as opposed to guessing which snapshots are in a good working state?

How do you mark a snapshot as “last used & good” without manual intervention (should a snapshot be checked manually every N minutes?)

The problems with stateful containers (like MySQL) can easily be removed by setting the data directory to a shared disk that would be available to re-created containers (across many hosts, not just one member).
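
For example (just a sketch; the pool, volume and container names are made up, and sharing across hosts needs a remote storage backend such as Ceph rather than a node-local pool):

```python
# Sketch: keep MySQL's data directory on a custom storage volume that a
# re-created container can simply re-attach. Names are placeholders.
import subprocess

def run(*cmd):
    subprocess.check_call(list(cmd))

# One-off: create the volume (on a shared/remote pool if replacement
# containers may land on a different cluster member).
run("lxc", "storage", "volume", "create", "shared-pool", "dom1-db-data")

# Attach it to the database container as its data directory.
run("lxc", "config", "device", "add", "dom1-db-1", "data", "disk",
    "pool=shared-pool", "source=dom1-db-data", "path=/var/lib/mysql")
```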

Understood @turtle0x1; anyway, let me reply with what I had in mind…

1, 2. How do you know which snapshot is good / bad?
The latest one is good unless it fails; if so, we load the previous one, and so on (let’s say after 5 attempts the process stops and the sysadmin is notified) (more complex options below, based on your suggestions).
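
In (pseudo-)code, the walk-back I have in mind is roughly the following; `healthy()` is whatever probe we settle on, `notify_sysadmin()` stands in for real alerting, and the names and the 30-second wait are placeholders:

```python
# Sketch: try snapshots newest-first, stop after five attempts, alert a human.
import json
import subprocess
import time
import urllib.request

MAX_ATTEMPTS = 5

def healthy(url, timeout=3.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def notify_sysadmin(message):
    print("ALERT:", message)  # placeholder: mail / Slack / pager in real life

def snapshots_newest_first(container):
    raw = subprocess.check_output(
        ["lxc", "query", f"/1.0/instances/{container}/snapshots?recursion=1"],
        text=True,
    )
    snaps = json.loads(raw)
    snaps.sort(key=lambda s: s["created_at"], reverse=True)
    return [s["name"].split("/")[-1] for s in snaps]

def heal(container, health_url):
    for snap in snapshots_newest_first(container)[:MAX_ATTEMPTS]:
        subprocess.call(["lxc", "stop", container, "--force"])  # ok if already stopped
        subprocess.check_call(["lxc", "restore", container, snap])
        subprocess.check_call(["lxc", "start", container])
        time.sleep(30)                      # give the service time to come up
        if healthy(health_url):
            return snap                     # remember this as "last used & good"
    notify_sysadmin(f"{container}: no healthy snapshot after {MAX_ATTEMPTS} attempts")
    return None
```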

3. The data loss from N minutes of change surely matters (in a stateful container)?
It certainly does, but I don’t think this can be prevented with LXD, self-healing or not. HAProxy should be taking care of service continuity and DB replication should take care of data integrity; I just want to restore the “service” to its intended level of performance, Kubernetes style, where you define a number of containers (pods) and K8s keeps an eye on these and makes sure to bring up new containers if required. (Maybe I should check how their algorithm defines healthy and unhealthy.)

4. How do you define, group & monitor redeployment of containers?
The idea is to have a maximum of one container per node for each “service”, so each failing container should be re-created on the same node or on some other node without an instance of that particular service.

Let’s say we have 5 nodes, and assume we host 3 Apache containers per domain and 3 DB containers too; each DB container is “tied” to one particular Apache container (preferably the DB container is on the same host as its corresponding Apache container).
(Or maybe Apache and MySQL should both run within the same container: LXD Performance: Stack per container VS service per container.)
When the dom1.com container dies, a new one should be deployed, either on the same node or, probably even better, on a node which doesn’t yet have an instance of dom1.com.
Afterwards (not urgent, maybe a different script/process) we check whether the DB container is on the same host as its corresponding HTTP container and, if not, we move it.
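
As a sketch of that placement rule (it assumes containers are named after their service, e.g. dom1-web-*, and uses a made-up image alias):

```python
# Sketch: find a cluster member that does not yet run a container for a given
# service and launch the replacement there. Naming convention is an assumption.
import json
import subprocess

def instances():
    raw = subprocess.check_output(["lxc", "list", "--format", "json"], text=True)
    return json.loads(raw)

def members():
    raw = subprocess.check_output(
        ["lxc", "cluster", "list", "--format", "json"], text=True
    )
    return [m["server_name"] for m in json.loads(raw)]

def place_replacement(service, image="ubuntu:22.04"):
    busy = {i["location"] for i in instances() if i["name"].startswith(service)}
    free = [m for m in members() if m not in busy]
    target = free[0] if free else sorted(busy)[0]   # fall back to a busy node
    name = f"{service}-replacement"
    subprocess.check_call(["lxc", "launch", image, name, "--target", target])
    return name, target

# e.g. place_replacement("dom1-web")
```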

NOTE: Now that I lay that structure out, I think it might make sense, in some situations, to consider cloning one of the other two running containers for dom1.com instead of restoring the latest snapshot…

On the DB side, if a container fails, we launch a DB container without data on it and then ask one of the other 2 containers to replicate the data.
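
Something like the following is what I mean by asking another container to replicate the data (only a sketch: hosts and credentials are placeholders, it assumes GTID-based replication, and in practice you would usually seed the new replica from a backup or the clone plugin first):

```python
# Sketch: configure a freshly launched, empty MySQL container to replicate
# from one of the two healthy DB containers. MySQL < 8.0.22 would use
# CHANGE MASTER TO / START SLAVE instead.
import mysql.connector  # pip install mysql-connector-python

replica = mysql.connector.connect(host="dom1-db-new", user="root", password="secret")
cur = replica.cursor()
cur.execute("""
    CHANGE REPLICATION SOURCE TO
        SOURCE_HOST = 'dom1-db-2',
        SOURCE_USER = 'repl',
        SOURCE_PASSWORD = 'repl-secret',
        SOURCE_AUTO_POSITION = 1
""")
cur.execute("START REPLICA")
replica.close()
```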

5. What if you lose a whole cluster member?
Yes, that is a huge concern indeed; that’s why a minimum of 3 (but 5 if possible) nodes would be required.

Could the container not be re-created from a known “good” state?
Taking note of the cloud-init possibility. I guess if we could somehow know what the issue was that caused the disruption, we would be able to determine the best course of action (snapshot vs. cloning an active container vs. cloud-init).
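
For reference, the cloud-init route would be something like this (the image alias and the user-data contents are placeholders; older LXD versions use the `user.user-data` key instead of `cloud-init.user-data`):

```python
# Sketch: re-create the container from a known-good base image and apply
# cloud-init user-data on top, instead of trusting a snapshot.
import subprocess

USER_DATA = """#cloud-config
packages:
  - apache2
runcmd:
  - systemctl enable --now apache2
"""

subprocess.check_call([
    "lxc", "launch", "ubuntu:22.04", "dom1-web-1",
    "--config", f"cloud-init.user-data={USER_DATA}",
])
```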

How do you mark a snapshot as “last used & good” without manual intervention?
HAProxy “checks” (or something similar, maybe just an HTTP request that calls an “am I healthy?” script on the HTTP container, or a simple query for the DB containers), plus some DB to store that data?
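
The “am I healthy?” script could be as small as this (a sketch; the ports and the choice of checks are placeholders):

```python
# Sketch: a tiny health endpoint living inside the container. It answers 200
# only if Apache's port and the local MySQL port both accept connections.
from http.server import BaseHTTPRequestHandler, HTTPServer
import socket

def port_open(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        ok = port_open("127.0.0.1", 80) and port_open("127.0.0.1", 3306)
        self.send_response(200 if ok else 503)
        self.end_headers()
        self.wfile.write(b"ok" if ok else b"unhealthy")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Health).serve_forever()
```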

The problems with stateful containers (like MySQL) can easily be removed by setting the data directory to a shared disk.
Understood, do you know a good guide on how to do that?

Thanks a lot for your detailed answer @turtle0x1!

PS: What about having “dormant” (stopped) copies of a container on the nodes where a given service is not currently running, for an even shorter MTTR?
If dom1.com is running on nodes 1, 3 and 5, we keep stopped clones of that container on nodes 2 & 4; those clones are (supposed to be) healthy instances of the service.
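
As a sketch of keeping those dormant copies around (names and target nodes are placeholders; depending on the LXD version you may need to copy from a snapshot rather than from a running container):

```python
# Sketch: maintain stopped clones of dom1.com's container on the two nodes
# that do not currently run it, so failover is just "lxc start".
import subprocess

SOURCE = "dom1-web-1"                 # healthy, running instance (placeholder)
STANDBY_NODES = ["node2", "node4"]    # members without a live dom1.com instance

for node in STANDBY_NODES:
    standby = f"{SOURCE}-standby-{node}"
    # --refresh transfers only the differences when the standby already exists;
    # the standby stays stopped until it is needed.
    subprocess.check_call([
        "lxc", "copy", SOURCE, standby, "--target", node, "--refresh",
    ])

# On failure of SOURCE, recovery is just starting one of the standbys, e.g.:
#   lxc start dom1-web-1-standby-node2
```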