The Events API (WebSocket Is Not Reliable)

I was surprised that there isn’t any webhook support, since it is easy to implement and makes life easier for consumers. That isn’t much of a problem for me, though, so I started looking for an events endpoint. Fortunately I found one; unfortunately, it is WebSocket-based.

WebSockets have their uses for real-time communication (e.g. the console endpoint), but they are not reliable for message delivery that isn’t real-time in nature. I don’t need real-time communication when an instance is being created; just let me know the request succeeded, and give me an endpoint that exposes all events, which I can distill further later.

With WebSocket, if a message is lost in transit, the server fails to send it, or the client cannot handle the response for some unforeseen reason, is there any retry mechanism?

It would be more beneficial to have an events endpoint that provides a reliable, robust way of retrieving container-related events without relying solely on WebSocket, serving as an alternative to the WebSocket stream.

This could include cursor-based pagination, where events are returned from a specified cursor position, letting you consume events selectively based on your requirements and preferences.

An events endpoint would not only be reliable but also robust. With a dedicated endpoint, consumers would have a centralized source of events, making them easier to track and manage effectively.

Here is a hypothetical example response that returns a list of events that have occurred:

{
  "events": [
    {
      "id": "1",
      "timestamp": "2023-05-28T10:00:00Z",
      "type": "container_created",
      "container_id": "container-1",
      "message": "Container 'container-1' has been created."
    },
    {
      "id": "2",
      "timestamp": "2023-05-28T10:05:00Z",
      "type": "container_started",
      "container_id": "container-1",
      "message": "Container 'container-1' has started."
    },
    {
      "id": "3",
      "timestamp": "2023-05-28T10:10:00Z",
      "type": "container_stopped",
      "container_id": "container-1",
      "message": "Container 'container-1' has stopped."
    }
  ],
  "cursor": "eyJpZCI6Mn0="
}

The cursor field indicates the position of the next set of events. In this case, the cursor value is eyJpZCI6Mn0=. To fetch the next set of events, you would include this cursor value in your subsequent request: GET /events?cursor=eyJpZCI6Mn0=

The response would contain the next set of events along with a new cursor, enabling you to continue paginating through the events.

This gives granular control over which events to retrieve and consume. It does not have to be cursor-based pagination; anything works as long as it supports moving forward and backward through the history and filtering by event type.
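To make that concrete, here is a minimal consumer sketch for such an endpoint. The /events path, the cursor and type query parameters, and the response shape are all hypothetical, matching the example response above rather than any existing LXD API.

# Minimal sketch of paginating a hypothetical /events endpoint.
# The endpoint, query parameters, and response shape are assumptions
# based on the example response above, not an existing LXD API.
import requests

BASE_URL = "https://lxd.example.com:8443"  # hypothetical API server

def iter_events(event_type=None):
    """Yield events page by page, following the returned cursor."""
    cursor = None
    while True:
        params = {}
        if cursor:
            params["cursor"] = cursor
        if event_type:
            params["type"] = event_type  # e.g. "container_created"
        resp = requests.get(f"{BASE_URL}/events", params=params, timeout=10)
        resp.raise_for_status()
        body = resp.json()
        for event in body.get("events", []):
            yield event
        cursor = body.get("cursor")
        if not cursor or not body.get("events"):
            break  # no more pages

# Example: only act on creation events.
for ev in iter_events(event_type="container_created"):
    print(ev["timestamp"], ev["message"])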

@stgraber’s posts here sum things up on this topic:

There’s also the issue of persistence, because what you’re describing would require LXD to persist the events into a store somehow, which is not something we are keen for LXD to morph into.

What could be an approach is that on reconnection you gather the current state of instances and adjust your records appropriately.

If the event consumer were local, the chance of network issues causing disconnects is reduced, and you’re then free to buffer them in whatever way is best for you.
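For what it’s worth, a minimal sketch of that reconcile-on-reconnect idea could look like the following, using lxc list --format json to resynchronise local records after a dropped connection; the known_states dictionary stands in for whatever store the consumer actually keeps.

# Sketch of the reconcile-on-reconnect approach: after a WebSocket
# disconnect, re-read the authoritative instance state with
# "lxc list --format json" and patch up the local records.
import json
import subprocess

known_states = {}  # instance name -> last seen status

def reconcile():
    out = subprocess.run(
        ["lxc", "list", "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    for instance in json.loads(out):
        name, status = instance["name"], instance["status"]
        if known_states.get(name) != status:
            # The instance changed while we were disconnected; record it.
            print(f"{name}: {known_states.get(name)} -> {status}")
            known_states[name] = status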

Well, LXD already uses a store (SQLite/dqlite), and it persists all kinds of data, from instances to images, etc., so adding one that specifically deals with storing events shouldn’t be a problem.

The thing is, having dealt with millions of WebSocket messages, you just can’t get around having stale data; it is inevitable that something will fail, whether on the server side or from the client doing something stupid.

There should be a way to get better introspection. This would not only give the consumer a chance to replay skipped events, audit the event history, and get a guarantee that something actually occurred, but the implementation on both ends is also easy.

I am using LXD for a cloud offering, so this won’t do at scale.

WebSocket is fine for real-time message passing and receiving, and I see its uses; however, it is not reliable, it doesn’t guarantee that a message will be delivered, and I believe naming the endpoint events is a bit misleading.

For the most part, it is overkill to use WebSocket just to check whether a container has changed its status from stopped to started.

You can even build it in such a way that events older than 30 or so days are pruned.

I get that the project can’t satisfy everyone; we all have our preferences and that’s fine. I’m just suggesting something I feel would work better.

Yes, we do store config data, but not log data (which can get large and incur the overhead of lots of concurrent writes which dqlite is not well suited for).

We do support pushing logs into Loki, which I think has some limited retry mechanism in it.

Take a look at the discussion on [LXD] Stream lifecycle and log events to Loki

There is a comment around parsing the local log files or setting up systemd logging which may be of use.


Yeah, this is what I am already doing manually; I will write up the complete steps once I am done. The way it works is that I’ll create a container dedicated to my proposed event method.

I’ll include a small RDBMS (SQLite or any other small one) in the container and create a systemd service that pipes lxc monitor -f json --type=lifecycle into a script in the container, which dumps each event into a table. The good thing is that systemd handles the restarts and the pruning of the events table every so many days, so it doesn’t grow too much.
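As a rough sketch, the script on the receiving end of that pipe could look like this. The table schema, database path, and the 30-day pruning window are placeholder choices of mine, and I’m assuming lxc monitor -f json emits one JSON object per line (adjust the parsing if it doesn’t).

# Sketch of the ingest script, fed by:
#   lxc monitor -f json --type=lifecycle | python3 ingest.py
# Schema, DB path, and 30-day retention are placeholder choices.
import json
import sqlite3
import sys

DB_PATH = "/var/lib/lifecycle/events.db"  # hypothetical location

def main():
    db = sqlite3.connect(DB_PATH)
    db.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
        "  ingested_at TEXT DEFAULT (datetime('now')),"
        "  type TEXT,"
        "  payload TEXT)"
    )
    for line in sys.stdin:  # assuming one JSON event per line
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        db.execute(
            "INSERT INTO events (type, payload) VALUES (?, ?)",
            (event.get("type"), line),
        )
        # Prune anything older than ~30 days so the table doesn't grow unbounded.
        db.execute(
            "DELETE FROM events WHERE ingested_at < datetime('now', '-30 days')"
        )
        db.commit()

if __name__ == "__main__":
    main()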

A consumer on the host can then periodically pull the events out of that table using the exec option directly from the API. I’m still working on it, though; this is just a sketch.
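Something like this could be the periodic pull, again with hypothetical names (the events-collector container and the /var/lib/lifecycle/events.db path from the ingest sketch), and using lxc exec from the host for simplicity rather than the raw REST call.

# Sketch of the periodic pull: read new rows out of the collector container
# over "lxc exec". Container name, DB path, and schema are the placeholders
# from the ingest sketch above.
import json
import subprocess

def fetch_events(since_id):
    query = (
        "SELECT json_object('id', id, 'type', type, 'payload', payload) "
        f"FROM events WHERE id > {int(since_id)} ORDER BY id;"
    )
    out = subprocess.run(
        ["lxc", "exec", "events-collector", "--",
         "sqlite3", "/var/lib/lifecycle/events.db", query],
        capture_output=True, text=True, check=True,
    ).stdout
    return [json.loads(line) for line in out.splitlines() if line.strip()]

# Example: everything newer than the last event we processed.
for ev in fetch_events(since_id=0):
    print(ev["id"], ev["type"])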
