The Events API (WebSocket Is Not Reliable)

I was surprised that there isn’t any webhook support, since it is easy to implement and makes life easier for consumers. That isn’t much of a problem for me, though, so I started looking for an events endpoint. Fortunately I found one; unfortunately, it is WebSocket-based.

WebSockets have their uses for real-time communication (e.g. the console endpoint), but they are not reliable for message delivery that isn’t real-time in nature. I don’t need real-time communication when an instance is being created; just let me know the request succeeded, and give me an endpoint that exposes all events, which I can distill further later.

With WebSocket, if a message is lost in transit, the server fails to send it, or the client cannot handle the response for some unforeseen reason, is there any retry mechanism?

It would be more beneficial to have an events endpoint that provides a reliable, robust way of retrieving container-related events without relying solely on WebSocket, serving as an alternative to the WebSocket stream.

This could include cursor-based pagination, where events are returned from a specified cursor position, letting you consume events selectively based on your requirements and preferences.

An events endpoint would not only be reliable but also robust. With a dedicated endpoint, consumers would have a centralized source of events, making them easier to track and manage effectively.

Here is a hypothetical example response that returns a list of events that have occurred:

{
  "events": [
    {
      "id": "1",
      "timestamp": "2023-05-28T10:00:00Z",
      "type": "container_created",
      "container_id": "container-1",
      "message": "Container 'container-1' has been created."
    },
    {
      "id": "2",
      "timestamp": "2023-05-28T10:05:00Z",
      "type": "container_started",
      "container_id": "container-1",
      "message": "Container 'container-1' has started."
    },
    {
      "id": "3",
      "timestamp": "2023-05-28T10:10:00Z",
      "type": "container_stopped",
      "container_id": "container-1",
      "message": "Container 'container-1' has stopped."
    }
  ],
  "cursor": "eyJpZCI6Mn0="
}

The cursor field indicates the position of the next set of events. In this case, the cursor value is eyJpZCI6Mn0=. To fetch the next set of events, you would include this cursor value in your subsequent request: GET /events?cursor=eyJpZCI6Mn0=

The response would contain the next set of events along with a new cursor, enabling you to continue paginating through the events.

This gives granular control over which events to retrieve and consume. It does not have to be cursor-based pagination; anything works as long as it supports moving forward and backward through the history and filtering by event type.
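To make that concrete, here is a minimal consumer sketch for such an endpoint. The /events path, the cursor and type query parameters, and the response shape are all hypothetical, matching the example response above rather than any existing LXD API.

# Minimal sketch of paginating a hypothetical /events endpoint.
# The endpoint, query parameters, and response shape are assumptions
# based on the example response above, not an existing LXD API.
import requests

BASE_URL = "https://lxd.example.com:8443"  # hypothetical API server

def iter_events(event_type=None):
    """Yield events page by page, following the returned cursor."""
    cursor = None
    while True:
        params = {}
        if cursor:
            params["cursor"] = cursor
        if event_type:
            params["type"] = event_type  # e.g. "container_created"
        resp = requests.get(f"{BASE_URL}/events", params=params, timeout=10)
        resp.raise_for_status()
        body = resp.json()
        for event in body.get("events", []):
            yield event
        cursor = body.get("cursor")
        if not cursor or not body.get("events"):
            break  # no more pages

# Example: only act on creation events.
for ev in iter_events(event_type="container_created"):
    print(ev["timestamp"], ev["message"])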

@stgraber’s posts here sum things up on this topic:

There’s also the issue of persistence, because what you’re describing would require LXD to persist the events into a store somehow, which is not something we are keen for LXD to morph into.

What could be an approach is that on reconnection you gather the current state of instances and adjust your records appropriately.

If the event consumer were local, the chance of network issues causing disconnects is reduced, and you’re then free to buffer them in whatever way is best for you.
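For what it’s worth, a minimal sketch of that reconcile-on-reconnect idea could look like the following, using lxc list --format json to resynchronise local records after a dropped connection; the known_states dictionary stands in for whatever store the consumer actually keeps.

# Sketch of the reconcile-on-reconnect approach: after a WebSocket
# disconnect, re-read the authoritative instance state with
# "lxc list --format json" and patch up the local records.
import json
import subprocess

known_states = {}  # instance name -> last seen status

def reconcile():
    out = subprocess.run(
        ["lxc", "list", "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    for instance in json.loads(out):
        name, status = instance["name"], instance["status"]
        if known_states.get(name) != status:
            # The instance changed while we were disconnected; record it.
            print(f"{name}: {known_states.get(name)} -> {status}")
            known_states[name] = status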

Well, LXD already uses a store (SQLite/dqlite), and it persists all kinds of data, from instances to images, etc., so adding one that specifically deals with storing events shouldn’t be a problem.

The thing is, having dealt with millions of WebSocket messages, you just can’t get around having stale data; it is inevitable that something will fail, whether on the server side or from the client doing something stupid.

There should be a way to get better introspection. This would not only give the consumer a chance to replay skipped events, audit the event history, and get a guarantee that something actually occurred, but the implementation on both ends is also easy.

I am using LXD for a cloud offering, so this won’t do at scale.

WebSocket is fine for real-time message passing and receiving, and I see its uses; however, it is not reliable, it doesn’t guarantee that a message will be delivered, and I believe naming the endpoint events is a bit misleading.

For the most part, it is overkill to use WebSocket just to check whether a container has changed its status from stopped to started.

You can even build it in such a way that events older than 30 or so days are pruned.

I get that the project can’t satisfy everyone; we all have our preferences and that’s fine. I’m just suggesting something I feel would work better.

Yes, we do store config data, but not log data (which can get large and incur the overhead of lots of concurrent writes which dqlite is not well suited for).

We do support pushing logs into Loki, which I think has some limited retry mechanism in it.

Take a look at the discussion on [LXD] Stream lifecycle and log events to Loki

There is a comment around parsing the local log files or setting up systemd logging which may be of use.


Yeah, this is what I am already doing manually; I will write up the complete steps once I am done. The way it works is that I’ll create a container dedicated to my proposed event method.

I’ll include a small RDBMS (SQLite or any other small one) in the container and create a systemd service that pipes lxc monitor -f json --type=lifecycle into a script in the container, which dumps each event into a table. The good thing is that systemd handles the restarts and the pruning of the events table every so many days, so it doesn’t grow too much.
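As a rough sketch, the script on the receiving end of that pipe could look like this. The table schema, database path, and the 30-day pruning window are placeholder choices of mine, and I’m assuming lxc monitor -f json emits one JSON object per line (adjust the parsing if it doesn’t).

# Sketch of the ingest script, fed by:
#   lxc monitor -f json --type=lifecycle | python3 ingest.py
# Schema, DB path, and 30-day retention are placeholder choices.
import json
import sqlite3
import sys

DB_PATH = "/var/lib/lifecycle/events.db"  # hypothetical location

def main():
    db = sqlite3.connect(DB_PATH)
    db.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
        "  ingested_at TEXT DEFAULT (datetime('now')),"
        "  type TEXT,"
        "  payload TEXT)"
    )
    for line in sys.stdin:  # assuming one JSON event per line
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        db.execute(
            "INSERT INTO events (type, payload) VALUES (?, ?)",
            (event.get("type"), line),
        )
        # Prune anything older than ~30 days so the table doesn't grow unbounded.
        db.execute(
            "DELETE FROM events WHERE ingested_at < datetime('now', '-30 days')"
        )
        db.commit()

if __name__ == "__main__":
    main()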

A consumer on the host can then periodically pull the events out of that table using the exec option directly from the API. I’m still working on it, though; this is just a sketch.
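Something like this could be the periodic pull, again with hypothetical names (the events-collector container and the /var/lib/lifecycle/events.db path from the ingest sketch), and using lxc exec from the host for simplicity rather than the raw REST call.

# Sketch of the periodic pull: read new rows out of the collector container
# over "lxc exec". Container name, DB path, and schema are the placeholders
# from the ingest sketch above.
import json
import subprocess

def fetch_events(since_id):
    query = (
        "SELECT json_object('id', id, 'type', type, 'payload', payload) "
        f"FROM events WHERE id > {int(since_id)} ORDER BY id;"
    )
    out = subprocess.run(
        ["lxc", "exec", "events-collector", "--",
         "sqlite3", "/var/lib/lifecycle/events.db", query],
        capture_output=True, text=True, check=True,
    ).stdout
    return [json.loads(line) for line in out.splitlines() if line.strip()]

# Example: everything newer than the last event we processed.
for ev in fetch_events(since_id=0):
    print(ev["id"], ev["type"])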
