Operations expiry window

Hey,

Would it be possible to override the default 5s expiry with something like 30s or 60s?

Greetings.

Why do you need that?

Let’s say you have network issues: packet loss, an inbound DDoS attack, etc.
In that case, I am unable to connect to the endpoint within 5 seconds.

By the time I reach the endpoint, the operation is gone and I don’t know whether it completed or failed. Nor do I want to rely on /wait in this case.

The recommended way to handle this is to connect to /1.0/events?type=operations prior to issuing asynchronous operations in LXD. This guarantees you can’t miss events and keeps the number of API calls to a minimum, as you only need to connect once regardless of the number of operations.
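
For illustration, a minimal sketch of that pattern using the LXD Go client; the import paths and exact signatures vary between releases (e.g. github.com/lxc/lxd/client vs github.com/canonical/lxd/client), so treat this as an assumption-laden outline rather than verified code:

```go
package main

import (
	"fmt"
	"log"

	lxd "github.com/canonical/lxd/client"
	"github.com/canonical/lxd/shared/api"
)

func main() {
	// Connect over the local unix socket (use lxd.ConnectLXD for HTTPS).
	c, err := lxd.ConnectLXDUnix("", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Subscribe to operation events BEFORE issuing anything asynchronous,
	// so a completion event can never be missed.
	listener, err := c.GetEvents()
	if err != nil {
		log.Fatal(err)
	}

	_, err = listener.AddHandler([]string{"operation"}, func(e api.Event) {
		// Metadata carries the operation state (UUID, status, errors, ...).
		fmt.Printf("operation event: %s\n", string(e.Metadata))
	})
	if err != nil {
		log.Fatal(err)
	}

	// ... issue the asynchronous operations here ...

	// Block until the event connection drops; reconnect logic would go here.
	if err := listener.Wait(); err != nil {
		log.Fatal(err)
	}
}
```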

How reliable would using a websocket be compared to polling /operations?
If I poll /operations, I always know what status the operation has; if the connection times out, I know it’s a failure and I call it again to get my response.
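
Roughly, the polling approach I mean, sketched with the LXD Go client; GetOperation is real but the retry and backoff values here are my assumptions for illustration:

```go
import (
	"fmt"
	"time"

	lxd "github.com/canonical/lxd/client"
	"github.com/canonical/lxd/shared/api"
)

// waitByPolling is a sketch of the poll-and-retry loop described above.
// It assumes a connected lxd.InstanceServer "c" (see the events example
// earlier in the thread).
func waitByPolling(c lxd.InstanceServer, uuid string) (*api.Operation, error) {
	for attempt := 0; attempt < 30; attempt++ {
		op, _, err := c.GetOperation(uuid)
		if err != nil {
			// Timeout or transport error: back off and query again.
			// Caveat: if the operation has already expired server-side,
			// retrying can never recover its result.
			time.Sleep(2 * time.Second)
			continue
		}
		if op.StatusCode.IsFinal() {
			// Success, Failure or Cancelled: we have our answer.
			return op, nil
		}
		time.Sleep(time.Second)
	}
	return nil, fmt.Errorf("operation %s: no final status observed", uuid)
}
```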

When using a websocket, how reliable is it under packet loss of up to 70-80% and latencies as high as 2000ms? If I set proper timeouts, it should be fine? And TCP should take care that nothing gets lost, I hope.

Why does your network have such high levels of loss and latency?

The provider seems to be having more issues lately; they are likely getting DDoSed more often.
Yes, I could move the backend to a different network/provider to resolve or reduce these issues.

However, it would be good to have a more robust system, so that whenever this happens it is possible to keep finished operations around a bit longer.

Can you add an option to set a custom timeout on operations?

Right, the websocket is still HTTPS and uses TCP, so it will be resilient to packet loss and won’t cause events to disappear within an established connection.

I’m wary of adding an option to keep operations around just to work around a crappy network, as it’s not insignificant work to put a new config option in place and to test and validate it going forward.

The reason for the 5s expiry is that operations typically keep a lot of references to internal structures, which in turn hold references to file descriptors and internal connections. Keeping operations around for a prolonged time can therefore hold on to a lot of memory and a lot of open files. Initially this was bad enough for busy clusters (10k+ instances) to hit the kernel’s file descriptor limit.

So it would be an option, I see.

Well, I did not deploy this setup into an unstable network on purpose; it became unstable.
Can the provider fix it? Possibly. Could it happen again? Yes.
I don’t want to make you “fix” my issue here; maybe I phrased it badly. It’s the provider’s responsibility to fix their network in this case, or mine to migrate.

However, making LXD more robust with such a setting, just in case, would be nice, so that these issues are mitigated by default.

In the end, I need to add third-party dependencies just to make websockets work, which, some could argue, makes the application less reliable and adds dependencies for such a simple thing.

Makes sense; however, I don’t see how this would conflict with making it possible to increase the limit.