Connection timed out during long-running lxc exec

I use LXD to host containers for running CI jobs, which sometimes take ~10 hours to complete. A frequent failure mode is ‘lxc exec’ dropping out with an error along these lines:

Error: read tcp> read: connection timed out

I’m not sure if this is due to underlying network issues or something else; perhaps the lxd snap being restarted?

Should this be reliable? Is this a reasonable use case for lxc exec? Any ideas what might be happening?

A timeout is most likely network-related. A snap restart would not cause a timeout; it would cause a connection reset.
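To make the timeout vs. reset distinction concrete, here is a minimal Python sketch that sorts a socket error by its errno. The function name `classify_disconnect` and the labels it returns are made up for illustration; this is not how lxc itself reports errors, only a way to see that the two cases surface as different error codes.

```python
import errno


def classify_disconnect(exc: OSError) -> str:
    """Return a rough label for why a TCP connection died (hypothetical helper)."""
    if exc.errno == errno.ETIMEDOUT:
        # The peer stopped answering: typical of a network path problem.
        return "timeout"
    if exc.errno == errno.ECONNRESET:
        # The peer actively closed the connection, e.g. a daemon restart.
        return "reset"
    return "other"
```

If you wrap your CI job's connection handling in something like this, a "timeout" label points at the network, while a "reset" label would be more consistent with the daemon going away.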

The length of an exec session isn’t really an issue as far as LXD is concerned, though you will probably want to schedule your snap refreshes for a time when you’re unlikely to have long sessions running.

LXD doesn’t immediately terminate exec sessions on a refresh, but it only waits a few minutes before giving up and canceling them so that it can restart.

We switched from lxc exec to ssh and haven’t seen any more issues. If it is a network issue, ssh seems much more resilient to it. I wonder if it’s setting different socket options than lxc does.
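For anyone curious what "different socket options" could look like, here is a rough Python sketch of enabling TCP keepalive on a socket. This is only an assumption about the kind of option that might matter; I have not checked what either ssh or lxc actually sets. The helper name `enable_keepalive` and the timing values are arbitrary, and the `TCP_KEEP*` tuning constants are platform-specific, so they are guarded.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for sock (hypothetical helper, arbitrary timings)."""
    # Ask the kernel to send probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Fine-grained tuning is not available on every platform, so check first.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        # Seconds between probes, and number of failed probes before giving up.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

With keepalive on, an idle connection carries periodic traffic, which can stop stateful firewalls or NAT devices from silently dropping it during a long, quiet job. Whether that explains the difference we saw is a guess.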