I use LXD to host containers for running CI jobs, which sometimes take ~10 hours to complete. A frequent failure mode is ‘lxc exec’ dropping out with an error along these lines:
Error: read tcp 10.50.72.13:51440->10.246.64.5:8080: read: connection timed out
I’m not sure if this is due to underlying network issues or something else; perhaps the lxd snap being restarted?
Should this be reliable? Is this a reasonable use case for lxc exec? Any ideas what might be happening?
A timeout is most likely network related. The snap restarting would not cause a timeout; it would cause a connection reset.
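To make the timeout versus reset distinction concrete, here is a minimal Python sketch that classifies a socket error by its errno. The function name `classify_disconnect` and the wording of the returned labels are made up for illustration; this is not something LXD provides, only a way to reason about which of the two cases a log line points to.

```python
import errno


def classify_disconnect(exc: OSError) -> str:
    """Map a socket error to the likely cause discussed above (hypothetical helper)."""
    if exc.errno == errno.ETIMEDOUT:
        # "read: connection timed out": the peer stopped answering,
        # which usually points at the network path.
        return "timeout (likely network)"
    if exc.errno == errno.ECONNRESET:
        # "connection reset by peer": the other end closed abruptly,
        # which is what a daemon restart would look like.
        return "reset (peer restarted or closed)"
    return "other"
```

The error quoted in the question ("read: connection timed out") falls into the first branch.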
The length of an exec session isn’t a problem as far as LXD is concerned, though you will probably want to schedule your snap refreshes for a time when you’re unlikely to have long-running sessions in progress.
LXD doesn’t immediately terminate exec sessions on a refresh, but it only waits a few minutes before giving up and canceling them so it can restart.
We switched from using lxc exec to ssh and haven’t seen any more issues. If it is a network issue, ssh seems to be much more resilient to it. I wonder if it’s setting different socket options than lxc exec does.
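For reference, OpenSSH enables TCP keepalive on its connections by default (the TCPKeepAlive option), which is one kind of socket option that can matter for long idle sessions. Whether the LXD client sets the same options is not something this thread establishes. The sketch below only illustrates what such options look like on a plain socket; the helper name `enable_keepalive` and the timing values are arbitrary assumptions, and the TCP_KEEP* options are platform-specific, so they are applied only where available.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive for sock (hypothetical helper, illustrative values)."""
    # Ask the kernel to send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Tune the probe timing where the platform exposes these options (e.g. Linux).
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the connection is dropped
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

Keepalive probes keep some traffic flowing on a quiet connection, which can stop intermediate devices such as NAT gateways or firewalls from silently dropping their state for it.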