LXD 3.15 Cluster Hangs with Massive Connection Count

As discussed here, I've been unable to use LXD for the past couple of weeks, perhaps coinciding with the update to LXD 3.15 (though I'm not sure). Running any LXC command on the command line simply hangs and never completes. My cluster has three nodes, and all appear affected by the same issue. Looking at the active connections on each machine, each has a massive number of established connections, apparently to the other nodes. For instance:

**aaron@codewerks-alpha**:~$ sudo netstat -antpl | grep 8443 | grep ESTABLISHED | grep lxd | wc -l
9297

**aaron@codewerks-baker**:~$ sudo netstat -antpl | grep 8443 | grep ESTABLISHED | grep lxd | wc -l
8815

**aaron@codewerks-charlie**:~$ sudo netstat -antpl | grep 8443 | grep ESTABLISHED | grep lxd | wc -l
9643

These numbers can climb past 100K when left long enough. On the advice of a helpful fellow on that GitHub issue, I wrote a script to kick LXD over and start fresh. It does clear out the connections, but they immediately begin exploding again, and I'm still never able to run any LXC command.
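
The script itself is nothing clever; it's roughly along these lines, a blunt restart of the snap daemon on every node so the piled-up connections get dropped (it assumes SSH access and passwordless sudo between the nodes):

```bash
#!/bin/bash
# Blunt "kick LXD over" script: restart the snap daemon on every node
# so the piled-up port 8443 connections get dropped, then re-count them.
# Assumes SSH access and passwordless sudo on each node.
set -e

NODES="codewerks-alpha codewerks-baker codewerks-charlie"

for node in ${NODES}; do
    echo "Restarting LXD on ${node}..."
    ssh "${node}" "sudo systemctl restart snap.lxd.daemon"
done

for node in ${NODES}; do
    count=$(ssh "${node}" "sudo netstat -antpl | grep 8443 | grep ESTABLISHED | grep lxd | wc -l")
    echo "${node}: ${count} established LXD connections on port 8443"
done
```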

I can see from other posts here that 3.15 is looking particularly problematic? But I'm at my wits' end. My cluster isn't in production yet, but this is killing my ability to build my application, and I have no idea how to fix it. Any suggestions for troubleshooting steps would be very welcome!

I don’t think we have a way to reproduce the problem yet. The best way forward would be for you to provide us with a detailed list of steps that we can follow to independently reproduce the problem.

I'm super happy to give you whatever information I can to help with this. I'm not sure I can provide the exact steps that led to this situation, but perhaps there are configuration values I can show you that would communicate my current state more effectively? I'm happy to answer any questions you may have.
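
For instance, once the CLI responds at all, I could grab the obvious bits of state with something like:

```bash
# Basic cluster state I could share (these all hang for me right now):
lxc cluster list        # cluster members, their URLs and state
lxc config show         # server-level configuration
snap list lxd           # snap version/revision on this node
sudo ss -s              # socket summary, to go with the netstat counts above
```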

What we really need is some way to cause this on a newly created cluster, so we can 1) reliably reproduce it and 2) try various things to track down where it's coming from and fix it.

So far we've certainly seen reports of this, and I even ran into it a couple of times myself, but we never managed to track down a way to reliably reproduce it. Without that, we can't actually work out a fix.

Understood, sadly. I can say that my cluster is pretty close to a fresh install. Three nodes, each running Ubuntu 18.04.2 LTS. I have auto-updates turned on and I've been keeping current with security patches, rebooting as required. This cluster is what I plan to use as my production system once I complete my app and go to market. In the meantime it's hosting no more than a few containers, with only one active at any given time.

Last weekend, when I first noticed this problem, I was eventually able to break out of it: I spotted a running rsync process on one of the nodes and killed it (my application uses rsync to move files between the client app and a container on the cluster). When I did that, the cluster became available again. But a couple of nights ago it reverted to this hung behaviour, and while I did notice an rsync process again at one point, killing it had no effect. This time the rsync process appeared to be related to LXD itself. Is that the mechanism used for copying between nodes?
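
For what it's worth, here's roughly how I'm checking whether a given rsync process hangs off the LXD daemon rather than my own app (nothing LXD-specific, just walking up one level of parent PIDs):

```bash
# List rsync processes along with their parent process, to tell
# LXD-spawned rsyncs apart from the ones my application starts.
for pid in $(pgrep -x rsync); do
    echo "--- rsync pid ${pid} ---"
    ps -o pid,ppid,cmd -p "${pid}"
    parent=$(ps -o ppid= -p "${pid}" | tr -d ' ')
    ps -o pid,cmd -p "${parent}"   # if this shows lxd, the copy is LXD's own
done
```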

For what it’s worth, here’s the current count on those connections:

aaron@codewerks-alpha:~$ sudo netstat -antpl | grep 8443 | grep ESTABLISHED | grep lxd | wc -l
169377

I understand that my project is a bit unusual: I'm trying to build something like an open-source, decentralized family cloud on small, passively cooled hardware (currently an ODROID HC1). Searching for a solution, LXD for virtualization and Ceph for geographical replication seemed like a good choice.
My priority is not performance but autonomy in software administration.

Back on topic: it would be nice for the autonomy of small businesses everywhere to have decentralized LXD management, and I'm really interested in participating in (and testing) any evolution in that direction.

@rosbeef not sure why you moved the conversation from the github issue to here, it makes the thread a bit harder to follow, since it’s now split.

Anyway, from the logs you posted in the github issue, I suspect that at least one element of the problem might be network latency, which makes raft leadership unstable (hence the “no available dqlite leader server found” error). I can’t be 100% sure of that, but in any case I’d tend to say that we currently don’t support WAN deployments, with cluster members spread across different geographical regions.

I moved because @aaronvegh moved here :wink: but I understand the mess now, so I'll stay here.

In the end, I think it was a misconfiguration on my part in the preseed file on the cluster client side.
I reconfigured all the servers and clients, and nothing bad is happening now.

Thanks for your quick responses. :+1:

I'm not sure where it's best to post these issues. I've been on GitHub's issues and here, but I'll stick with here since it seems @rosbeef is all set.

I had resolved to reinstall LXD this afternoon. I tried running snap remove lxd but got an error message: error: snap "lxd" has "auto-refresh" change in progress.

I took a look at /var/snap/lxd/common/lxd/logs/lxd.log and it contains thousands of lines like this:

t=2019-07-28T18:54:05+0000 lvl=warn msg="Failed connecting to global database (attempt 4655): failed to create dqlite connection: no available dqlite leader server found"
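
In case the numbers are useful, I'm counting those warnings against the snap log path above like this:

```bash
# Count the dqlite leader warnings and peek at the most recent entries.
LOG=/var/snap/lxd/common/lxd/logs/lxd.log
sudo grep -c "no available dqlite leader server found" "${LOG}"
sudo tail -n 5 "${LOG}"
```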

Is any of this interesting data?

Something I just discovered: apparently a cluster must be running identical versions of LXD on all nodes. I ran snap list on each of my three nodes. My primary node shows LXD at v3.15, rev. 11381; the other two nodes are at rev. 11405! And I'm having a hell of a time trying to get my primary node updated.
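
For reference, this is roughly how I'm comparing revisions across the cluster (assuming SSH access to each node):

```bash
# Print the LXD snap version and revision on each cluster member;
# the cluster expects all members to run the identical version.
for node in codewerks-alpha codewerks-baker codewerks-charlie; do
    echo -n "${node}: "
    ssh "${node}" "snap list lxd | awk 'NR==2 {print \$2, \$3}'"
done
```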

aaron@codewerks-alpha:~$ sudo snap refresh lxd
error: snap "lxd" has "auto-refresh" change in progress
aaron@codewerks-alpha:~$ sudo snap changes
ID   Status   Spawn               Ready               Summary
35   Undoing  today at 03:00 UTC  -                   Auto-refresh snap "lxd"
36   Done     today at 19:05 UTC  today at 19:05 UTC  Refresh all snaps: no updates
aaron@codewerks-alpha:~$ sudo systemctl stop snap.lxd.daemon
Job for snap.lxd.daemon.service canceled.
aaron@codewerks-alpha:~$ sudo snap refresh lxd
error: cannot refresh "lxd": refreshing disabled snap "lxd" not supported

I’m continuing to work at this.

Success! Restarted the LXD daemon, managed to update to rev 11405, and then flushed the whole damn thing on every node again. When they came back online I had access to my containers again.
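
For the record, the recovery looked roughly like this (reconstructing from memory, so treat it as a sketch rather than an exact transcript):

```bash
# Rough reconstruction of the recovery on the stuck primary node:
sudo systemctl start snap.lxd.daemon   # bring the daemon back after the earlier stop
sudo snap refresh lxd                  # this time the refresh pulled rev 11405

# Then restart LXD on every node so the members come back in sync.
for node in codewerks-alpha codewerks-baker codewerks-charlie; do
    ssh "${node}" "sudo systemctl restart snap.lxd.daemon"
done
```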

It strikes me, from my observations and from reading around the forum here, that there are still some issues with nodes relying on auto-update mechanics while also depending heavily on staying in sync on the same version. Anyway, I hope my experience is useful in making sure this doesn't happen again!

I wrote too soon!

Moments after sending my last response, the server froze up again. For a brief while I was at least getting an error back:

aaron@codewerks-alpha:~$ lxc list
Error: failed to begin transaction: failed to create dqlite connection: no available dqlite leader server found

But now it’s simply hanging altogether.