Cluster does not come back after shutdown - After reboot cause: "Cannot listen on https socket, skipping..." err="listen tcp 10.0.0.102:8443: bind: cannot assign requested address"

Yeah, we’re on an accelerated schedule after the VM and storage rework, to reduce the amount of time during which we need to cherry-pick a lot of fixes.

Those patches that fix bugs and apply cleanly, I’ll put in 3.19, but if we need custom versions of them, we’ll just wait until 3.20.

Yeah, a couple of weeks is fine. I have my 3.18/3.19 temp fix… even if it takes pumping it till it works for a few minutes. It would be nice to have cluster controls like we have for LXC containers.
Like LXC cluster makeprimemaster MOE, or LXC cluster backupdb, cluster shutdown, and a cluster safe mode which would allow individual LXC controls outside of the cluster, so you can run LXC commands even if the cluster is down. All tools to make cluster management easier.
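
For anyone curious, a "pump it till it works" loop could look roughly like the sketch below. This is just an illustration, not the exact temp fix; the retry count, the 60-second timeout, and the use of lxd waitready are my assumptions.

    #!/bin/sh
    # Hypothetical retry loop: keep restarting the LXD snap until the daemon
    # reports ready (i.e. it managed to reach the global database).
    for attempt in 1 2 3 4 5 6 7 8 9 10; do
        echo "Attempt ${attempt}: restarting lxd..."
        sudo snap restart lxd || true        # may time out while the cluster is stuck
        # Give the daemon up to 60 seconds to come up before retrying.
        if sudo lxd waitready --timeout 60; then
            echo "LXD is up."
            exit 0
        fi
    done
    echo "LXD still not ready after 10 attempts." >&2
    exit 1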

Should the above patches address the issue of constant new connections that I observed?

I realized through more testing that this case manifests under a much simpler scenario:

  1. Stop all cluster nodes using 'snap stop lxd'. Wait for them all to fully stop.
  2. Start a single node using 'snap start lxd'.
  3. Watch lxd.log and you will see:

    t=2020-01-21T11:32:10-0800 lvl=info msg="Initializing global database"
    t=2020-01-21T11:33:34-0800 lvl=warn msg="Failed connecting to global database (attempt 6): failed to create dqlite connection: no available dqlite leader server found"
    t=2020-01-21T11:33:47-0800 lvl=warn msg="Failed connecting to global database (attempt 7): failed to create dqlite connection: no available dqlite leader server found"

  4. Watch network connections increase with:

    sudo watch -n1 'netstat -tnp | grep lxd | wc -l'
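
The four steps above can also be scripted; the node names, SSH access, and the log path (the default LXD snap location) are assumptions here, so treat this as a sketch rather than a drop-in tool.

    #!/bin/sh
    # Sketch of the reproducer above for a three-node cluster reachable over SSH.
    NODES="node01 node02 node03"                  # assumed node names
    LOG=/var/snap/lxd/common/lxd/logs/lxd.log     # default LXD snap log path

    # 1. Stop all cluster nodes and let them fully stop.
    for n in $NODES; do
        ssh "$n" sudo snap stop lxd
    done

    # 2. Start a single node (expect this to hang for about 1 minute, see below).
    ssh node01 sudo snap start lxd || true

    # 3. Follow the "Failed connecting to global database" retries in the log.
    ssh node01 sudo tail -f "$LOG" &
    TAIL_PID=$!

    # 4. Watch the number of LXD network connections grow (Ctrl-C to stop).
    ssh -t node01 "sudo watch -n1 'netstat -tnp | grep lxd | wc -l'"

    kill "$TAIL_PID"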

I also wanted to mention a couple of User Experience (UX) observations related to these cases.

  1. When I run 'snap start lxd' on the single node, the command hangs until its 1-minute timeout and then shows an error. This is rather non-intuitive because the daemon didn’t fail to start; it is just stuck trying to rejoin the cluster. Below is the error shown when 'snap start lxd' fails:

    $ sudo snap start lxd
    2020-01-21T11:33:09-08:00 ERROR # systemctl start snap.lxd.activate.service snap.lxd.daemon.service

    <exceeded maximum runtime of 1m1s>
    error: cannot perform the following tasks:

    • start of [lxd.activate lxd.daemon] (# systemctl start snap.lxd.activate.service snap.lxd.daemon.service

    <exceeded maximum runtime of 1m1s>)

    • start of [lxd.activate lxd.daemon] (exceeded maximum runtime of 1m1s)
  2. When a node is stuck trying to reconnect to a cluster, like above, it doesn’t respond to stop commands. Again, I think this is non-intuitive: if it hasn’t joined up yet, shouldn’t it be able to quickly give up trying and let the daemon exit? Result of a 'snap stop lxd' call in this case:

    $ sudo snap stop lxd
    error: cannot perform the following tasks:

    • stop of [lxd.activate lxd.daemon] (# systemctl stop snap.lxd.activate.service snap.lxd.daemon.service

    <exceeded maximum runtime of 1m1s>)

    • stop of [lxd.activate lxd.daemon] (exceeded maximum runtime of 1m1s)
  3. As an extension of (2), multiple 'snap stop lxd' calls in sequence will result in the same 1-minute timeout for the stop command itself. What isn’t apparent is that the underlying systemctl stop command is still running and will in fact give up after 9 minutes and forcefully kill the running daemon. This becomes apparent from the following messages after 9 minutes:

    $ systemctl status snap.lxd.daemon.service
    ● snap.lxd.daemon.service - Service for snap application lxd.daemon
    Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
    Active: failed (Result: exit-code) since Tue 2020-01-21 11:53:44 PST; 6s ago
    Process: 173346 ExecStop=/usr/bin/snap run --command=stop lxd.daemon (code=exited, status=0/SUCCESS)
    Process: 170377 ExecStart=/usr/bin/snap run lxd.daemon (code=exited, status=1/FAILURE)
    Main PID: 170377 (code=exited, status=1/FAILURE)

    Jan 21 11:32:09 node03 lxd.daemon[170377]: 12: fd: 18: unified
    Jan 21 11:32:09 node03 lxd.daemon[170377]: t=2020-01-21T11:32:09-0800 lvl=warn msg="CGroup memory swap accounting
    Jan 21 11:44:42 node03 systemd[1]: Stopping Service for snap application lxd.daemon…
    Jan 21 11:44:42 node03 lxd.daemon[173346]: => Stop reason is: host shutdown
    Jan 21 11:44:42 node03 lxd.daemon[173346]: => Stopping LXD (with container shutdown)
    Jan 21 11:53:43 node03 lxd.daemon[173346]: ==> Forcefully killing LXD after 9 minutes wait
    Jan 21 11:53:43 node03 lxd.daemon[173346]: => Stopping LXCFS
    Jan 21 11:53:43 node03 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
    Jan 21 11:53:44 node03 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
    Jan 21 11:53:44 node03 systemd[1]: Stopped Service for snap application lxd.daemon.

I feel like the above three cases create panic scenarios for an administrator when there may not actually be a good reason for it. One annoying side effect of the above case is that it will hang a machine shutdown for 9 minutes unnecessarily.
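
For what it’s worth, when case 3 happens you can at least confirm from systemd that the stop is still in progress underneath rather than having failed outright. A quick sketch (unit name taken from the status output above):

    # The 'snap stop' call has timed out, but the systemd stop job is still queued.
    systemctl list-jobs | grep snap.lxd.daemon

    # Follow what the stop hook is doing ("Stopping LXD (with container shutdown)", etc.).
    sudo journalctl -u snap.lxd.daemon.service -f

    # Unit state; "deactivating" means the stop is still running.
    systemctl show snap.lxd.daemon.service -p ActiveState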

Exactly… but let me tell you about panic…
You have 4 servers, which are really 3 because you keep the 4th as a backup. And one server goes bad, needs an upgrade/reboot or whatever. All of a sudden they are all in a deadlock. So then you shut down and reboot all the servers, hoping for them to unstick. Oh, did I mention a hundred containers per server, which means you’re getting lots of phone calls asking why they are down? And you are freaking out and don’t know why they won’t talk to each other. Downtime causes panic. 9 minutes is an eternity; you could do nine one-minute retries in that time. Nothing should take more than 2 minutes, or worst case 4 minutes.


Thanks for the detailed reproducer; I could now work that out, and the leak is fixed here:

https://github.com/lxc/lxd/pull/6750

Thank you for your "little script to make it work". I have a 3-node, IPv6-only cluster myself, and as soon as I reboot a node the cluster dies. During the reboot the two servers left online work just fine, but as soon as the third comes back online it's "Failed to get current cluster nodes: driver: bad connection" every second.

Stopping and starting snap.lxd.daemon.service hangs. Killing processes makes no difference. Your "slow lxd startup script" has just made my cluster come online.

I spent hours yesterday evening trying to fix my cluster, and this morning I woke up to it being magically repaired at 04:09 AM. The logs tell me both the socket and the service were restarted. I had not touched the socket yet.
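
In case it helps anyone checking the same thing, the restart times can be pulled from the journal. This assumes the usual unit names of the LXD snap (snap.lxd.daemon.service and snap.lxd.daemon.unix.socket):

    # Show journal entries for the LXD service and its activation socket
    # around the time the cluster recovered (04:09 in my case).
    sudo journalctl -u snap.lxd.daemon.service -u snap.lxd.daemon.unix.socket \
        --since "04:00" --until "04:15"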

I will patiently wait for the commits mentioned in this topic to go live and fix these issues; I can’t wait to get my cluster up and running with Ceph.
