Cluster does not comeback after shutdown- After Reboot Cause - Cannot listen on https socket, skipping..." err="listen tcp 10.0.0.102:8443: bind: cannot assign requested address"

Ok, So I delete everything and reinstall all four servers. I used LXD 3.19 just for testing to see if it would do same thing.
I was able to bring down one of the database server and bring it up, no problem, but had to do a systemctl restart snap.lxd.daemon to make it become part of cluster again.
I was able to do the same with 2 and 3, both db servers, shutting them down, and starting ok. Had to restart service and eventually they would get back to working.
The fourth server, did the same thing, except this time when I would do restart, nothing would happen.
The Cluster is dead, all say same thing.
systemctl start lxd.socket
Failed to start lxd.socket: Unit lxd.socket not found.

This is the same problem that is all over the forum.
Seems to be communication issue of database .
root@QG3:/home/ic2000# lxd --debug --group lxd
INFO[01-17|22:52:27] LXD 3.19 is starting in normal mode path=/var/snap/lxd/common/lxd
INFO[01-17|22:52:27] Kernel uid/gid map:
INFO[01-17|22:52:27] - u 0 0 4294967295
INFO[01-17|22:52:27] - g 0 0 4294967295
INFO[01-17|22:52:27] Configured LXD uid/gid map:
INFO[01-17|22:52:27] - u 0 1000000 1000000000
INFO[01-17|22:52:27] - g 0 1000000 1000000000
INFO[01-17|22:52:27] Kernel features:
INFO[01-17|22:52:27] - netnsid-based network retrieval: no
INFO[01-17|22:52:27] - uevent injection: no
INFO[01-17|22:52:27] - seccomp listener: no
INFO[01-17|22:52:27] - seccomp listener continue syscalls: no
INFO[01-17|22:52:27] - unprivileged file capabilities: yes
INFO[01-17|22:52:27] - cgroup layout: hybrid
WARN[01-17|22:52:27] - Couldn’t find the CGroup memory swap accounting, swap limits will be ignored
INFO[01-17|22:52:27] - shiftfs support: no
INFO[01-17|22:52:27] Initializing local database
DBUG[01-17|22:52:27] Initializing database gateway
DBUG[01-17|22:52:27] Start database node id=3 address=10.0.0.103:8443
EROR[01-17|22:52:27] Failed to start the daemon: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000001095-0000000000001153: entries batch 74 starting at byte 785000: entries count in preamble is zero
INFO[01-17|22:52:27] Starting shutdown sequence
DBUG[01-17|22:52:27] Not unmounting temporary filesystems (containers are still running)
Error: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000001095-0000000000001153: entries batch 74 starting at byte 785000: entries count in preamble is zero

It dies here. These poor guys can see the other servers
DBUG[01-17|22:54:24] Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.102:8443/internal/database: dial tcp 10.0.0.102:8443: connect: no route to host address=10.0.0.102:8443 id=1 attempt=0
DBUG[01-17|22:54:24] Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.104:8443/internal/database: dial tcp 10.0.0.104:8443: connect: connection refused address=10.0.0.104:8443 id=1 attempt=0
DBUG[01-17|22:54:24] Dqlite: server connection failed err=failed to establish network connection: Head https://10.0.0.103:8443/internal/database: dial tcp 10.0.0.103:8443: connect: connection refused address=10.0.0.103:8443 id=1 attempt=0

At this point, I can reinstall all the servers and get it working again…but if I shut them down, same thing will happen. It loses the unix.socket thingy. There is obviously something very wrong here.

I could be something unique to my install, but it is just a fresh copy of ubuntu18.04 will all the upgrades before I install LXD on Supermicro servers. Nothing special not even a RAID.
Any help would be helpful since I have to move a large cluster, and I have spent 2 week already debugging what should have taken a day.

I spent another bunch of hours rebuilding and testing cluster today and I have narrowed down to more specifics.
The problem is related to individual machines losing there ability to communicate with each other over /var/snap/lxd/common/lxd/unix.socket. If it only one machine down, either database machine or not, I can get it working again pretty easily with systemctl restart snap.lxd.daemon.
However, if two or more are down, it can take along time and many times to get it working.
If I bring down all four, it takes 5-10 minutes of
systemctl restart systemd-networkd
systemctl restart snap.lxd.daemon.unix.socket
systemctl stop snap.lxd.daemon
systemctl restart snap.lxd.daemon
to make them work again.

There is catch 22 situation going on, the unix.socket is unavailable, and then cluster server are unavailable and then it gets stuck in a loop.

I can get it running but it is not very elegant to be hitting with a hammer. This seems to be a problem that has been around since early versions of LXD.
The whole cluster master, voting and retry system is pretty intolerant of any thing as simple as a reboot. I hope 3.20 is better, because this is the weakest part of LXD.
I hope my countless hours of playing with this thing help someone down the line.

I am going to chime in and say that I have also seen these quirky behaviors at least two separate times when I had to shutdown an entire 3 machine cluster for power outages.

Cluster is running the following snap version:

$ lxc version
Client version: 3.18
Server version: 3.18

After all of the machines were back up and running I encountered issues with ‘lxc list’ and ‘lxc cluster list’ commands. I have seen explicitly this error message too:

Error: Failed to execute query: driver: bad connection

I am not sure what ‘driver: bad connection’ means specifically. My best guess is some sort of stale or half closed connection that the code isn’t effectively detecting or re-opening. I doubt the error message is pointing at any other networking related issue because the same 3 machines are hosting a Ceph cluster that comes up without issue into a HEALTH_OK state.

Thinking about this aloud, what could lead to a bad connection? Does LXD have any sort of timeout or grace period during which all cluster nodes must return or it will fail? In other words, what would happen if say 1 out of 3 nodes didn’t come back online for 2-3 hours after the first two?

Hopefully I can help come across some more meaningful steps to reproduce the issue.

Regarding:

Error: Failed to execute query: driver: bad connection

that’s returned if one of the following errors happens 5 times in a row with a retry interval of 250 milliseconds:

  • current leader loses leadership
  • network connection to current leader is broken

The error is transient if a majority of cluster nodes is up, because they will eventually elect a new healthy leader (so ‘lxc list’ and ‘lxc cluster list’ will eventually stop returning that error).

If 1 out of 3 nodes doesn’t come back online fr 2-3 hours after the first two, the cluster will still be available, meaning that you can run ‘lxc list’ or ‘lxc cluster list’ on any of the two online nodes (or using any of them as remote). All other commands not requiring the offline node will work as well (e.g. you won’t be able to start a container on the offline node of course).

I definitely just caused the same issue with ‘snap stop lxd’ and ‘snap start lxd’ commands. I don’t have an exact sequence that caused it, but I am seeing a couple of alarming symptoms. The below happened with 1 of 3 machines always being kept “offline” by stopping the snap service.

First of all, the log on one node is getting flooded with:

lvl=warn msg=“Failed to get current cluster nodes: driver: bad connection”

Based on netstat output the lxd instance seems to be on a downward spiral of opening connections to itself:

$ sudo netstat -tnp | grep lxd | head -n25
tcp 0 0 10.0.5.190:8443 10.0.5.190:53054 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.190:57018 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:57156 10.0.5.190:8443 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.190:55486 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:52732 10.0.5.190:8443 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.190:56320 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.203:39202 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.203:35496 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.203:37820 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:53904 10.0.5.190:8443 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.190:54498 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.203:36720 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:56488 10.0.5.190:8443 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.190:56352 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.203:37160 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.203:38858 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:55988 10.0.5.190:8443 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.190:54466 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.190:55898 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.190:54446 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:54736 10.0.5.190:8443 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:55782 10.0.5.190:8443 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.190:52748 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:56200 10.0.5.190:8443 ESTABLISHED 116157/lxd
tcp 0 0 10.0.5.190:8443 10.0.5.203:36370 ESTABLISHED 116157/lxd

Over time, the connection count keeps rising:

$ sudo netstat -tnp | grep lxd | wc -l
1623
$ sudo netstat -tnp | grep lxd | wc -l
1626
$ sudo netstat -tnp | grep lxd | wc -l
1628
$ sudo netstat -tnp | grep lxd | wc -l
1632
$ sudo netstat -tnp | grep lxd | wc -l
1711

On a second node the log was showing periodic:

lvl=warn msg=“Failed connecting to global database (attempt 55): failed to create dqlite connection: no available dqlite leader server found”

Likewise the machine seemed to be spiraling with connections:

$ sudo netstat -tnp | grep lxd | head -n25
tcp 0 0 10.0.5.203:53574 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:8443 10.0.5.203:54420 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:54240 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:35860 10.0.5.190:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:35566 10.0.5.190:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:53758 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:54402 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:53352 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:36402 10.0.5.190:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:34732 10.0.5.190:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:53250 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:8443 10.0.5.203:53930 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:53860 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:53458 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:54790 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:54036 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:54668 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:35638 10.0.5.190:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:8443 10.0.5.203:53188 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:53448 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:54002 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:8443 10.0.5.203:54888 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:8443 10.0.5.203:53232 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:54818 10.0.5.203:8443 ESTABLISHED 114464/lxd
tcp 0 0 10.0.5.203:35418 10.0.5.190:8443 ESTABLISHED 114464/lxd
$ sudo netstat -tnp | grep lxd | wc -l
806
$ sudo netstat -tnp | grep lxd | wc -l
938
$ sudo netstat -tnp | grep lxd | wc -l
1088
$ sudo netstat -tnp | grep lxd | wc -l
1729

The whole idea of having a 3 machines with one leader, that if that leaders dies, you have chaos is wrong. If you have 1 out of 3 go down, it should not be the end of the cluster. It computer should be it own master,and if it find someone else the one decides to be slave based on some priority table. Or it could even be last down. Right now it happens every single time I shut cluster down completely and start it.
I am going to look at open connections, I didn’t think of that. I was thinking it was just unix socket in general that was unavailable. Going to look at it tomorrow.

If you have 3 machines online, one of them at random will be elected as leader. If that leader dies, and you still have 2 machines online, another leader will be automatically elected at random. If the new leader in turns dies and you are left with only one machine, the cluster will be unavailable.

You can reboot the whole cluster and once a majority of machines come back online eventually a new leader will automatically be elected and the cluster will be available.

As mentioned, release 3.20 will bring some improvements that minimize the time the cluster is unavailable when the current leader gets shutdown gracefully, since that leader will transfer leadership to another online machine (if available) before shutting down, so no new election needs to take place.

Okay, it seems we have regressed in 3.19 wrt to 3.18.

I did several tests along the lines reported by @Tony_Anytime/@amcduffee and at least in one occasion I was able to deadlock the cluster.

The fix is in this PR:

Actually no, I was mistaken, sorry. The issue that the PR solves was introduced after 3.19, so it only affects master, not 3.19 or earlier. I’ll keep testing to see if I can reproduce what was reported.

A problem similar has been happening since version 3.xx something came about. It also happen in version 1, That is what I found in research. Not sure about before ver 3, But I can tell you for sure it happens in 3.18 and 3.19. Glad you are looking into it. I will look at code later to see if I can make sense of it. But the part that if leader dies and other is elected, voting dies if there is only one. And when you do a reboot, different machines boot at different times so you will always have one machine first, and then it looks like it locks up from there. I believe that is why I have to keep restarting lxd then until they sync up.

BTW, My little script to make it work

echo ‘Reseting Networking’
systemctl restart systemd-networkd
sleep 10
echo ‘Stopping Socket’
systemctl stop snap.lxd.daemon.unix.socket
sleep 10
echo ‘Starting Unix Socket’
systemctl start snap.lxd.daemon.unix.socket
sleep 10
echo ‘Stopping lxd’
systemctl stop snap.lxd.daemon
echo ‘Starting lxd’
systemctl restart snap.lxd.daemon
sleep 10
lxc cluster list

Ok, I might have found the issue, and I pushed a fix here:

it will be merge into master and included in 3.20.

Will see what @stgraber thinks about backporting it to 3.19.

Cool… thanks. … because you know, “Murphy always happens when you least expect it and not in the node you expect”

How long before 3.20 is coming?

Within 2 weeks I think.

Yeah, we’re on an accelerated schedule post-VM and storage rework to reduce the amount of time during which we need to do a lot of cherry-picks of fixes.

Those patches that fix bugs and apply cleanly, I’ll put in 3.19 but if we need custom versions of them, we’ll just wait until 3.20.

Yeah, couple weeks is find. I have my 3.18/9 temp fix… even if it take pumping it till it works for a few minutes. It would be nice to have cluster controls like we have for LXC container.
Like LXC Cluster makeprimemaster MOE, or LXC Cluster backupdb, cluster shutdown, a cluster safe mode which would allow individual LXC controls outside of cluster. So you can do LXC commands even if cluster is down. All tools to make cluster management easier.

Should the above patches address the issue of constant new connections that I observed?

I realized through more testing that this case manifests under a much simpler scenario:

  1. Stop all cluster nodes using ‘snap stop lxd’. Wait for them all to fully stop.
  2. Start a single node using ‘snap start lxd’.
  3. Watch lxd.log and you will see:

    t=2020-01-21T11:32:10-0800 lvl=info msg=“Initializing global database”
    t=2020-01-21T11:33:34-0800 lvl=warn msg=“Failed connecting to global database (attempt 6): failed to create dqlite connection: no available dqlite leader server found”
    t=2020-01-21T11:33:47-0800 lvl=warn msg=“Failed connecting to global database (attempt 7): failed to create dqlite connection: no available dqlite leader server found”

  4. Watch network connections increase with:

    sudo watch -n1 ‘netstat -tnp | grep lxd | wc -l’

I also wanted to mention a couple of User Experience (UX) observations related to these cases.

  1. When I run ‘snap start lxd’ on the single node the command will hang until its 1 minute timeout and then show an error. This is rather non-intuitive because the daemon didn’t fail to start, instead it is just stuck trying to rejoin a cluster. Below is the error shown when ‘snap start lxd’ fails:

    $ sudo snap start lxd
    2020-01-21T11:33:09-08:00 ERROR # systemctl start snap.lxd.activate.service snap.lxd.daemon.service

    <exceeded
    maximum runtime of 1m1s>
    error: cannot perform the following tasks:

    • start of [lxd.activate lxd.daemon] (# systemctl start snap.lxd.activate.service snap.lxd.daemon.service

    <exceeded maximum runtime of 1m1s>)

    • start of [lxd.activate lxd.daemon] (exceeded maximum runtime of 1m1s)
  2. When a node is stuck in a state trying to reconnect to a cluster, like above, it doesn’t respond to stop commands. Again, I think this is non-intuitive because if it hasn’t joined up yet shouldn’t it be able to quickly give up trying and let the daemon exit? Result of a ‘snap stop lxd’ call in this case:

    $ sudo snap stop lxd
    error: cannot perform the following tasks:

    • stop of [lxd.activate lxd.daemon] (# systemctl stop snap.lxd.activate.service snap.lxd.daemon.service

    <exceeded maximum runtime of 1m1s>)

    • stop of [lxd.activate lxd.daemon] (exceeded maximum runtime of 1m1s)
  3. As an extension of (2), multiple ‘snap stop lxd’ calls in sequence will result in the same 1 minute timeout for the stop command itself. What isn’t apparent is that the underlying systemctl stop command is still running and will in fact give up trying after 9 minutes and forcefully kill the running daemon. The result of this is apparent by the following message after 9 minutes:

    $ systemctl status snap.lxd.daemon.service
    ● snap.lxd.daemon.service - Service for snap application lxd.daemon
    Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
    Active: failed (Result: exit-code) since Tue 2020-01-21 11:53:44 PST; 6s ago
    Process: 173346 ExecStop=/usr/bin/snap run --command=stop lxd.daemon (code=exited, status=0/SUCCESS)
    Process: 170377 ExecStart=/usr/bin/snap run lxd.daemon (code=exited, status=1/FAILURE)
    Main PID: 170377 (code=exited, status=1/FAILURE)

    Jan 21 11:32:09 node03 lxd.daemon[170377]: 12: fd: 18: unified
    Jan 21 11:32:09 node03 lxd.daemon[170377]: t=2020-01-21T11:32:09-0800 lvl=warn msg="CGroup memory swap accounting
    Jan 21 11:44:42 node03 systemd[1]: Stopping Service for snap application lxd.daemon…
    Jan 21 11:44:42 node03 lxd.daemon[173346]: => Stop reason is: host shutdown
    Jan 21 11:44:42 node03 lxd.daemon[173346]: => Stopping LXD (with container shutdown)
    Jan 21 11:53:43 node03 lxd.daemon[173346]: ==> Forcefully killing LXD after 9 minutes wait
    Jan 21 11:53:43 node03 lxd.daemon[173346]: => Stopping LXCFS
    Jan 21 11:53:43 node03 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
    Jan 21 11:53:44 node03 systemd[1]: snap.lxd.daemon.service: Failed with result ‘exit-code’.
    Jan 21 11:53:44 node03 systemd[1]: Stopped Service for snap application lxd.daemon.

I feel like the above 3 cases create panic scenarios for an Administrator when there may not actually be a good reason for it. One annoying side effect of the above case is it will hang a machine shutdown for 9 minutes unnecessarily.

Exactly… but let me tell you about panic…
You have 4 servers, which are really 3 because you have 4th as backup. And one server goes bad, needs upgrade/reboot or whatever. All of a sudden they are all in a deadlock. So then you shutdown and reboot all the servers, hoping for them to unstick. Oh did I mention a hundred containers per server, which means your getting lots of phone calls on why they are down. And you are freaking out and don’t know why they wont talk to each other. Down time causes panic. 9 minutes is an eternity. You can retry 9 times in one minute in that time. Nothing should take more than 2 minutes or even worse case 4 minutes.

1 Like

Thanks for the detailed reproducer, I could now work that out and the leak is fixed here:

Thank you for your “little script to make it work”. I have a 3-node IPv6 only cluster myself and as soon as I reboot a node the cluster dies. During the reboot the two servers left online will work just fine but as soon as the third comes back on its Failed to get current cluster nodes: driver: bad connection every second.

Stopping and starting snap.lxd.daemon.service hangs. Killing processes makes no difference. Your “slow lxd startup script” has just made my cluster come online.

I spend hours yesterday evening trying to fix my cluster and this morning I woke up to it being magically repaired at 04:09 AM. Logs tell me both the socket and service have been restarted. I had not touched the socket yet.

I will patiently wait for the commits mentioned in this topic to come live to fix these issues, i can’t wait to get my cluster up and running with Ceph.

1 Like