Seeming cluster failure with: "Retry failed db interaction (driver: bad connection)"

So, I have a four-node cluster running under ESXi. All is well for a while, but then every so often the cluster randomly bugs out, with this scrolling in the syslog: “Retry failed db interaction (driver: bad connection)”

No lxc commands work (lxc cluster list, lxc list and so on); they all return “Error: cannot fetch node config from database: driver: bad connection”, which would make sense.

The only thing that seems to cure it is a systemctl reload snap.lxd.daemon.service on all nodes. Thankfully, none of this seems to impact the running containers. However, I don’t really know what’s causing it or what to do to make it go away permanently. This is a cluster for work which I’m trying to sell everyone else on, and this kind of failure doesn’t make that easy :slight_smile:
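
Concretely, the recovery amounts to something like this from a management box (just a sketch; the hostnames are placeholders for the four members):

for host in node1 node2 node3 node4; do
    ssh root@"$host" systemctl reload snap.lxd.daemon.service
done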

What’s odd is my far larger cluster, running under QEMU/KVM, is rock solid and has no issues at all.

Not sure where to go from here :frowning:

@freeekanayaka

Please could you enable debug logging on all nodes, with:

snap set lxd daemon.debug=true

then restart all daemons, wait for the issue to happen again, and attach a tarball with the logs of all nodes here (trimmed to roughly 5 minutes before the issue first happened, if they are too large). You can also send them to me by email, if you prefer.
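
Something along these lines on each node should cover the restart and log collection (a rough sketch; the reload form restarts LXD without stopping containers, and the --since window is just an example to widen or narrow as needed):

systemctl reload snap.lxd.daemon.service
# once the issue has reproduced, grab the relevant window on each node:
journalctl -u snap.lxd.daemon.service --since "30 minutes ago" > lxd-$(hostname).log
# then bundle the per-node files from wherever you collected them:
tar czf lxd-debug-logs.tar.gz lxd-*.log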

Thanks

Sure, can do. I already had that in place on one of the nodes and all it said before things went sideways was:

Apr 27 16:07:42 starbase6 lxd.daemon[8380]: t=2020-04-27T16:07:42-0400 lvl=dbug msg=Handling ip=@ method=GET url=/1.0/operations/ba3c9a2e-ce18-4873-8ff5-16d8f93bc53c user=
Apr 27 16:07:42 starbase6 lxd.daemon[8380]: t=2020-04-27T16:07:42-0400 lvl=dbug msg="Connecting to a remote LXD over HTTPs"
Apr 27 16:07:47 starbase6 lxd.daemon[8380]: t=2020-04-27T16:07:47-0400 lvl=dbug msg="Found cert" name=0
Apr 27 16:07:47 starbase6 lxd.daemon[8380]: t=2020-04-27T16:07:47-0400 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:10.4.90.5:8443 Role:stand-by} {ID:2 Address:10.4.90.2:8443 Role:voter} {ID:3 Address:10.4.90.3:8443 Role:voter} {ID:4 Address:10.4.90.4:8443 Role:voter}]"
Apr 27 16:07:47 starbase6 lxd.daemon[8380]: t=2020-04-27T16:07:47-0400 lvl=dbug msg="Rebalance cluster"
Apr 27 16:07:52 starbase6 lxd.daemon[8380]: t=2020-04-27T16:07:52-0400 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}"
Apr 27 16:07:52 starbase6 lxd.daemon[8380]: t=2020-04-27T16:07:52-0400 lvl=dbug msg="Retry failed db interaction (driver: bad connection)"

Prior to that, it was mostly messages very similar to those. Offhand, it looks to me like something caused the DB to crash in the process of that final rebalance. Are there logs anywhere to check for that in particular?

There are no logs specific to rebalancing, but if any error had occurred while rebalancing, it would have been logged. So it feels like the problem lies somewhere else.

The most common cause of a “bad connection” error is benign and temporary, namely a change of leader. Usually the database client just connects to the new leader and retries. I would expect the debug log of some other node to show something about leadership change around that time.
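
If you want to check for that, something like this on each node is a quick (if rough) way to scan the journal around the failure time; treat the grep terms as a starting point, since the exact debug wording varies between versions:

journalctl -u snap.lxd.daemon.service --since "1 hour ago" | grep -iE "heartbeat|leader|bad connection"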

BTW, just to be sure, are you using the latest 4.0 LXD snap?

Correct. Currently on 4.0.1. There are a lot of messages about rebalancing the cluster. Is such a frequency expected?

It depends on how many cluster nodes you have. But generally, yes, that log message is just a periodic check if anything needs to be rebalanced. Most of the time it will be a no-op.

Four in this cluster. It seems to be happening around every 10 seconds or so?

With 4 nodes you’ll have 3 voting nodes, 1 stand-by node and 0 spare nodes. Since by default we attempt to have 3 voting nodes, 2 stand-by nodes and N>=0 spare nodes, the cluster checks if there is a new node that it can promote to stand-by, and yes that happens every 10 seconds.
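
If you want to see which members currently hold which role, lxc cluster list will show it (depending on the LXD version, as a DATABASE yes/no column or an explicit ROLES column):

lxc cluster list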

If you want to get rid of that particular message, then:

lxc config set cluster.max_standby 1

should make the cluster stop trying to rebalance. But again, that shouldn’t be an issue either way.
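
To confirm the setting took effect, lxc config get will read the server configuration key back:

lxc config get cluster.max_standby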

The message is fine, so long as it doesn’t indicate any issues. Though I do wonder if the smaller number of nodes may be part of what’s leading to the instability. My cluster here at home is eight nodes and, so far as I know, hasn’t failed once, while the prototype at work is, as previously noted, only four and fails frequently. The way it appears to me is that at some point in the process of rebalancing something goes wrong, possibly network related, and then it just never properly recovers. Here’s what I was able to capture from the logs of all the nodes. Prior to these points, all the nodes were emitting normal messages, both from LXD and the system at large.

starbase6: 29 00:26:27 starbase6 lxd.daemon[16877]: t=2020-04-29T00:26:27-0400 lvl=dbug msg="Found cert" name=0
starbase6: 29 00:26:28 starbase6 lxd.daemon[16877]: t=2020-04-29T00:26:28-0400 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}"
starbase6: 29 00:26:28 starbase6 lxd.daemon[16877]: t=2020-04-29T00:26:28-0400 lvl=dbug msg="Retry failed db interaction (driver: bad connection)"

lxdnode1: 29 00:25:48 lxdnode1 lxd.daemon[12168]: t=2020-04-29T00:25:48-0400 lvl=dbug msg="Failed heartbeat for 10.4.90.5:8443: failed to send HTTP request: Put \"https://10.4.90.5:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

lxdnode2: 29 00:25:52 lxdnode2 lxd.daemon[12565]: t=2020-04-29T00:25:52-0400 lvl=dbug msg="Rebalance cluster"
lxdnode2: 29 00:26:02 lxdnode2 lxd.daemon[12565]: t=2020-04-29T00:26:02-0400 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}"

lxdnode3:  29 00:26:27 lxdnode3 lxd.daemon[9308]: t=2020-04-29T00:26:27-0400 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}"

So, yeah. The way it reads to me is that it was trying to rebalance, one of the nodes timed out for whatever reason, and then it never recovered from that error.

Would it be possible to have a tarball of the logs of all those 4 nodes? Just to make sure I have the complete picture, thanks.

Sure. Where can I send them?

free.ekanayaka@canonical.com, thanks

I’ll have those along momentarily. Thanks :slight_smile:

Let me know if you didn’t get the email :slight_smile:

Got the logs now, yes, thanks. I’ll look at them tomorrow.

Sounds good. Thanks again :slight_smile:

So I’m not sure exactly where the bug is, but it seems it’s caused by this sequence of events:

  1. lxdnode1 (10.4.90.2), which is currently the leader, starts a heartbeat round:
Apr 29 00:25:40 lxdnode1 lxd.daemon[12168]: t=2020-04-29T00:25:40-0400 lvl=dbug msg="Starting heartbeat round"
  2. It can successfully perform the heartbeat against lxdnode3 (10.4.90.4):
Apr 29 00:25:43 lxdnode1 lxd.daemon[12168]: t=2020-04-29T00:25:43-0400 lvl=dbug msg="Successful heartbeat for 10.4.90.4:8443"
Apr 29 00:25:43 lxdnode3 lxd.daemon[9308]: t=2020-04-29T00:25:43-0400 lvl=dbug msg="Replace current raft nodes with [{ID:4 Address:10.4.90.4:8443 Role:voter} {ID:1 Address:10.4.90.5:8443 Role:stand-by} {ID:2 Address:10.4.90.2:8443 Role:voter} {ID:3 Address:10.4.90.3:8443 Role:voter}]"

and against lxdnode2 (10.4.90.3):

Apr 29 00:25:45 lxdnode1 lxd.daemon[12168]: t=2020-04-29T00:25:45-0400 lvl=dbug msg="Sending heartbeat to 10.4.90.3:8443"
Apr 29 00:25:45 lxdnode2 lxd.daemon[12565]: t=2020-04-29T00:25:45-0400 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:10.4.90.5:8443 Role:stand-by} {ID:2 Address:10.4.90.2:8443 Role:voter} {ID:3 Address:10.4.90.3:8443 Role:voter} {ID:4 Address:10.4.90.4:8443 Role:voter}]"
  3. It tries to perform the heartbeat against starbase6 (10.4.90.5):
Apr 29 00:25:46 lxdnode1 lxd.daemon[12168]: t=2020-04-29T00:25:46-0400 lvl=dbug msg="Sending heartbeat to 10.4.90.5:8443"
  4. Around that time starbase6 seems to be kind of online, but either unresponsive (high CPU usage on the host running this VM?) or with some network issues, so the heartbeat fails:
Apr 29 00:25:31 starbase6 lxd.daemon[16877]: t=2020-04-29T00:25:31-0400 lvl=dbug msg="Rebalance cluster"
Apr 29 00:25:48 starbase6 lxd.daemon[16877]: t=2020-04-29T00:25:48-0400 lvl=dbug msg="Found cert" name=0
Apr 29 00:25:48 starbase6 lxd.daemon[16877]: t=2020-04-29T00:25:48-0400 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:10.4.90.5:8443 Role:stand-by} {ID:2 Address:10.4.90.2:8443 Role:voter} {ID:3 Address:10.4.90.3:8443 Role:voter} {ID:4 Address:10.4.90.4:8443 Role:voter}]"
Apr 29 00:25:48 starbase6 lxd.daemon[16877]: t=2020-04-29T00:25:48-0400 lvl=dbug msg="Rebalance cluster"
Apr 29 00:25:49 starbase6 dnsmasq-dhcp[17134]: RTR-ADVERT(lanbridge) fd42:13c6:b2a7:eef0::
Apr 29 00:25:48 lxdnode1 lxd.daemon[12168]: t=2020-04-29T00:25:48-0400 lvl=dbug msg="Failed heartbeat for 10.4.90.5:8443: failed to send HTTP request: Put \"https://10.4.90.5:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

We do have integration tests that exercise a node going down and failing to respond to heartbeats, but in those tests the node is completely offline, so connections fail in the first place. This, however, seems to be a peculiar situation: starbase6 seems to accept the TCP/HTTP connection from lxdnode1, but then appears essentially stuck from lxdnode1’s point of view, which fails the heartbeat in a way that apparently leads to some bad state.
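
For what it’s worth, a crude manual way to tell those two cases apart (just a sketch, not something LXD does for you) is to hit the suspect member from the leader with a short timeout: a responsive daemon answers the API root almost immediately, while a wedged one accepts the connection and then hangs until curl gives up:

curl -k -s -o /dev/null -w "%{http_code}\n" --max-time 2 https://10.4.90.5:8443/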

I’ll try to tweak our time tolerance a little bit and look at that code in general.

In the meantime, how stable do you think your ESXi VMs are in terms of CPU and network? Do you think there could be hiccups lasting more than a second, where either the CPU or the network are unresponsive?
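
One crude way to find out (a sketch; adjust the addresses and interval to your nodes) is to leave a timestamped ping running between the members and line it up against the next failure:

ping -D -i 1 10.4.90.5 > /var/log/ping-10.4.90.5.log &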

For various reasons I’m going with the presumption that the network is not stable. Your analysis is pretty much what I figured was the general case, though I wasn’t sure of the details. For most of this time, I was presuming the root cause was network instability, which then triggered some unknown (to me) condition that subsequently caused things to bug out.

Let me know when there’s something new to test? :slight_smile: In the meantime, I’ll poke our network people to see what they can do as well.