Excluding offline node from refresh, but everything seems ok

Can you also maybe try running a constant ping between both systems to rule out some kind of packet loss?

I don’t see a high load, at least nothing that matches the consistent, roughly 5-minute interval of the error. And there is no packet loss either, nor any spikes in ping times.

168 packets transmitted, 168 received, 0% packet loss, time 171005ms
rtt min/avg/max/mdev = 0.140/0.228/0.277/0.018 ms

The heartbeat burst is really weird. Within the burst there are two sets of “Starting/Completed heartbeat round”, and there are four(!) sets of heartbeats. As far as I can see all heartbeats within the burst are valid.

Look at the timestamps of “Starting heartbeat round”:

DBUG[02-02|14:26:10] Starting heartbeat round 
DBUG[02-02|14:26:20] Starting heartbeat round 
DBUG[02-02|14:26:30] Starting heartbeat round 
DBUG[02-02|14:26:40] Starting heartbeat round 
DBUG[02-02|14:26:50] Starting heartbeat round 
DBUG[02-02|14:26:56] Starting heartbeat round 
DBUG[02-02|14:26:56] Starting heartbeat round 
DBUG[02-02|14:27:00] Starting heartbeat round 
DBUG[02-02|14:27:10] Starting heartbeat round 
DBUG[02-02|14:27:20] Starting heartbeat round 
DBUG[02-02|14:27:30] Starting heartbeat round 
DBUG[02-02|14:27:40] Starting heartbeat round 
DBUG[02-02|14:27:50] Starting heartbeat round 
DBUG[02-02|14:27:56] Starting heartbeat round 
DBUG[02-02|14:27:56] Starting heartbeat round 
DBUG[02-02|14:28:00] Starting heartbeat round 
DBUG[02-02|14:28:10] Starting heartbeat round 
DBUG[02-02|14:28:20] Starting heartbeat round 
DBUG[02-02|14:28:30] Starting heartbeat round 
DBUG[02-02|14:28:40] Starting heartbeat round 
DBUG[02-02|14:28:50] Starting heartbeat round 
DBUG[02-02|14:28:56] Starting heartbeat round 
DBUG[02-02|14:28:56] Starting heartbeat round 
DBUG[02-02|14:29:00] Starting heartbeat round 

@stgraber as you can see from the Starting timestamps, the timing is consistent. Besides the regular heartbeat at a 10-second interval, there are two (!) extra heartbeat calls at :56, recurring very consistently at a 60-second interval.

These heartbeat rounds at :56 are nested, which is of course not good. I suggest we need some kind of protection so that an ongoing heartbeat round cannot be entered a second (or even a third, I’ve seen that too) time.

What happens with a heartbeat response after it times out?

BTW, I’ve briefly analysed the heartbeat response times. As far as I can see they are all immediate, until for one node a response suddenly hits the timeout. But since the pace of heartbeats gets messed up once per minute, it is hard to understand what is going on. Mystery 1: why that one node? Mystery 2: why the two extra heartbeats?

I’ve created a GitHub issue requesting that, at the very least, nested heartbeat calls be avoided. It’s probably not the solution to my problem(s), but it will simplify analysis.

@stgraber Update: last night we had a snap update to LXD 4.11.

Some observations:

  • the “Failed heartbeat” messages are gone
  • the leader role shifted to .45
  • the nested heartbeats are still there
  • the heartbeat intervals are not very regular

Of course it is possible that the “Failed heartbeat” depends on timing, and that that is why it no longer occurs. The failed heartbeat was with .45, and that is now the node holding the leader role.

On the previous leader the intervals were very precise. On the new leader they are messy, as if the timer is being reset all the time.

New errors: every hour the leader issues database errors (seen via lxc monitor debug output).

location: luts
metadata:
  context: {}
  level: dbug
  message: 'Database error: protocol.Error{Code:5, Message:"database is locked"}'
timestamp: "2021-02-09T13:16:26.440226284+01:00"
type: logging


location: luts
metadata:
  context: {}
  level: dbug
  message: Retry failed db interaction (database is locked)
timestamp: "2021-02-09T13:16:26.540130917+01:00"
type: logging

These messages continue for about 30 seconds, and this is repeated every hour. Meanwhile it does the heartbeats.

That’s unlikely to be an issue; those are indeed debug messages and indicate that the database is temporarily in use by another transaction and that LXD has retried the transaction.

The heartbeats are meant to be irregular. I don’t remember the exact logic, but basically they’re trying to ensure that a batch is completed before the last one plus the cluster timeout is reached. The individual heartbeats are also sent in parallel, each with a small random delay, to avoid all of them going out at once and causing a potential network/db spike.

I still need to spend some time to investigate your Github issue and see why we may have multiple batches running at once though.

You’re referring to this:

func (hbState *APIHeartbeat) Send(ctx context.Context, cert *shared.CertInfo, localAddress string, nodes []db.NodeInfo, delay bool) {
	heartbeatsWg := sync.WaitGroup{}
	sendHeartbeat := func(nodeID int64, address string, delay bool, heartbeatData *APIHeartbeat) {
		defer heartbeatsWg.Done()

		if delay {
			// Spread in time by waiting up to 3s less than the interval.
			time.Sleep(time.Duration(rand.Intn((heartbeatInterval*1000)-3000)) * time.Millisecond)
		}
		logger.Debugf("Sending heartbeat to %s", address)

However, the start of the heartbeat func is expected to happen precisely at 10-second intervals, I think. That is what I saw before the switch to 4.11: the “Starting heartbeat round” log messages appeared at a very consistent interval. You can probably see that on your clusters too.

But now, that is not true anymore.

The heartbeat task is set to run every 10s, but the task logic is configured not to allow concurrent runs: the next run ideally starts on the 10s timer, but if the previous one took too long, it is scheduled immediately after that run exits.
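In miniature, that scheduling behaviour is something like this (just a sketch, not the actual task package):

package main

import "time"

// Sketch of the scheduling behaviour described above (not LXD's actual
// task package). A single loop runs the task, so two runs can never
// overlap; if a run takes longer than the interval, the next one starts
// right after it returns instead of piling up.
func runEvery(interval time.Duration, task func()) {
	for {
		start := time.Now()
		task()
		if remaining := interval - time.Since(start); remaining > 0 {
			time.Sleep(remaining)
		}
	}
}

func main() {
	runEvery(10*time.Second, func() {
		// a heartbeat round would go here
	})
}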

On my cluster (4.11), this is behaving as expected:

stgraber@castiana:~/data/code/lxc/lxd (stgraber/vm)$ lxc monitor --type=logging --pretty
DBUG[02-09|20:00:22] New event listener: 974ab9e4-ec95-426e-ba81-3d271e8bfdd5 

DBUG[02-09|20:00:22] Starting heartbeat round 
DBUG[02-09|20:00:22] Heartbeat updating local raft nodes to [{ID:1 Address:[2602:fd23:8:200::100]:8443 Role:voter} {ID:2 Address:[2602:fd23:8:200::101]:8443 Role:voter} {ID:3 Address:[2602:fd23:8:200::102]:8443 Role:voter}] 
DBUG[02-09|20:00:25] Sending heartbeat to [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:25] Sending heartbeat request to [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:25] Sending heartbeat to [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:25] Sending heartbeat request to [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:25] Successful heartbeat for [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:25] Successful heartbeat for [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:25] Completed heartbeat round 

DBUG[02-09|20:00:32] Starting heartbeat round 
DBUG[02-09|20:00:32] Heartbeat updating local raft nodes to [{ID:1 Address:[2602:fd23:8:200::100]:8443 Role:voter} {ID:2 Address:[2602:fd23:8:200::101]:8443 Role:voter} {ID:3 Address:[2602:fd23:8:200::102]:8443 Role:voter}] 
DBUG[02-09|20:00:36] Sending heartbeat request to [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:36] Sending heartbeat to [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:36] Successful heartbeat for [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:38] Sending heartbeat to [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:38] Sending heartbeat request to [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:38] Successful heartbeat for [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:38] Completed heartbeat round 

DBUG[02-09|20:00:42] Starting heartbeat round 
DBUG[02-09|20:00:42] Heartbeat updating local raft nodes to [{ID:1 Address:[2602:fd23:8:200::100]:8443 Role:voter} {ID:2 Address:[2602:fd23:8:200::101]:8443 Role:voter} {ID:3 Address:[2602:fd23:8:200::102]:8443 Role:voter}] 
DBUG[02-09|20:00:43] Sending heartbeat request to [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:43] Sending heartbeat to [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:43] Successful heartbeat for [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:48] Sending heartbeat to [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:48] Sending heartbeat request to [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:48] Successful heartbeat for [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:48] Completed heartbeat round 

DBUG[02-09|20:00:52] Starting heartbeat round 
DBUG[02-09|20:00:52] Heartbeat updating local raft nodes to [{ID:1 Address:[2602:fd23:8:200::100]:8443 Role:voter} {ID:2 Address:[2602:fd23:8:200::101]:8443 Role:voter} {ID:3 Address:[2602:fd23:8:200::102]:8443 Role:voter}] 
DBUG[02-09|20:00:55] Sending heartbeat to [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:55] Sending heartbeat request to [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:55] Successful heartbeat for [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:00:56] Sending heartbeat to [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:56] Sending heartbeat request to [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:56] Successful heartbeat for [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:00:56] Completed heartbeat round 

DBUG[02-09|20:01:02] Starting heartbeat round 
DBUG[02-09|20:01:02] Heartbeat updating local raft nodes to [{ID:1 Address:[2602:fd23:8:200::100]:8443 Role:voter} {ID:2 Address:[2602:fd23:8:200::101]:8443 Role:voter} {ID:3 Address:[2602:fd23:8:200::102]:8443 Role:voter}] 
DBUG[02-09|20:01:04] Sending heartbeat request to [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:01:04] Sending heartbeat to [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:01:05] Successful heartbeat for [2602:fd23:8:200::101]:8443 
DBUG[02-09|20:01:06] Sending heartbeat request to [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:01:06] Sending heartbeat to [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:01:06] Successful heartbeat for [2602:fd23:8:200::102]:8443 
DBUG[02-09|20:01:06] Completed heartbeat round 

Outside of the scheduled heartbeat, an additional heartbeat round is triggered when a new database connection is established. That’s to help with quickly recovering during leader election. Maybe it’s that code path that’s being hit in your case.
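Conceptually that extra trigger amounts to no more than this (a sketch with made-up names, not the actual code path):

package main

// Hypothetical sketch of the trigger described above; onDatabaseConnected
// and heartbeatNow are made-up names, not LXD identifiers.
func onDatabaseConnected(heartbeatNow func()) {
	// An immediate, out-of-schedule round helps the cluster converge
	// quickly after a leader election or a reconnection. In the debug
	// log it shows up as just another "Starting heartbeat round".
	heartbeatNow()
}

func main() {
	onDatabaseConnected(func() { /* run a heartbeat round */ })
}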

May I suggest adding a log message when that happens?

In the current situation I see three(!) extra heartbeat rounds, all within one second. They come precisely every minute (60 seconds), as if they are (indirectly) scheduled, and in between the normal 10s heartbeat rounds.

Not sure why that would happen quite so much; it sounds like dqlite is disconnecting and reconnecting a lot somehow.

Anyway, to avoid the main issue, I think we’ll want to add both locking (to avoid two concurrent runs) and logic to not run if another run completed less than 15s ago.
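Something along these lines (only a sketch with made-up names, not what will actually land in the tree):

package main

import (
	"sync"
	"time"
)

// Sketch of the proposed protection: shared state guarded by a mutex so
// two rounds never run concurrently, plus a 15s cooldown so a round is
// skipped if the previous one finished too recently.
var (
	heartbeatMu   sync.Mutex
	roundRunning  bool
	lastCompleted time.Time
)

// tryStartRound reports whether a new round may start now.
func tryStartRound() bool {
	heartbeatMu.Lock()
	defer heartbeatMu.Unlock()

	if roundRunning || time.Since(lastCompleted) < 15*time.Second {
		return false // skip: already running, or completed less than 15s ago
	}
	roundRunning = true
	return true
}

// finishRound marks the round as done and records the completion time.
func finishRound() {
	heartbeatMu.Lock()
	defer heartbeatMu.Unlock()

	roundRunning = false
	lastCompleted = time.Now()
}

func main() {
	if tryStartRound() {
		defer finishRound()
		// run the heartbeat round here
	}
}

In this sketch extra triggers are simply dropped rather than queued up behind the lock; whether the real implementation blocks or skips is a separate decision.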

@stgraber I see that a lock is implemented to avoid nested heartbeat rounds (#8450). Great.

Nevertheless, I’m still a bit worried about how that will affect my situation. To be honest, I have no knowledge of the Go world. I’m guessing that the next call will wait until the lock is free. If that’s true, then I will see the three extra calls “pile up”. Every minute I see 9(!) heartbeat rounds, each taking roughly 8 seconds. That’s 72 seconds. Can you please tell me that I’m wrong? I’m scared that my system will blow up when 4.12 is deployed.

Thanks for #8453

Now eagerly looking forward to 4.12

@stgraber Now that 4.12 is up and running, would you be interested in looking at a new monitor log? Basically the behavior is still there, except that nested heartbeat rounds are now suppressed.

Yeah, I guess updated logs may be useful.

How or where shall I send it?

Can you e-mail me (stgraber at ubuntu dot com) the following:

  • lxd sql global "SELECT * FROM nodes;" on any of the nodes
  • lxd sql local "SELECT * FROM raft_nodes;" from each of the nodes
  • 30s of output from lxc monitor --type=logging --pretty from each of the nodes

@stgraber

Hmm, I was just about to ask whether you were able to see anything in the logs when I checked the systems. Huh? The problem seems to be gone, since March 18, around 13:21 local time.

Nothing obvious changed or happened at that moment. No reboots, no snap updates.

I think the 4.12 snap was deployed on March 10, 12, or 17; those are the moments when there was a “Stopped/Started” log message. And I have the directory timestamps.

# ls -l /snap/lxd
total 0
drwxr-xr-x 19 root root 261 mrt 12 00:55 19766
drwxr-xr-x 19 root root 261 mrt 17 18:06 19823
lrwxrwxrwx  1 root root   5 mrt 17 21:46 current -> 19823

So, for some reason LXD got fed up and stopped annoying me. I think I need to be more gentle with it. :slight_smile: