No database node in 3 machine cluster after graceful shutdowns

I have had to shut down a 3-machine cluster a couple of times recently due to power outages and/or scheduled outages. In both cases I was able to gracefully shut down all three machines, but after bringing them back up I am noticing weird issues and errors in lxc commands that deal with the cluster/raft databases.

For example:

$ lxc list
Error: failed to begin transaction: not an error
$ lxc cluster list
Error: failed to begin transaction: not an error

It seems in this situation that at least one machine does respond to the ‘lxc cluster list’ command, but it shows no database nodes:

$ lxc cluster list
+------------+-------------------------+----------+--------+-------------------+
|    NAME    |           URL           | DATABASE | STATE  |      MESSAGE      |
+------------+-------------------------+----------+--------+-------------------+
| node-ctl01 | https://10.0.5.190:8443 | NO       | ONLINE | fully operational |
+------------+-------------------------+----------+--------+-------------------+
| node03     | https://10.0.5.203:8443 | NO       | ONLINE | fully operational |
+------------+-------------------------+----------+--------+-------------------+
| node04     | https://10.0.5.204:8443 | NO       | ONLINE | fully operational |
+------------+-------------------------+----------+--------+-------------------+

I have already done ‘snap restart lxd’ on all machines multiple times, but it didn’t fix anything.
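For reference, this is roughly how I ran the restart across the machines (just a sketch assuming SSH access from my workstation; the hostnames are the member names from the cluster list above):

# Restart the LXD snap on every cluster member (run from a machine with SSH access to all of them).
for host in node-ctl01 node03 node04; do
    ssh "$host" sudo snap restart lxd
done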

$ snap version
snap 2.42.5
snapd 2.42.5
series 16
ubuntu 18.04
kernel 4.15.0-66-generic

$ snap info lxd
name: lxd

snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: stable
refresh-date: 43 days ago, at 18:38 PST

installed: 3.18 (12631) 57MB -

They appear to be doing heartbeats just fine:

$ lxc monitor --pretty --type=logging
DBUG[01-14|11:37:40] New event listener: 89460967-cdca-4dc9-9512-64b4c5e8bf0b
DBUG[01-14|11:37:43] Starting heartbeat round
DBUG[01-14|11:37:43] Heartbeat updating local raft nodes to [{ID:1 Address:10.0.5.203:8443} {ID:2 Address:10.0.5.204:8443} {ID:3 Address:10.0.5.190:8443}]
DBUG[01-14|11:37:45] Sending heartbeat to 10.0.5.190:8443
DBUG[01-14|11:37:45] Sending heartbeat request to 10.0.5.190:8443
DBUG[01-14|11:37:45] Successful heartbeat for 10.0.5.190:8443
DBUG[01-14|11:37:49] Sending heartbeat to 10.0.5.203:8443
DBUG[01-14|11:37:49] Sending heartbeat request to 10.0.5.203:8443
DBUG[01-14|11:37:49] Successful heartbeat for 10.0.5.203:8443
DBUG[01-14|11:37:49] Completed heartbeat round
DBUG[01-14|11:37:53] Starting heartbeat round
DBUG[01-14|11:37:53] Heartbeat updating local raft nodes to [{ID:1 Address:10.0.5.203:8443} {ID:2 Address:10.0.5.204:8443} {ID:3 Address:10.0.5.190:8443}]
DBUG[01-14|11:37:56] Sending heartbeat request to 10.0.5.203:8443
DBUG[01-14|11:37:56] Sending heartbeat to 10.0.5.203:8443
DBUG[01-14|11:37:56] Successful heartbeat for 10.0.5.203:8443
DBUG[01-14|11:38:00] Sending heartbeat to 10.0.5.190:8443
DBUG[01-14|11:38:00] Sending heartbeat request to 10.0.5.190:8443
DBUG[01-14|11:38:00] Successful heartbeat for 10.0.5.190:8443
DBUG[01-14|11:38:00] Completed heartbeat round
DBUG[01-14|11:38:03] Starting heartbeat round
DBUG[01-14|11:38:03] Heartbeat updating local raft nodes to [{ID:1 Address:10.0.5.203:8443} {ID:2 Address:10.0.5.204:8443} {ID:3 Address:10.0.5.190:8443}]

Any idea what I should try next?

I want to clarify my previous statement about ‘snap restart lxd’ not working. What I should have said is that doing so has restored the ability of all three machines to run ‘lxc list’ and ‘lxc cluster list’. However, it has not changed the cluster listing, which still shows ‘NO’ under DATABASE for all machines.

The ‘snap restart lxd’ command timed out on one of the three machines the last time I ran it across all three; a second invocation on that machine then succeeded.
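If it times out again, I assume I can check the LXD daemon log on that machine to see what it was stuck on, with something like this (snap.lxd.daemon being the systemd unit the snap runs the daemon under):

$ journalctl -u snap.lxd.daemon -n 100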

I can’t say with absolute certainty at this point, but I think the cluster may be getting itself into this state after a certain amount of time. I have already had to do the snap restart on a couple of separate occasions to get the list commands working again on all machines, and I can tell when it is needed because running a list command produces the transaction error mentioned above.

I suspect the cluster isn’t in a healthy state because all nodes show ‘NO’ for database, but I am not entirely sure, since I don’t really understand the meaning or implications of every node being database=NO.

Apart from the fact that the database column of lxc cluster list says NO, and possibly that the cluster might somehow revert to a bad state after a certain time, is everything else working fine? In other words, once you have restored it, does it work completely normally, at least for a while?

Yes, what I have tested so far has worked without issue. I am able to start, stop and move containers between nodes where the container is backed by a Ceph RBD.
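For example, a move between members works with the plain move command (the container name here is just an example; node03 is one of the members listed above):

$ lxc move web01 --target node03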

I am going to monitor the machines and see if they end up back in a state with the transaction error messages. I will check whether or not the start/stop/move commands still work under that condition if it manifests again.
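Roughly what I plan to leave running to catch it (a quick sketch; the interval and log path are arbitrary):

# Poll the cluster every 5 minutes and log a timestamp whenever 'lxc list' fails.
while true; do
    lxc list >/dev/null 2>&1 || echo "$(date): lxc list failed" >> ~/lxd-cluster-check.log
    sleep 300
done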

Ok, if the cluster keeps working, it might just be a glitch. I’ll give you a SQL command to fix the database column and that should be it.

Sure, but I am going to hold off on running the SQL for a few days because I am more concerned that this might be a chronic issue. An aesthetic bug in the cluster list table matters less to me than the possibility that the cluster is getting into a state that could cause other problems.

All three machines show the same output as seen below:

$ lxd sql global "SELECT * FROM nodes_roles;"
+---------+------+
| node_id | role |
+---------+------+
+---------+------+

I just checked the machines again and the problem has not reappeared. Possibly it is a one-time quirk after rebooting the cluster.

What is the SQL command for fixing the ‘lxc cluster list’ table showing database NO?

Assuming that:

lxd sql local "SELECT * FROM raft_nodes"

returns exactly the three nodes that are part of your cluster, then the fix is:

lxd sql global "INSERT INTO nodes_roles (node_id, role) SELECT id, 0 FROM nodes"
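Once that has run, re-running the two commands from earlier in the thread should show one row per node (role 0) in nodes_roles and the database column back to YES:

lxd sql global "SELECT * FROM nodes_roles;"
lxc cluster list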

That fixed the output of cluster list. Thanks.