[Solved] Wait for other cluster nodes to upgrade their versions


#1

Looks like I spoke too soon, how to force a schema upgrade?

All three servers give this message:

DBUG[04-09|14:17:06] Database error: &errors.errorString{s:"schema check gracefully aborted"}
INFO[04-09|14:17:06] Wait for other cluster nodes to upgrade their versions

One of the four nodes in the cluster is down; is it waiting for that one node?


(Stéphane Graber) #2

Yes, it sounds like the 4th server is running an older version. LXD automatically triggers a refresh when that happens. Are all servers on the same channel?


#3

They are now, one was not.

I tried to put the others on candidate to upgrade them.

I also pulled down the global/db.bin file. The 4th node is unreliable in a hardware sense, and I plan on removing it anyway. Would it be OK to just remove the node from the SQLite DB on all servers and replace the file?

All are on stable now. Except the down node which probably isn’t coming back up.


#4

DBUG[04-09|14:36:41] Failed heartbeat for ~.~.~.~:8443: failed to send HTTP request: Put https://~.~.~.~:8443/internal/database: dial tcp ~.~.~.~:8443: i/o timeout


(Stéphane Graber) #5

You shouldn’t be directly messing with the global database, especially when clustered, unless you have a backup of the entire directory from all database nodes taken at the exact same time.

db.bin is a temporary file assembled from the Raft logs; restoring it will have no effect on the DB, and it will be overwritten automatically on startup.

What does lxc cluster list on one of the working nodes currently give you?

As of a few hours ago, both candidate and stable mean the exact same thing, so now is a good time to run snap refresh lxd --stable to ensure all nodes are on stable.


#6

lxc cluster list is unresponsive.

snap "lxd" has no updates available


(Stéphane Graber) #7

@freeekanayaka shouldn’t lxc cluster list still be responsive when a cluster has inconsistent versions?

In this case, the cluster has 4 nodes, 3 of them upgraded properly, the 4th is offline (and therefore not upgraded).

I’d expect to still be able to lxc cluster list and lxc cluster remove --force in such a case so we don’t get stuck.


(Stéphane Graber) #8

Ok, so other than the previous message, which needs some input from @freeekanayaka: what’s the state of that 4th node? Can it be brought back to life long enough for the cluster to let you remove it?

The alternative would be a global DB patch on one of the other nodes to bump the DB version of the offline node, so that the cluster thinks it’s consistent, lets you start, and then lets you remove the node.


#9

The alternative is fine. What’s the best way to do that? Otherwise it would most likely mean a motherboard replacement and a trip to the datacenter.

lxc cluster remove --force ipaddress is also unresponsive.

ipaddress:port says Error: The remote "-.-.-.-" doesn't exist
hostname:port says Error: The remote "hostname" doesn't exist


(Stéphane Graber) #10

What’s the output of sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin "SELECT * FROM nodes;"?


#11

1|3APP||-.-.-.-:8443|14|125|2019-04-09 11:49:32.883668059-05:00|0
3|4APP||-.-.-.-:8443|14|125|2019-04-09 11:49:32.920211806-05:00|0
4|2APP||-.-.-.-:8443|14|125|2019-04-09 11:49:32.96878266-05:00|0
5|1APP||-.-.-.-:8443|14|118|2019-03-18 14:29:38.268341959-05:00|0


(Stéphane Graber) #12

Ok, create a file at /var/snap/lxd/common/lxd/database/patch.global.sql which contains:

UPDATE nodes SET api_extensions=125 WHERE id=5;

Then run systemctl reload snap.lxd.daemon

Only do this on a single node. The patch should be applied to the database on startup and hopefully unblock the cluster.
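For anyone hitting this later, here's a throwaway sketch of what that one-line patch does, run against a disposable demo database at a hypothetical /tmp path (the table here is a simplified stand-in, not LXD's real schema); never run experiments like this against the live global database:

```shell
# Demo only: a simplified stand-in for the nodes table, NOT LXD's real schema.
rm -f /tmp/nodes-demo.db
sqlite3 /tmp/nodes-demo.db "CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, api_extensions INTEGER);"
sqlite3 /tmp/nodes-demo.db "INSERT INTO nodes VALUES (1, '3APP', 125), (5, '1APP', 118);"
# The same statement as in patch.global.sql: bump the stale node's API extension count.
sqlite3 /tmp/nodes-demo.db "UPDATE nodes SET api_extensions=125 WHERE id=5;"
sqlite3 /tmp/nodes-demo.db "SELECT id, api_extensions FROM nodes;"
```

After the UPDATE, both rows report api_extensions 125, which is what lets the cluster consider the node versions consistent on the next startup.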


#13

Alright, so the one node started. When doing an lxc list, the containers running on the other nodes show an error. Should I restart LXD on them as well?

lxc cluster list responds with correct output now.


(Stéphane Graber) #14

Yeah, a reload on the other nodes should make them come back online, at which point you’ll only have that 4th server in bad shape and you should be able to remove it.


#15

All nodes responding normally now, I appreciate your help! =)


(Free Ekanayaka) #16

@stgraber as per the initial clustering design, if we detect a version mismatch among nodes, we don’t allow the cluster to start or operate. At the moment there is no exemption to this rule, so lxc cluster list and lxc cluster remove are not available. We surely need some way to resolve situations like this, so we’ll have to address the issue somehow. Implementation-wise it might be a bit tricky to do that through lxc cluster remove; perhaps we should treat this use case the same way as other disaster-recovery scenarios that we’ll need tooling and stories for: you shut down your cluster and perform some offline state change.


(Stéphane Graber) #17

Yeah, I guess it could be something to add to the lxd sub-command for emergency cluster rework. I’ll add that to the issue.

@CyrusTheVirusG https://github.com/lxc/lxd/issues/5550


#18

Why not allow the cluster nodes that are updated to start, though?

So long as quorum is achieved, anyhow.

When the out-of-date node tries to participate in the cluster, reject the request with a "needs update" response, and perhaps even trigger a refresh.

On the lxc end, when a user issues a command on the out-of-date node, maybe add a message indicating that the node needs to be updated, shown when the "needs update" response is received or even when the node is selected as a target from a remote node.

If a node is removed from the cluster while offline and later comes back online, send a response indicating it is no longer a cluster member. On the lxc end, perhaps prompt the user to either lxd init the node or rejoin the cluster if it is of a compatible version; if it is not a compatible version, indicate that the node needs to be updated before it can rejoin.

Obviously, if you remove a node that had containers running on it, those containers would need to be retargeted across the rest of the nodes in a distributed manner. I didn’t have any containers running on the down node, so I’m unsure whether this already happens.

I suppose that would only apply with shared storage. Without shared storage, if/when the down node comes back online, convert it to a standalone LXD instance so that the containers stored locally are not lost.

Additionally, when lxc commands are issued while the cluster is in a waiting state (unless you decide to implement the above), perhaps return an lxc response indicating "Waiting for all nodes to upgrade schema…" so we aren’t left wondering what is happening with our cluster, short of killing all container instances to restart in debug mode.

Along that same line, maybe allow toggling debug mode while LXD is running, e.g. lxc --enable-local-debug and lxc --disable-local-debug, or lxc --toggle-local-debug.

Alright, done editing for now =). The above should increase the stability, responsiveness, and availability of clusters. One of these days I will get around to learning Go so I can actually contribute; I’m stuck deep in Python and C++ projects for now.