Guide for upgrading?

Is there a guide for upgrading existing setups between minor releases, e.g., 4.20 to 4.21? In particular, something that would work with the snap on Ubuntu. I seem to recall something being mentioned somewhere, but I’ve looked around and cannot find anything.

Thanks

The nearest thing we have is:

So, is the expectation that, given a 4.20 install, “snap refresh lxd --channel=4.21” should just work? That is, the contents of common/ will be picked up without any modifications or updates required?

If that is the case, then I did just that, after bringing the whole cluster down and updating the 3 dedicated nodes that take care of the database. But things did not come up - just “waiting”. So I’ve been fiddling, trying to get back to 4.20 and then retrying 4.21. No success so far.

Snaps do auto-upgrade, yes, but with the feature releases you often cannot downgrade, as there can be schema changes that prevent it. Only the LTS releases come with a commitment of no schema changes, which means you can always downgrade.
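
For reference, a quick way to see which channels exist (a sketch; the exact track list depends on what the lxd snap publishes):

snap info lxd   # lists the available tracks, e.g. latest/stable and the 4.0 LTS track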

You should switch back to the latest/stable channel (currently 4.21) and then get some logs from all cluster members so we can see what is wrong.

sudo journalctl -u snap.lxd.daemon -n 300
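
If you have SSH access to the members, something along these lines can collect the logs in one go (the hostnames are placeholders):

for host in node1 node2 node3; do
    ssh "$host" sudo journalctl -u snap.lxd.daemon -n 300 > "lxd-$host.log"
done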

I ended up forcing the three nodes that take care of the database onto 4.21. Things seem to be working a little better, but the system is not operational. The logs report a TLS handshake error for one of the nodes.

Is it possible to generate a new certificate and populate the database(s) (global and/or local) with it by hand?

Can I update the databases by hand and remove references to old nodes that I’ve not yet updated to 4.21?

All members must be on the same snap revision (not just the DB members). They will hang on startup, waiting for the others to upgrade. Can you confirm they are all upgraded? We also need to see the logs as requested. Thanks.
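
A quick way to confirm that (a sketch) is to run this on every member and compare:

snap list lxd   # the Rev column must match on all members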

Right. I will reinstall all nodes at 4.21 once I get the DB nodes updated and working. At the moment, though, those DB nodes reference non-DB nodes that I want to remove from service but cannot, because lxc does not get any response. My goal is to get the DB nodes up and responding to lxc calls, and at this point I don’t see any way to do that other than modifying the DB.

I don’t really follow what you mean by “DB nodes” - all LXD cluster members are potential DB members, and all take part in raft. So they all need to be running the same revision, otherwise none of them will operate.
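
For what it’s worth, when the cluster is healthy you can see which members currently hold the database role with:

lxc cluster list   # the ROLES column shows e.g. database, database-standby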

You cannot modify the database if the cluster is not running.

If you let me know what is in the log files I asked for, I may be able to help further.

In short, there are only 3 nodes from the whole cluster that I want to keep. Keeping just those 3, I am having problems.

This is what is reported for one of those nodes (the others are fine):

http: TLS handshake error from 10.0.0.27:36790: remote error: tls: bad certificate

But you must bring the whole cluster up to the same revision before removing members; otherwise the remaining members won’t come online.

@stgraber @mbordere can you advise on this one? It sounds like the user has shut down members of the cluster but wants to manually remove them without upgrading them (perhaps they’ve gone already?). Is this a job for lxd cluster edit?

Normally, no: lxd cluster edit and lxd cluster remove-raft-node are used to modify the dqlite state. In this case you also need to modify the global database, removing all traces of the server: its instances, storage pools, networks, …
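
Before changing anything, you can inspect what the global database currently holds with a read-only query, for example:

sudo lxd sql global "SELECT id, name, address FROM nodes;"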

Your best bet is a patch.global.sql file that deletes the server from the global nodes table. That should take care of most of the records; once that’s done, you can use lxd cluster remove-raft-node to finish the job.
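
A minimal sketch of that, assuming the snap paths and a placeholder node name and address (take a backup of /var/snap/lxd/common/lxd/database first):

# the file is applied to the global database once, at next startup
sudo tee /var/snap/lxd/common/lxd/database/patch.global.sql <<'EOF'
-- 'dead-node' is a placeholder for the departed server's name
DELETE FROM nodes WHERE name = 'dead-node';
EOF
sudo systemctl restart snap.lxd.daemon
# then drop it from the raft configuration
sudo lxd cluster remove-raft-node 10.0.0.99:8443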

Though it’s not something we’d ever recommend - you’re not supposed to remove a server that way - and depending on how many servers are left and their roles prior to the upgrade, you can end up with quite a bit of config to fix up…
