I have a cluster of 4 nodes running Incus 6.11, installed from the Zabbly repo via apt on Debian (Raspberry Pi OS on RPi 5s).
One node has inadvertently upgraded itself to Incus 6.12 and now Incus won’t start on that node, reporting: “Wait for other cluster nodes to upgrade their versions, cluster not started yet” (which has broken a whole bunch of things on the network). All the other nodes now show a status of “Blocked” when I do an “incus cluster list”, so it’s unclear to me what’s safe to do to try and fix the problem.
Is there anyone who can tell me:
If I reboot one of the 6.11 nodes, will Incus restart on that node, or complain about versions?
How many nodes do I need to upgrade to 6.12 before the cluster will restart?
If I revert the 6.12 node to 6.11, will it restart, and if so, what’s the best way to do this?
How “safe” is it to just “apt upgrade” all my nodes? (Seeing one upgraded node fail to start is making me very nervous.)
Having the operating system effectively break the cluster simply by doing an update seems like undesirable behavior. Is there a way to stop Incus from applying schema updates, or anything else that might disable one node, unless or until all nodes are ready to update themselves?
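Or is the only real option to pin the package on each node until we’re ready to move everything together? Something along these lines is what I have in mind (untested, and assuming the Zabbly package is simply called incus):

```bash
# Pin the current Incus version so a routine "apt upgrade" (or an
# unattended-upgrades run) can't bump this node on its own
sudo apt-mark hold incus

# Later, when every node is ready to move together
sudo apt-mark unhold incus
sudo apt update && sudo apt install --only-upgrade incus
```

(If unattended-upgrades is in play, I assume incus would also need adding to the Package-Blacklist in /etc/apt/apt.conf.d/50unattended-upgrades.)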
Upgrading the Incus version in a cluster requires upgrading every node in the cluster. Just upgrade the incus package on all the remaining nodes.
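Something like this on each of the remaining 6.11 nodes should be all that’s needed (assuming the Zabbly package is simply called incus); the daemons should rejoin on their own once every member is on 6.12:

```bash
# On each remaining node, upgrade just the Incus package
sudo apt update
sudo apt install --only-upgrade incus

# Then confirm the member has come back
incus cluster list
```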
Hi, thanks for that. I was sort of aware that the intent was to upgrade them all at the same time; in this instance it’s a production cluster and I didn’t want to risk an upgrade while the backup cluster was already down due to upgrades.
I’ve had quite a lot of problems getting clustering to a point where I can use it without constantly being in fear of it falling over on me … the idea of upgrading all nodes at the same time isn’t all that appealing.
… it would be great if there were an incremental upgrade path … in this case one “test” node upgraded itself by mistake, and now it’s fully functional while everything else is blocked. I’d kinda like to nuke that node and put everything else back where it was, but I guess I should probably just speed up bringing the backup cluster back online.
Since LXC or Incus is the orchestrator for the containers and VMs it manages, the upgrade procedure usually does not impact any instances that are currently running. No instance will be stopped at that time, and a restart of the cluster node(s) is not required after such upgrades.
However, you might lose control of the Incus cluster until at least 3 nodes with the same Incus version are active.
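As a rough illustration (not tested against your exact setup), after upgrading each node you can simply keep an eye on the member states until the “Blocked” entries come back online:

```bash
# Re-check the cluster after each node is upgraded; once enough members
# run the same version, the blocked members should return to an online state
watch -n 5 incus cluster list
```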
Absolutely; however, it’s the word “usually”, combined with a production environment, that’s causing me the worry. I’ve been developing a maintainable deployment for Incus, OVS/OVN, IC etc. over the last few months and have become painfully aware of how easy it is to completely bork an entire cluster. While this is mostly down to human error, saying it’s “my fault” isn’t going to help me if I end up with a borked production cluster.
99 times out of 100, “apt upgrade” is quite safe … one time in 100 I could end up at a grub screen. Case in point: the machine that upgraded itself installed a standard kernel, and I lost my geneve module, which isn’t built into the standard Raspberry Pi OS kernel.
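(For anyone else who hits this: a quick sanity check I could have run after the kernel change, assuming the module is simply named geneve, would have been something like the following.)

```bash
# Check whether the running kernel provides the geneve module at all
modinfo geneve

# Or try to load it; this fails on a kernel built without it
sudo modprobe geneve && lsmod | grep geneve
```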
Anyway, my ONE upgraded node is currently broken: “apt upgrade” won’t complete because incus won’t restart, and it won’t restart until all the other nodes upgrade their version of incus. So I upgraded one node, and it broke itself and blocked the cluster.
So … to fix both the broken upgraded node and the blocked cluster, I have to run the upgrade (which failed on one node) on all the other nodes. Hence my dilemma (?)
My solution in this sort of instance is to migrate everything to an already upgraded and stable cluster, but it would be less resource-intensive if I could upgrade and fix one node at a time. Having one node (in any context) block or break the cluster contradicts the resilience-based reasons I want a cluster in the first place.
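For reference, this is the sort of migration I mean (a rough sketch; the remote and instance names are made up, and it assumes the source cluster is responsive enough to serve the API and the backup cluster is reachable as a remote):

```bash
# Point this client at the backup cluster (one-off)
incus remote add backup https://backup-cluster.example:8443

# Copy or move instances across; stopped instances are the simplest case,
# and --refresh can re-sync a copy that already exists on the target
incus copy web01 backup:web01 --refresh
incus move db01 backup:db01
```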