I have a cluster of 4 nodes running Incus 6.11, installed from the Zabbly repo via apt on Debian (Raspberry Pi OS on RPi 5s).
One node has inadvertently upgraded itself to Incus 6.12 and now Incus won’t start on that node, reporting: “Wait for other cluster nodes to upgrade their versions, cluster not started yet” (which has broken a whole bunch of things on the network). All the other nodes now show a status of “Blocked” when I do an “incus cluster list”, so it’s unclear to me what’s safe to do to try and fix the problem.
Is there anyone who can tell me:
If I reboot one of the 6.11 nodes, will Incus restart on that node, or complain about versions?
How many nodes do I need to upgrade to 6.12 before the cluster will restart?
If I revert the 6.12 node to 6.11, will it restart, and if so, what’s the best way to do this?
How “safe” is it to just “apt upgrade” all my nodes? (Seeing one upgraded node fail to start is making me very nervous.)
Having the operating system effectively break the cluster simply by doing an update seems like undesirable behavior. Is there a way to stop Incus from applying schema updates, or anything else that might disable one node, unless or until all nodes are ready to update themselves?
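Or is the only real option to pin the package on each node until we’re ready to move everything together? Something along these lines is what I have in mind (untested, and assuming the Zabbly package is simply called incus):

```bash
# Pin the current Incus version so a routine "apt upgrade" (or an
# unattended-upgrades run) can't bump this node on its own
sudo apt-mark hold incus

# Later, when every node is ready to move together
sudo apt-mark unhold incus
sudo apt update && sudo apt install --only-upgrade incus
```

(If unattended-upgrades is in play, I assume incus would also need adding to the Package-Blacklist in /etc/apt/apt.conf.d/50unattended-upgrades.)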
Upgrading the Incus version in a cluster requires upgrading every node in the cluster. Just upgrade the incus package on all the remaining nodes.
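Something like this on each of the remaining 6.11 nodes should be all that’s needed (assuming the Zabbly package is simply called incus); the daemons should rejoin on their own once every member is on 6.12:

```bash
# On each remaining node, upgrade just the Incus package
sudo apt update
sudo apt install --only-upgrade incus

# Then confirm the member has come back
incus cluster list
```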
Hi, thanks for that. I was sort of aware that the intent was to upgrade them all at the same time; in this instance it’s a production cluster and I didn’t want to risk an upgrade while the backup cluster was already down due to upgrades.
I’ve had quite a lot of problems getting clustering to a point where I can use it without constantly being in fear of it falling over on me … the idea of upgrading all nodes at the same time isn’t all that appealing.
… it would be great if there were an incremental upgrade path … in this case one “test” node upgraded itself by mistake, and now it’s fully functional while everything else is blocked. I’d kinda like to nuke that node and put everything else back where it was, but I guess I should probably just speed up bringing the backup cluster back online.
Since LXC or Incus is the orchestrator for the containers and VMs it manages, the upgrade procedure usually does not impact any instances that are currently running. No instance will be stopped at that time, and a restart of the cluster node(s) is not required after such upgrades.
However, you might lose control of the Incus cluster until at least 3 nodes with the same Incus version are active.
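As a rough illustration (not tested against your exact setup), after upgrading each node you can simply keep an eye on the member states until the “Blocked” entries come back online:

```bash
# Re-check the cluster after each node is upgraded; once enough members
# run the same version, the blocked members should return to an online state
watch -n 5 incus cluster list
```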
Absolutely; however, it’s the word “usually”, combined with a production environment, that’s causing me the worry. I’ve been developing a maintainable deployment for Incus, OVS/OVN, IC etc. over the last few months and have become painfully aware of how easy it is to completely bork an entire cluster. While this is mostly down to human error, saying it’s “my fault” isn’t going to help me if I end up with a borked production cluster.
99 times out of 100, “apt upgrade” is quite safe … one time in 100 I could end up at a grub screen. Case in point: the machine that upgraded itself installed a standard kernel, and I lost my geneve module, which isn’t built into the standard Raspberry Pi OS kernel.
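(For anyone else who hits this: a quick sanity check I could have run after the kernel change, assuming the module is simply named geneve, would have been something like the following.)

```bash
# Check whether the running kernel provides the geneve module at all
modinfo geneve

# Or try to load it; this fails on a kernel built without it
sudo modprobe geneve && lsmod | grep geneve
```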
Anyway, my ONE upgraded node is currently broken: “apt upgrade” won’t complete because incus won’t restart, and it won’t restart until all the other nodes upgrade their version of incus. So I upgraded one node, and it broke itself and blocked the cluster.
So … to fix both the broken upgraded node and the blocked cluster, I have to run the upgrade (which failed on one node) on all the other nodes. Hence my dilemma (?)
My solution in this sort of instance is to migrate everything to an already upgraded and stable cluster, but it would be less resource-intensive if I could upgrade and fix one node at a time. Having one node (in any context) block or break the cluster contradicts the resilience-based reasons I want a cluster in the first place.
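For reference, this is the sort of migration I mean (a rough sketch; the remote and instance names are made up, and it assumes the source cluster is responsive enough to serve the API and the backup cluster is reachable as a remote):

```bash
# Point this client at the backup cluster (one-off)
incus remote add backup https://backup-cluster.example:8443

# Copy or move instances across; stopped instances are the simplest case,
# and --refresh can re-sync a copy that already exists on the target
incus copy web01 backup:web01 --refresh
incus move db01 backup:db01
```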