What upgrade policy is advised for clustered hypervisors?

Hi. I have two LXD hypervisors in a cluster, running a few containers on ZFS storage. Both hypervisors run Debian 10, and I would like to upgrade them to Debian 11. On both, lxc and lxd are installed via snap, at version 4.18. Both systems have 300+ days of uptime, as do the lxd services (so even though the packages are at 4.18, I am not sure which version the running services are).

My question is somewhat general: what upgrade policy would you follow to safely upgrade both hypervisors to Debian 11? Must I upgrade both hypervisors at the same time, or can I move all the containers to one hypervisor, upgrade the other, then move all the containers to the upgraded hypervisor, and finally upgrade the remaining one?

I was thinking of:

  1. Make application-level backups of the containers. Store them somewhere safe.
  2. Do snapshots of the containers.
  3. Restart all the containers, to be sure they can restart.
  4. Restart the lxd services on both hypervisors, to be sure they can restart.
  5. Evacuate hypervisor A from the cluster. Then reboot it, to be sure it can.
  6. Plug back hypervisor A in the cluster, then evacuate hypervisor B, and then reboot it, to be sure it can.
  7. Plug hypervisor B back in. Grab a coffee.
  8. Evacuate hypervisor A. Upgrade Debian. Reboot.
  9. Plug back hypervisor A. Evacuate hypervisor B. Upgrade Debian. Reboot.
  10. Plug back hypervisor B.
  • Does this seem reasonable to you?
  • What can go wrong in this process?
  • What would you do to prevent things from going wrong?
  • What documentation did I miss?

Thank you for your help.

Since LXD itself won’t change version and only the underlying OS will, you can update in whatever order you want and don’t need all servers to be on the same version.

As far as LXD is concerned, it will just see differing kernel versions, and that's perfectly fine.

One thing to be careful with on the newer Debian is that it uses cgroup2 by default, so older containers may have a hard time booting (Ubuntu 16.04, for example, won't; I suspect CentOS 7 and the like will similarly fail). You can minimize that problem by forcing your Debian 11 systems to keep running in cgroup1 mode.

If lxc info shows you're running LXD 4.18, then LXD itself restarted recently and you'll be fine on that front. Trying to restart the containers may indeed be a good idea, though as mentioned above, cgroup2 on Debian 11 may unfortunately make them fail after the OS upgrade.

Testing the evacuation ahead of time is probably a good plan too. In my case I do it weekly as I apply kernel updates to my servers, so I know it works; but if you have never done it before, it's a good idea to try it and make sure the containers actually get evacuated and not just stopped.
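For reference, the evacuate/restore cycle above maps onto the cluster commands introduced in LXD 4.17; a sketch, where hypervisor-a is a placeholder for your member name:

```
# Move all instances off the member and mark it as evacuated
lxc cluster evacuate hypervisor-a

# ... reboot or upgrade hypervisor-a, then bring its instances back:
lxc cluster restore hypervisor-a

# Check member state (EVACUATED vs ONLINE) at any point
lxc cluster list
```

Whether instances are migrated or merely stopped during evacuation depends on each instance's cluster.evacuate configuration, which is worth checking before relying on it.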


Thanks for your answer.

Good news. It seems it is already 4.18.

lxc info | grep server_version
server_version: "4.18"

Also it seems cgroup2 is already there:

lxc info | grep cgroup
cgroup2: "true"

That indicates whether liblxc supports cgroup2; it doesn't tell you what your host uses. For that, look at /sys/fs/cgroup and what it contains.

  • A pure cgroup1 setup has per-controller directories and no unified directory.
  • A hybrid setup (the most common) has the cgroup1 controller directories plus a unified directory.
  • A pure cgroup2 setup has only a unified tree (with content similar to that of the unified directory).
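A quick way to classify the layout is to check for the marker entries described above; a minimal sketch (the function name is mine, and it takes an optional directory so you can point it elsewhere for testing):

```shell
#!/bin/sh
# Classify a cgroup mount point as pure cgroup2, hybrid, or pure cgroup1.
# Defaults to /sys/fs/cgroup; pass another directory to inspect instead.
cgroup_mode() {
    dir="${1:-/sys/fs/cgroup}"
    if [ -f "$dir/cgroup.controllers" ]; then
        # cgroup.controllers at the top level only exists on a unified mount
        echo "pure cgroup2"
    elif [ -d "$dir/unified" ]; then
        # cgroup1 controllers alongside a unified directory
        echo "hybrid (cgroup1 + unified)"
    elif [ -d "$dir/memory" ]; then
        # per-controller directories, no unified tree
        echo "pure cgroup1"
    else
        echo "unknown"
    fi
}

cgroup_mode "$@"
```

Run on the host it should print one of the three modes; on your output below it would report hybrid.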

My understanding is that Debian 11 uses pure cgroup2 by default, though maybe not for systems upgraded in place. In any case, it's something to keep an eye on.

Setting systemd.unified_cgroup_hierarchy=false on the kernel command line should force systemd back into a hybrid cgroup1 mode.
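On Debian that kernel parameter is usually set through GRUB; a config sketch (the quiet option shown is just Debian's stock default, keep whatever your line already contains):

```
# /etc/default/grub — append the option to the existing line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=false"
```

Then regenerate the GRUB config with sudo update-grub and reboot for it to take effect.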


I have the unified directory:

% ls /sys/fs/cgroup
blkio  cpu  cpuacct  cpu,cpuacct  cpuset  devices  freezer  memory  net_cls  net_cls,net_prio  net_prio  perf_event  pids  rdma  systemd  unified

And I also have some lxc.monitor.XXXX and lxc.payload.XXXX directories in almost every /sys/fs/cgroup subdirectory, where XXXX are the container names. Does this mean I have a hybrid setup, as you thought?

I am curious: the daemon can be restarted without restarting the containers? I thought systemctl restart lxd would restart all the containers.

sudo systemctl reload snap.lxd.daemon will restart just the daemon and not the instances.

How can I know what kind of container will have trouble? All of my containers are Alpine 3.14 (so quite recent). Are there checks that can be done to know for sure?

OK, so this is why it is a good idea to preemptively restart the containers.

IIRC, for systemd-based distributions such as Debian or CentOS, systemd has supported cgroups v2 since version 226, so that could be a first indicator. For Alpine, I honestly don't know, since it uses OpenRC as its init system.
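For a rough survey of which containers report a systemd version at all, something like this sketch could help (it only uses standard lxc and systemctl invocations; non-systemd containers such as Alpine/OpenRC simply won't answer):

```
# Print the init version for each container; systemd-based containers
# report "systemd NNN ...", others fall back to the placeholder text.
for c in $(lxc list -c n --format csv); do
    v=$(lxc exec "$c" -- systemctl --version 2>/dev/null | head -n 1)
    echo "$c: ${v:-no systemd (likely OpenRC or another init)}"
done
```

Any container reporting systemd older than 226 (or no systemd at all) is worth test-booting on a cgroup2 host before the real upgrade.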