Infrastructure bumpiness

stgraber · January 29, 2026, 11:13pm

Hello,

I just wanted to apologize for the periods of unavailability of our infrastructure today.

Today was a bit of a crazy infrastructure day for us as we were making quite a few changes to the physical cluster that acts as the origin for most of our websites and images. We typically rely on a network of proxies/relays in various countries for our high bandwidth services, but those are still only ever as good as the source servers they get their data from.

Today’s datacenter trip was meant to perform a lot of long-awaited maintenance as well as transition our own infrastructure to IncusOS. Essentially the goal was to:

Replace 6 older hard drives (4TB and 6TB) with newer 4TB enterprise SSDs
Replace the servers’ TPM 1.2 modules with newer 2.0 modules
Update the BIOS and BMC firmware on all servers
Replace the two older burnt out NVMe drives in each server with new enterprise NVMe drives
Re-install all the servers, moving them from Debian 12 onto IncusOS
Roll out Linstor alongside Ceph for storage

All of that happened over a period of around 5 hours, though there were obviously some unforeseen side-effects which all effectively boil down to Ceph being a bit annoying to work with when you need to replace most of its drives and reinstall most of the servers.

Ceph effectively got into a state where it blocked all new writes because a bunch of placement groups were temporarily unavailable. Those placement groups were all going to come back online once the servers were re-installed, but with all I/O blocked we hit a bunch of issues, including a chicken-and-egg problem where the servers wanted to download IncusOS components from the image server but weren’t able to.
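
For anyone wondering why a handful of missing drives can stall writes, here is a toy model of the rule involved. It is only a sketch and not Ceph's actual code: Ceph stops serving I/O for a placement group once fewer than the pool's `min_size` replicas are available. The OSD names, the PG layout and the `min_size` of 2 below are made-up values for illustration, not our real configuration.

```python
from dataclasses import dataclass


@dataclass
class PlacementGroup:
    pg_id: str
    replica_osds: list[str]  # OSDs that hold a copy of this PG (hypothetical layout)


def pg_accepts_io(pg: PlacementGroup, up_osds: set[str], min_size: int) -> bool:
    """A PG serves I/O only while at least min_size of its replicas are up."""
    live = sum(1 for osd in pg.replica_osds if osd in up_osds)
    return live >= min_size


def blocked_pgs(pgs: list[PlacementGroup], up_osds: set[str], min_size: int = 2) -> list[str]:
    """Return the IDs of the PGs that would block I/O in this toy model."""
    return [pg.pg_id for pg in pgs if not pg_accepts_io(pg, up_osds, min_size)]


# Assumed example: three PGs with three replicas each, osd.0 and osd.2 offline.
pgs = [
    PlacementGroup("1.0", ["osd.0", "osd.1", "osd.2"]),
    PlacementGroup("1.1", ["osd.1", "osd.3", "osd.4"]),
    PlacementGroup("1.2", ["osd.0", "osd.2", "osd.5"]),
]
up = {"osd.1", "osd.3", "osd.4", "osd.5"}

blocked = blocked_pgs(pgs, up)
print(blocked)
```

In this made-up layout, any write that maps to one of the blocked PGs hangs until enough replicas are back, which is roughly the situation we were stuck waiting on.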

Anyway, after a few grueling hours of waiting for Ceph to move data around between disks to get enough replicas to be once again happy with letting I/Os through, we were able to finish the work on the servers and bring everything back online.

In the near future, we’ll be able to transition some of the workloads away from Ceph and onto Linstor, giving us more storage options and hopefully a more resilient infrastructure.

There’s a bit more cleanup to be done in a subsequent visit, but for now, I’m on a plane to Brussels for FOSDEM!

(It’s no coincidence that this work happened right before flying to Europe: the datacenter happens to be right next to the airport, and a datacenter trip on its own takes quite a lot of travel time.)

paulocoghi · January 30, 2026, 4:09pm

Thank you @stgraber for your invaluable contributions to LXC and its ecosystem.

LXC wouldn’t be where it is today without all the contributions you’ve made and continue to make.

We can only thank you!

inflatador · January 31, 2026, 4:13pm

+1 to that, thanks for the hard work y’all put in on LXC/Incus, and sharing some of the lessons learned from this outage.
Have fun at FOSDEM, I wish I could’ve gone. Say hi to my coworkers (Wikimedia Foundation) if you see them!