RAID-1 for boot drive

I love the idea, but in the real world a server is often installed on /dev/md/XXX - and as far as I can tell, IncusOS is unable to cope with that. Some flexibility beyond "one drive fits all" is really needed. It's not only RAID-1; I usually use RAID-10 on two drives as it's faster to read from.

That's not been our experience so far across large fleets of production servers.

When ordering a large number of servers, the OS drive these days tends to be either a single low-capacity, high-endurance NVMe SSD or a hardware RAID-1 (Dell BOSS, for example) of two small NVMe drives.

Software RAID-1 for a boot drive on a modern server doesn't actually work all that well: reliably replicating the ESP and getting the firmware to correctly fall back to the second drive on failure is extremely unreliable.

For our particular situation we have the extra complexity of the tooling in use (systemd-repart and the LUKS+TPM stack), which makes secure very-early-boot re-assembly of software RAID-1 pretty challenging.

The local zpool created on the boot drive should be reasonably easy to allow extending to another drive, whether as RAID-1 or just RAID-0. That's because it's the only bit of storage which we actually get to bring up after we have access to the system configuration. The restriction preventing additional devices from being added to this pool is pretty artificial and something we could lift easily enough.

I must disagree; on Debian, keeping the EFI partition in sync is not demanding at all:

update-initramfs -u
update-grub
umount /boot/efi
dd if=/dev/nvme0n1p1 of=/dev/nvme1n1p1
mount /boot/efi

This has never failed me, not even once across hundreds of servers, and gives me a reliable EFI-based system where either drive can fail completely at any time.

But of course I would love to see your alternative scenario involving IncusOS where I could lose a drive without consequences. IncusOS would be more convenient than plain Debian.

Right, that’s the normal approach on a mutable system.

This still has a few issues though. Replicating the ESP from one drive to the other doesn't provide any real guarantee that the system will still boot when a drive dies.

First, you'll need to make sure that you always have two NVRAM entries: basically Boot0001 booting from the shim.efi on the first disk and Boot0002 booting from the shim.efi on the second disk. That can be done with the right efibootmgr calls and a properly functional EFI firmware (not all machines handle this correctly).
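For reference, the two entries can be created along these lines. This is a hedged sketch only: the disk paths, partition numbers, and labels are illustrative, and the resulting Boot000X numbers depend on the firmware.

```shell
# Create one NVRAM entry per drive, each pointing at that drive's copy of shim.
# Device paths and partition numbers here are examples, not a tested recipe.
efibootmgr --create --disk /dev/nvme0n1 --part 1 \
    --label "Linux (drive 1)" --loader '\EFI\BOOT\shimx64.efi'
efibootmgr --create --disk /dev/nvme1n1 --part 1 \
    --label "Linux (drive 2)" --loader '\EFI\BOOT\shimx64.efi'

# Have the firmware try drive 1 first, then drive 2 (entry numbers will vary).
efibootmgr --bootorder 0001,0002
```

Even with both entries in place, whether the firmware actually moves on to the second entry on failure is entirely up to its implementation, which is the unreliability described above.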

But then for the tricky part. Drives have an annoying tendency to rarely die outright. Instead, they'll hang around on the bus but throw a bunch of errors when read from. Your firmware isn't very smart (and really can't be): if it finds the GPT partition listed in the NVRAM, can mount the FAT filesystem on it, and finds something that looks like an executable, it will attempt to load and execute it. If the drive then hangs or throws errors, you'll be stuck there and won't fall back to your second drive.

That’s for the problems with this approach on a mutable system.
On immutable you get some extra fun:

  • You cannot generate a new initrd or change the kernel parameters, both are part of an immutable, signed UKI
  • You cannot modify the EFI boot configuration (add/remove entries) as that would lead to unexpected PCR measurements and would fail the boot
  • You cannot override the boot order as that would similarly lead to differing PCR values and fail to decrypt the drive

So in a secure immutable world, making something like this work would require:

  • Code to replicate the ESP after every update (pretty easy to do, though potentially slow and will cause wear as we’d need to replicate the whole 2GB every time and would invalidate the discard data in the process)
  • Code to manage the EFI NVRAM config, ensure Boot0001 and Boot0002 are correct and make sure that the pre-calculated TPM state accounts for those values and ideally also accounts for alternate values where either of the boot entries disappear (if the drive goes away)
  • Code to always replicate the A/B immutable partitions across both devices
  • Changes to udev to be able to handle all the duplicate partitions that this will cause. We rely entirely on GPT partition types and labels for auto-detection, so having two versions of each will make a mess. Lennart mentioned some recent work in udev to help with that though.
  • Changes to systemd-repart to handle the creation of a RAID-1 on top of two LUKS devices (one per drive), or creation of LUKS on top of a RAID-1 partition config
  • Changes to dracut to handle picking one of the two copies of the usr partitions (not freak out because of all the duplication) AND have a way to re-assemble the RAID-1+LUKS+ext4 root partition WITHOUT any way to provide configuration for it (again, we don’t have unencrypted persistent storage we can use)

It’s a fair bit of work and not something we’re likely to be putting our own resources towards, but I’ll be sure to ask Lennart about RAID-1 in repart next time I bump into him as that’s the main piece that can’t in theory be worked around with a bunch of shell scripts.

If the systemd-repart stuff gets done, then in theory Debian 14 would ship with the needed logic, so we may be able to add support for this in a couple of years.


Anyway, our recommendation is to use a high-endurance, low-capacity drive for the OS (I typically go with 256GB); this holds the OS itself and OS data, as well as the initial local pool. The local pool in this scenario should basically only be used to hold the image and backup volumes for Incus, not actual instances.

You then create an actual storage pool for the instance data using higher-capacity drives (I typically do 4TB) and put those in a suitable replication layout (mirror if just two drives; raidz2 or raidz3 if more).
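As an illustration of that layout, the data pool could be created along these lines. Sketch only: the device paths and the pool name "data" are assumptions, and on IncusOS itself pool creation would go through its API rather than raw zpool commands.

```shell
# Two data drives: a simple mirror (survives one drive failure).
zpool create data mirror /dev/nvme1n1 /dev/nvme2n1

# Four or more data drives: raidz2 survives any two drive failures.
zpool create data raidz2 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
```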

In that scenario, your boot drive doesn't store anything mission-critical, as you can always reinstall, re-import the actual storage pool, then re-import the instances and volumes from it.
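On a plain Incus setup, that recovery path would look roughly like this (a sketch; the pool name is assumed, and `incus admin recover` is an interactive command that walks you through rediscovering pools and instances):

```shell
# Import the surviving data pool on the freshly reinstalled system.
zpool import data

# Have Incus rediscover the instances and volumes living on that pool.
incus admin recover
```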

We also have a backup API which can be used to get a backup of the OS data so it can be restored. That’s useful if the drive needs to be replaced for some reason, or as a way to get a daily or weekly backup for emergencies.


Thanks a lot, @stgraber - I was not aware of the complexity involved in the modern secure EFI boot process. Your explanation was very useful, and I am starting to understand the IncusOS approach.

What about dedicated servers with 2 HDDs? I would like to use RAID1 in this case: 80G for the OS and 1.9T for the ZFS storage. I'm currently doing RAID1 with LUKS encryption on OVH servers. This works fine with Ubuntu, but I would also like to have it with IncusOS.

(OS partition is not encrypted, only the ZFS partitions)

Or do I need to keep using other distros? Will you continue to support this, or is the intention that everything moves over to IncusOS?

We have filed Add "local" ZFS pool RAID configuration options · Issue #532 · lxc/incus-os · GitHub to track relaxing the restrictions on the local storage pool.

We have no plan to stop supporting Incus as normal distro packages, so that’s always going to be an option.

We basically want to always have 3 ways to deploy Incus:

  • Manually by installing traditional distro packages
  • Automatically by using incus-deploy (Ansible playbook) still using traditional distro packages
  • By running IncusOS with its included Incus version

The crowd for each of the three options is very different and we want to cater to all of them.


For anyone following this topic, we have now implemented support for mirroring or extending the ZFS partition on the boot drive.

It’s not a full RAID1 of the entire set of system partitions but it will let you do RAID1 or RAID0 of the local ZFS pool which stores all Incus instances.

When setting up RAID1 with another drive of the same size, we reserve the first 35GB or so of that drive for an eventual full RAID1 of the remaining partitions (future-proofing).


I got IncusOS working on an OVH dedicated server with 2 × 960GB NVMe drives.
As specified in the docs, the OS is installed on only one NVMe, along with the local pool.
I edited the storage config to extend the local pool to the second NVMe, but it's been more than an hour and the storage config shown hasn't changed: the second NVMe doesn't appear in the pool and the type stayed zfs-raid0.

  1. Is it possible to get info about the ZFS resilvering process?
  2. How long should it take for drives of that size, considering there is nothing on the local pool (no instance has been created yet)?

@gibmat

Maybe look at incus admin os debug log

Can you share the configuration you sent IncusOS to expand the "local" pool to RAID-1? The YAML parser is kind of annoying in that it happily ignores invalid fields without raising an error, so if you submit valid YAML that doesn't match the expected struct, you won't see any changes.

For an empty pool, the resilvering should only take a second or two. We are working on adding nicer support for checking the status of a ZFS pool and its resilvering status (Perform ZFS scrubbing · Issue #624 · lxc/incus-os · GitHub).
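Until that lands, on systems where a shell is available, the standard ZFS status command reports pool health and resilver progress, e.g.:

```shell
# Shows pool state and, during a resilver, a progress/ETA line per device.
zpool status -v local
```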

Here is the edited config I saved:

storage config

config: {}
state:
  drives:
  - boot: false
    bus: nvme
    capacity_in_bytes: 9.60197124096e+11
    id: /dev/disk/by-id/nvme-SAMSUNG_MZQLB960HAJR-00007_S437NA0MB00139
    model_family: ""
    model_name: SAMSUNG MZQLB960HAJR-00007
    remote: false
    removable: false
    serial_number: S437NA0MB00139
    smart:
      enabled: true
      passed: true
  - boot: true
    bus: nvme
    capacity_in_bytes: 9.60197124096e+11
    id: /dev/disk/by-id/nvme-SAMSUNG_MZQLB960HAJR-00007_S437NA0MB00607
    member_pool: local
    model_family: ""
    model_name: SAMSUNG MZQLB960HAJR-00007
    remote: false
    removable: false
    serial_number: S437NA0MB00607
    smart:
      enabled: true
      passed: true
  pools:
  - devices:
    - /dev/disk/by-id/nvme-SAMSUNG_MZQLB960HAJR-00007_S437NA0MB00607-part11
    - /dev/disk/by-id/nvme-SAMSUNG_MZQLB960HAJR-00007_S437NA0MB00139
    encryption_key_status: available
    name: local
    pool_allocated_space_in_bytes: 9.63661824e+08
    raw_pool_size_in_bytes: 9.19123001344e+11
    state: ONLINE
    type: zfs-raid1
    usable_pool_size_in_bytes: 9.19123001344e+11
    volumes:
    - name: incus
      quota_in_bytes: 0
      usage_in_bytes: 9.6147456e+08
      use: incus
@stgraber there is nothing about the local pool, ZFS, or storage in the debug log

IncusOS ignores anything under the state property when sent back via the API. You need to submit changes using the config property:

config:
  pools:
  - name: local
    type: zfs-raid1
    devices:
    - /dev/disk/by-id/nvme-SAMSUNG_MZQLB960HAJR-00007_S437NA0MB00607-part11
    - /dev/disk/by-id/nvme-SAMSUNG_MZQLB960HAJR-00007_S437NA0MB00139

It works, thanks!

Is there a way to know which properties I should include when editing a config YAML with the incus command?

@gibmat we should probably have Storage - IncusOS documentation link to incus-os/incus-osd/api/system_storage.go at main · lxc/incus-os · GitHub

And in general do that for all the reference pages.

It's not exactly the most user-friendly source of information, but it would be better than having none, or having to rely exclusively on examples.