Reference Architecture for Storage: BTRFS & Hardware RAID

Hi folks!

Looking for thoughts on best practice here. How would you set up the storage system for an R730XD with an H330 RAID card and lots of disks?

Here is where we are leaning:

  1. VirtualDisk1: 3-drive RAID 1 (2 + hot spare) for the host OS - ext4
  2. VirtualDisk2: 11-drive RAID 10 (10 + hot spare) for containers, left unformatted during the host install … leaving lxd init to format the raw block device to BTRFS (see the sketch after this list)
  • Any difference between the above and a single big virtual disk with 2 partitions?
  • Will BTRFS scrub be able to clean and fix filesystem issues when it only sees 1 disk (and not multiple disks in BTRFS RAID 1)?
  • Any way to (safely!) mount the BTRFS VDisk2 on the host so we can see it (to check df, for example)? … or does lxd need to completely own that thing and keep it hidden in its namespace?
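
For clarity, here is roughly what we have in mind for step 2 once the host is installed: a minimal sketch (Python wrapping the lxc CLI), assuming the RAID 10 virtual disk shows up as /dev/sdb and the pool gets called "containers" (both are placeholders, not our real names).

```python
#!/usr/bin/env python3
"""Hypothetical sketch: hand the raw H330 virtual disk to LXD as a btrfs pool.

Assumes the 11-drive RAID 10 virtual disk appears as /dev/sdb and that the
pool name "containers" is free; both are placeholders, adjust for your host.
"""
import subprocess

DEVICE = "/dev/sdb"   # second virtual disk, left unformatted at install time
POOL = "containers"   # placeholder pool name


def run(*args: str) -> str:
    """Run a command and return its stdout, raising on failure."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout


# Equivalent to answering "existing block device" during `lxd init`:
# LXD formats the device as btrfs and owns it from here on.
run("lxc", "storage", "create", POOL, "btrfs", f"source={DEVICE}")

# Sanity check: show the pool definition LXD recorded.
print(run("lxc", "storage", "show", POOL))
```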

Background:

We have been running lxd/lxc for about 3-4 years in a small biz environment (~8 hosts and ~25 containers). Our start was a little disorganized, so we have a few different underlying disk configs, and we would like to get things cleaned up.
We use a single storage Pool so our containers are easily portable between hosts.
We have had 1 failure of a BTRFS filesystem that was not recoverable … 2 or 3 serious issues that we were able to recover from … and are fighting another issue at the moment.
We use 2 mosaic instances at different sites as the core of our backup and disaster recovery process.

Thanks for any insight you can share!

That setup should be fine; using two separate RAID arrays works just fine. You technically “waste” one more drive, as a unified array with a single hot spare would have let you consume an additional drive (though not in RAID 10, as you need an even number of drives).

As the RAID is handled in hardware, the OS shouldn’t really care about any of it and LXD will definitely be happy to consume that block device for btrfs.

As for filesystem-level repair, since btrfs will only see a single drive, it won’t be able to do any repairing itself. Instead you’ll be entirely reliant on your hardware RAID solution doing a good job at not corrupting data. That’s the main reason many have moved away from hardware RAID: filesystem-level data replication tends to be much more flexible when recovering from failures, not to mention much cheaper to rebuild after a drive failure, as only the used space needs replicating instead of the entire drive.
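
That said, scrub is still worth running on a single-device pool: it can detect checksum errors even though it has nothing to rebuild the data from. A rough sketch of what that check could look like, assuming the filesystem is mounted at /mnt (a placeholder path):

```python
#!/usr/bin/env python3
"""Sketch: use scrub plus device stats to *detect* corruption on a
single-device btrfs pool (it can flag bad data, but not rebuild it).

Assumes the filesystem is mounted at /mnt, which is a placeholder path.
"""
import subprocess

MOUNTPOINT = "/mnt"  # placeholder: wherever the btrfs filesystem is mounted

# -B keeps scrub in the foreground so its exit code reflects the result.
scrub = subprocess.run(
    ["btrfs", "scrub", "start", "-B", MOUNTPOINT],
    capture_output=True, text=True,
)
print(scrub.stdout)

# --check makes the command exit non-zero if any error counter is non-zero.
stats = subprocess.run(
    ["btrfs", "device", "stats", "--check", MOUNTPOINT],
    capture_output=True, text=True,
)
print(stats.stdout)

if scrub.returncode != 0 or stats.returncode != 0:
    print("Errors detected: with a single device there is no second copy "
          "of the data for btrfs to repair from.")
```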

It’s safe to mount /dev/sdb on /mnt or something to inspect what’s going on. The kernel will detect that it’s the same block device being mounted twice and will effectively convert the second mount into something similar to a bind mount.
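
Something along these lines is all that inspection mount amounts to; /dev/sdb and /mnt are placeholders, and it needs to run as root:

```python
#!/usr/bin/env python3
"""Sketch of the inspection mount described above (run as root).

/dev/sdb and /mnt are placeholders for the LXD-owned virtual disk and a
spare mount point; the kernel treats this second mount much like a bind mount.
"""
import subprocess

DEVICE = "/dev/sdb"   # placeholder for the btrfs virtual disk
MOUNTPOINT = "/mnt"   # placeholder inspection mount point

subprocess.run(["mount", DEVICE, MOUNTPOINT], check=True)
try:
    # Space usage as btrfs itself accounts for it (the "df" question above).
    subprocess.run(["btrfs", "filesystem", "df", MOUNTPOINT], check=True)
    subprocess.run(["btrfs", "filesystem", "usage", MOUNTPOINT], check=True)
finally:
    # Drop the inspection mount; LXD's own mount is unaffected.
    subprocess.run(["umount", MOUNTPOINT], check=True)
```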

Hi Stephane!

Thanks for the quick reply! Very helpful.
Since we’ve now had multiple issues with btrfs, our confidence is a little shaken. The server we are currently having issues with has no underlying disk issues (HW RAID), yet a clean power-off has left us unable to mount the btrfs partition read-write. (We can mount with the recover,ro option and the key container directories seem fine, so there is a path to recovery, but…)

Would we be better off (albeit with worse performance) leaving the 11 disks for the containers as JBOD and letting btrfs handle the RAID?

Thanks!

In theory, yes, though I’d recommend checking for any known issues with btrfs RAID on the kernel version you’re using. Btrfs RAID has had a number of issues in the past (though mostly around RAID 5/6 setups), so you’ll want to make sure that the kernel you’re using has solid btrfs RAID support.
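
If you do go the JBOD route, the rough shape of it would be something like the sketch below. The device names and mount point are placeholders, and keep in mind that btrfs has no automatic hot-spare handling, so a failed disk means a manual btrfs replace. Whether LXD will accept an already-mounted btrfs filesystem as a pool source depends on your LXD version, so treat that last step as an assumption and check the storage documentation first.

```python
#!/usr/bin/env python3
"""Rough sketch of the JBOD alternative: mkfs.btrfs builds the RAID 10.

All device names and paths are placeholders for illustration only.
"""
import os
import platform
import subprocess

DEVICES = [f"/dev/sd{letter}" for letter in "bcdefghijkl"]  # 11 hypothetical disks
MOUNTPOINT = "/mnt/containers"                              # hypothetical mount point

# The kernel version matters a lot for btrfs RAID, per the advice above.
print("Running kernel:", platform.release())

# RAID 10 for both data and metadata, striped and mirrored across all 11 devices.
subprocess.run(["mkfs.btrfs", "-d", "raid10", "-m", "raid10", *DEVICES], check=True)

os.makedirs(MOUNTPOINT, exist_ok=True)
subprocess.run(["mount", DEVICES[0], MOUNTPOINT], check=True)

# Assumption: recent LXD releases can take an existing btrfs path as the pool
# source, e.g. `lxc storage create containers btrfs source=/mnt/containers`.
# Verify against the storage driver docs for your LXD release before using.
```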