Critical corruption issues with LXD and ZFS

angristan · May 2, 2019, 4:42pm

A few weeks ago, I had lots of partially corrupted files on my pool. I explained everything in detail here:

I never found out how it happened in the first place, but after migrating all my containers to a new server, everything was fine for nearly 2 months.

Yesterday, I applied APT updates to all my containers and VMs, and then rebooted the hosts (without powering off the containers beforehand).

Everything seemed fine, but then a data import to Elasticsearch failed because of an Input/output error on PostgreSQL files…

And here we go again:

root@kokoro ~# zpool status -v
  pool: zpool-lxd
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h22m with 0 errors on Sun Apr 14 00:46:21 2019
config:

        NAME        STATE     READ WRITE CKSUM
        zpool-lxd   DEGRADED     0     0   173
          sda3      DEGRADED     0     0   346  too many errors

errors: Permanent errors have been detected in the following files:

        /var/snap/lxd/common/lxd/storage-pools/zpool-lxd/containers/elasticsearch/rootfs/usr/share/kibana/src/legacy/core_plugins/kibana/public/discover/components/field_chooser/lib/detail_views/string.html
        /var/snap/lxd/common/lxd/storage-pools/zpool-lxd/containers/elasticsearch/rootfs/usr/share/kibana/node_modules/@babel/core/node_modules/lodash/xorWith.js
        /var/snap/lxd/common/lxd/storage-pools/zpool-lxd/containers/elasticsearch/rootfs/usr/share/kibana/node_modules/graphql-extensions/node_modules/core-js/modules/_object-dp.js
        /var/snap/lxd/common/lxd/storage-pools/zpool-lxd/containers/postgresql/rootfs/var/lib/postgresql/11/main/base/49368/73649
        /var/snap/lxd/common/lxd/storage-pools/zpool-lxd/containers/postgresql/rootfs/var/lib/postgresql/11/main/base/49368/49505

I assume this is related to the reboot, but I’m not sure.

This is extremely worrying, and if it’s not my fault then there is a serious issue with LXD or ZFS… I don’t know what I’m going to do, I guess I’ll end up using the dir backend…

I don’t expect to recover the files, they probably only have some blocks corrupted, but I really don’t want this to happen again, so I’ll take any help on this.
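To see at a glance which containers are affected, the list of permanent errors in the `zpool status -v` output above can be grouped by container. This is only a minimal sketch: the function name `errors_per_container` is made up for illustration, and it assumes the snap LXD path layout shown above, where the container name is the path component right after `containers`.

```shell
# Count the files listed under "Permanent errors" per LXD container.
# Reads `zpool status -v` output on stdin.
# Assumption: paths look like .../containers/<name>/rootfs/..., as in the
# output above; only the first "containers" component of each path is used.
errors_per_container() {
    awk -F/ '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "containers") { count[$(i + 1)]++; break }
            }
        }
        END { for (c in count) print c ": " count[c] }
    ' | sort
}

# Usage (needs root and a real pool):
#   zpool status -v zpool-lxd | errors_per_container
```

Lines that are not paths (pool name, state, device table) contain no `containers` component and are simply ignored.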

micky · May 2, 2019, 5:35pm

Hi angristan,

What versions of the Linux kernel, ZFS and LXD are you running?
I have 7 servers in production. Some of them started with LXD 2.x and were then upgraded to 3.0.x; now all of them are running LXD 3.0.3.
All servers were installed with Linux 4.9.x and ZFS 0.6.5.9 (the Debian Stretch versions), then upgraded to Linux 4.19.16 and ZFS 0.7.12-1 from backports. I’ve never had file loss or corruption.
I upgraded to ZFS 0.7.12 only because the server would sometimes freeze when a massive number of files were open, most often during MySQL backups. That forced me to power-cycle the server, and even after cutting the power I never lost or corrupted a file.
All servers have at least 2 hard disks, some 3 or 4.

I’m using Debian Stretch, so I’ve packaged LXD by hand, copying most of debian/* from Ubuntu’s LXD package.

angristan · May 2, 2019, 5:46pm

LXD 3.12, kernel 4.15.0-48-generic (on Ubuntu 18.04) and whatever version of ZFS is bundled with the snap, which is probably 0.7.5.

This is a VM with a single disk. I know a single disk does not help, but still, something is wrong.

stgraber · May 3, 2019, 4:37am

The first step would be to run a zpool scrub to have ZFS scan the entire pool to detect any other problem with it.

These kinds of errors usually don’t occur because of a ZFS bug but rather because of an issue with the underlying block device, in this case the virtual disk of your VM or possibly the physical disk backing it. A hard shutdown might have caused data which ZFS thought was synced to disk to get lost within the hypervisor, or something happened on the physical drive.

In any case, your best bet at this point is zpool scrub to see exactly what the damage is like. If you are the operator of the VM host, it’d likely be good to know more about the storage setup at that level, what’s backing those VMs and the hypervisor configuration with regard to I/O.
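As a rough sketch of that step: `zpool scrub` only starts the scrub and returns immediately, so the wrapper below polls `zpool status` until the scan line no longer reports a scrub in progress, then prints the verbose status. The function name `scrub_and_report` and the 60-second default polling interval are arbitrary choices for illustration.

```shell
# Start a scrub on the given pool, wait for it to finish, then print the
# verbose status (including any files with permanent errors).
# Usage: scrub_and_report <pool> [poll_interval_seconds]
scrub_and_report() {
    local pool="$1" interval="${2:-60}"

    zpool scrub "$pool" || return 1

    # The scan line reads "scrub in progress since ..." while a scrub runs.
    while zpool status "$pool" | grep -q 'scrub in progress'; do
        sleep "$interval"
    done

    zpool status -v "$pool"
}

# Usage (needs root and a real pool):
#   scrub_and_report zpool-lxd
```

If the CKSUM counters keep growing between scrubs on data that is never rewritten, that points at the layers below ZFS rather than at the pool contents.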

Libvirt settings like unsafe I/O or even some of the writeback settings do cause I/O buffering in the hypervisor; a crash of the hypervisor or the physical host will cause all that accumulated data to get lost. Filesystems in your VM normally assume that flushing data to disk with a sync call means the data at that given point in time was persistently written. If that’s not the case, bad things can happen.
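For someone who does operate the hypervisor, the cache mode of each disk can be read from the libvirt domain XML, where it appears as the `cache` attribute of the disk's `<driver>` element. A small sketch, with `list_disk_cache_modes` being a made-up helper name; disks whose `<driver>` element has no `cache` attribute use the hypervisor default and are not listed.

```shell
# Print the cache mode of every disk <driver> element that sets one.
# Reads libvirt domain XML on stdin (e.g. from `virsh dumpxml`).
list_disk_cache_modes() {
    grep -o "cache=['\"][a-z]*['\"]" | tr -d "'\"" | cut -d= -f2
}

# Usage on the hypervisor ("mydomain" is a placeholder domain name):
#   virsh dumpxml mydomain | list_disk_cache_modes
```

A value of `unsafe` means flush requests from the guest are ignored, which is exactly the situation described above.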

If this was a physical system rather than a VM, I’d also have recommended running on RAID1 to keep yourself away from potential disk issues, but given you’re using a VM, RAID in the VM is unlikely to help as all disks are likely to be provided from the same source with the same hypervisor configuration.

angristan · May 3, 2019, 7:30am

Thanks for your answer!

I agree, although I’ve never had to hard shut down or hard reboot this VM. I rebooted it with the reboot command, which I assume cleanly kills all the processes, including those inside the containers.

I’ve run countless scrubs; sometimes new errors appear, sometimes they go away… But I know what is corrupted, I would like to understand why.

Indeed. Last time, since errors kept on appearing, I told my hosting provider (Hetzner Cloud) about the problem and asked if they had noticed anything abnormal about the disk on the hypervisor, and they answered that everything was fine.

The thing is, some of the corrupted files are files that are never written to (e.g. Kibana’s node modules).
I guess that I should be extra careful when shutting down the VM, ensuring that everything is synced to disk and that every container is shut down. But the system should handle that for me though, shouldn’t it?

From sync(8):

sync should be called before the processor is halted in an unusual manner (e.g., before causing a kernel panic when debugging new kernel code). In general, the processor should be halted using the shutdown(8) or reboot(8) or halt(8) commands, which will attempt to put the system in a quiescent state before calling sync(2).
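A belt-and-braces version of "being extra careful" could look like the sketch below: stop every container through LXD, flush with `sync`, and only then reboot. The function name `clean_reboot` and the 120-second timeout are arbitrary choices; a normal reboot is supposed to do the equivalent on its own, so this is a precaution and not a known fix.

```shell
# Stop all LXD containers cleanly, flush dirty data, then reboot.
# Aborts without rebooting if the containers could not be stopped.
clean_reboot() {
    # --timeout: seconds to wait for each container before it is killed.
    lxc stop --all --timeout 120 || return 1
    sync
    reboot
}

# Usage (as root on the LXD host):
#   clean_reboot
```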

CyrusTheVirusG · May 3, 2019, 6:30pm

Avoid ZFS like the plague, always ends in tears when I try to use it.

Ceph has been very reliable; I would use btrfs if that isn’t an option, though.

Xaoc · May 3, 2019, 8:06pm

I totally disagree with what CyrusTheVirusG said. ZFS is perfectly stable, and much more mature and reliable than btrfs, but that is not the point here… ZFS can cause problems like yours when it is not set up correctly. If you would like to use ZFS, better spend some time reading its manual.

So from what I understand, you have no idea what hardware is backing your VM and you are just using a disk attached to it. This disk can be, and probably is, a LUN with some RAID array behind it that ZFS has no knowledge of. ZFS is then tricked into believing that all data has been flushed to the physical device while it is actually still sitting in the RAID cache or the hypervisor cache, as Stephane mentioned, and this is a potential reason for your data loss. Anyway, the ZFS documentation clearly states which hardware configurations to avoid and why… probably your HW config is one of those.

Also, it is pointless to use ZFS with a single disk, so I would suggest switching to the LVM backend of LXD rather than the dir backend. You also said that a “reboot” was used… this does not shut down your applications properly and, depending on your apps, this can also cause data loss.

And last… You could try to recover your PostgreSQL databases by using WAL.

angristan · May 4, 2019, 10:53am

I asked the ZFS subreddit about my setup, and there are some interesting answers and insights:

CyrusTheVirusG · May 5, 2019, 6:03am

You disagree that it always ends in tears when I try to use it?? It does, and I read the documentation. I was all for it even after I tried the default LXD configuration that sets it up as a loop device and lost test containers, which prompted the documentation reading.

It’s probably the most unreliable filesystem I’ve ever used. Never before have I lost data to a filesystem; to a bad disk, sure. To accidentally running rm * -rf in the wrong directory, yeah. But the only thing it’s good for is the “My filesystem ate my homework” excuse IMO, and if you totally disagree, why have him switch to LVM if ZFS is great? I’m not a fan of btrfs either, but I know ZFS will screw me eventually and it should not be anywhere near the default settings for LXD, but for some reason it is/was.

allquixotic · May 6, 2019, 1:52pm

I’ve been using ZFS with LXD since it came out with Ubuntu 16.04, and I’ve never had any problems. I have 8 containers, 3 kvm VMs (one Windows, one Hackintosh, one Linux) and 2 x 4 TB HDDs running Ubuntu 18.04 host on a big iron box with 56 cores and 512 GB RAM. The zpool is in mirror mode for total 4 TB usable storage and I boot from ZFS root in UEFI mode. Each of my VMs and containers has its own zvol.

Since I installed this system, I’ve seamlessly upgraded the host from Ubuntu 16.04 to 18.04, moved from a smaller 128 GB / 4 core system to this much bigger hardware (using zfs send to transfer the data over ssh) and I’ve never lost a single bit of data. I run zpool scrub every 1-3 months and it never finds any incorrect data on either drive, even after a kernel panic that happened about a week ago (the first one I’ve ever gotten) that appears to have been an unrelated bug in kvm.
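For reference, a transfer of that kind can be sketched roughly as follows: take a recursive snapshot, then pipe a replication stream through ssh into `zfs receive` on the new machine. The function name `migrate_dataset`, the snapshot name and the host are placeholders, and the sketch assumes the target pool already exists and that the receiving user is allowed to run `zfs receive`.

```shell
# Snapshot a dataset (and its children) and replicate it to another host.
# Usage: migrate_dataset <dataset> <snapshot_name> <target_host>
migrate_dataset() {
    local dataset="$1" snap="$2" host="$3"

    # -r: also snapshot all descendant datasets and zvols.
    zfs snapshot -r "${dataset}@${snap}" || return 1

    # send -R: replication stream with descendants, snapshots and properties.
    # receive -u: do not mount on the target; -F: roll back the target first.
    zfs send -R "${dataset}@${snap}" | ssh "$host" zfs receive -u -F "$dataset"
}

# Usage (dataset, snapshot name and host are made up):
#   migrate_dataset tank/lxd migrate-1 root@newhost
```

Note that `-F` discards changes on the receiving dataset, so it should only point at a target that is meant to be overwritten.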

I run several real-time multiplayer dedicated servers and numerous web servers and streaming media servers on this box, so it’s pretty busy with lots of diverse workloads. I’ve definitely done dumb things when learning and configuring the box, but I’ve never lost data. This strongly smells of YDIW.

angristan · May 6, 2019, 2:01pm

I guess something is wrong with my setup and Hetzner…

bodleytunes · May 6, 2019, 2:13pm

I use Hetzner bare metal and a couple of Hetzner cloud servers. The cloud servers have two block devices attached and I’m using ZFS; I’ve never had any issues with ZFS, whether mirrored or single disk. There is a valid use for single-disk ZFS in that it’s very easy to migrate your drive to a remote host using zfs send / receive and syncoid. The snapshots before changes are also useful.

I would say it seems to be the most robust FS I’ve ever used. Then again, I don’t have much experience with heavy database usage on there, but I do have Elasticsearch running on a setup where I work, though it’s only been running a month or so. In that particular case, it’s running in a VM hosted on ESXi.

Most of my other stuff is running ZFS on bare metal and I present the block devices for use by ZFS, or, when the RAID card doesn’t allow JBOD, I RAID0 each disk and give that to ZFS (possibly bad practice but it works okay so far). I’m not a storage guy, I’m more Networks.

Cheers!
Jon.

CyrusTheVirusG · May 6, 2019, 6:24pm

4x 2TB storage disk, 2x1TB SSD Log device and 2x1TB SSD Cache devices was my old configuration, maybe the log and cache is where I went wrong.

I wish you luck on your single point of failure.

New setup is 4x1TB SSDs on 4 hosts dedicated to Ceph, also running Iron Systems servers. These don’t handle containers at all.

TomvB · May 8, 2019, 4:41am

Same here. 3 bare-metal servers without ECC memory. 0 issues with ZFS mirrors.

bodleytunes · May 28, 2019, 7:11pm

No SPOF as using ZFS raid1 mirror

TomvB · May 29, 2019, 2:18pm

@stgraber ZFS on Linux 0.8.0 is ready and contains a lot of fixes. https://github.com/zfsonlinux/zfs/milestone/12

Is 0.8.0 available / coming to Ubuntu 18.04?

stgraber · May 29, 2019, 5:19pm

It may come to Ubuntu 18.04 as a PPA and possibly through a kernel update in 6-8 months once 19.10 is released and its kernel is backported but it won’t just be pushed as a bugfix release.

VERIFIorentino_2018 · May 30, 2019, 9:48am

I would take the response from any cloud provider with a pinch of salt. (obviously everyone has different experiences).

It’s clear when you say

sometimes new errors appear, sometimes they go away

From my understanding, systematic errors are caused by software and random errors by hardware.

In my limited experience with 2 different European providers, it took escalating the issue and weeks of telephone calls to get problems solved. They asked about our setup, and when we explained that we had an unusual setup with ZFS/GPU/HPC computing, they immediately blamed it. The usual explanation is that you need to fix your stuff and all is OK on their side. Weeks later, we got a nice mail from a manager apologizing that it had been an unusual hardware error, along with some coupon codes. (We later moved on-prem, as it was affecting our research.)

Of course best would be to try the same setup at some other provider and see if there are failures.

petergloor · June 15, 2019, 6:22pm

@angristan : I’ve just read your blog post about “Setup a ZFS pool for your LXC containers with LXD”.

Regarding your issues, are you using the default loop device for your ZFS pool or are you using a partition of your virtual “real device”? Did you ever try the loop device?

I’m currently thinking about a mirror configured from two volumes, but I’m afraid that this will be too slow.

Peter

Skaperen · June 16, 2019, 7:47pm

Except for it happening on data that was never written to, it sounds like a virtual device is not propagating the final sync to the host without the poweroff. But I can’t say PostgreSQL is leaving such data alone, especially if it is in the same table.