Btrfs-transacti using 100% CPU and preventing Container starts

JC-Mac · April 11, 2019, 11:47am

Hi,

I am looking for advice on how to recover my system. btrfs-transacti uses 100% CPU, and containers and lxc commands are unresponsive. I can access the host, although reboot and shutdown don’t work (they hang with a “soft lockup” error). I have experienced this issue in the past (the cause is a snapshot of a container running home-assistant), and waiting it out worked: after a few hours the process would finish and I would be able to work normally again. This time, I have left the process running for 24 hrs with no success.
System is Ubuntu 17.04 with a mixed disk structure (LVM raid array for OS, XFS raid 1 for some media storage, and 2 disks formatted with Btrfs). This was set up while learning (quite a while ago), so I can’t confirm the storage setup is proper; the goal was to have the containers on a btrfs storage pool. My fstab mounts a UUID at /media/btrfs, which is where I think I pointed the storage pool.
I’ve hard reset the system. When it boots, btrfs-transacti starts again on its own. The lxc list command works and shows all the containers as stopped. If I try to start one, the command hangs, and I can no longer run any lxc commands.

Any thoughts on how to get my containers working again? Thanks in advance!

Some log info:

dmesg:
[ 13.856265] Btrfs loaded, crc32c=crc32c-generic
[ 27.140411] blk_update_request: I/O error, dev fd0, sector 0
[ 27.144318] floppy: error -5 while reading block 0
[ 27.355362] BTRFS: device fsid 416c708e-381b-45a9-85a3-f8461fb16e26 devid 2 transid 1329485 /dev/sdd
[ 27.360963] BTRFS: device fsid 416c708e-381b-45a9-85a3-f8461fb16e26 devid 1 transid 1329485 /dev/sdc
…

[ 40.790528] BTRFS info (device sdc): disk space caching is enabled
[ 40.790530] BTRFS info (device sdc): has skinny extents

…
[ 103.503035] BTRFS info (device sdc): The free space cache file (16135487488) is invalid. skip it

[ 504.322564] perf: interrupt took too long (2520 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
[ 675.721644] perf: interrupt took too long (3151 > 3150), lowering kernel.perf_event_max_sample_rate to 63250
[ 972.851764] perf: interrupt took too long (3948 > 3938), lowering kernel.perf_event_max_sample_rate to 50500
[ 1593.358776] perf: interrupt took too long (4936 > 4935), lowering kernel.perf_event_max_sample_rate to 40500

ps fauxx

root 2194 0.0 0.0 0 0 ? S Apr10 0:01 _ [btrfs-cleaner]
root 2195 98.9 0.0 0 0 ? R Apr10 647:10 _ [btrfs-transacti]
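
For anyone who wants to check for the same symptom, here is a minimal sketch that scans `ps aux`-style output for btrfs kernel threads using a lot of CPU. The function name `busy_btrfs_threads` and the 50% threshold are illustrative assumptions, not part of any tool; the sample lines are the two shown above.

```python
import re

# Matches a btrfs kernel-thread name such as [btrfs-transacti] in the COMMAND column.
BTRFS_THREAD = re.compile(r"\[(btrfs-[^\]]+)\]")


def busy_btrfs_threads(ps_output, min_cpu=50.0):
    """Return (name, pid, cpu_percent) for btrfs kernel threads at or above min_cpu.

    Assumes the `ps aux` column order: USER PID %CPU ... COMMAND.
    """
    busy = []
    for line in ps_output.splitlines():
        fields = line.split()
        match = BTRFS_THREAD.search(line)
        if match is None or len(fields) < 3:
            continue
        try:
            pid, cpu = int(fields[1]), float(fields[2])
        except ValueError:
            continue  # header row or malformed line
        if cpu >= min_cpu:
            busy.append((match.group(1), pid, cpu))
    return busy


# The two lines from the post above.
SAMPLE = (
    "root 2194 0.0 0.0 0 0 ? S Apr10 0:01 _ [btrfs-cleaner]\n"
    "root 2195 98.9 0.0 0 0 ? R Apr10 647:10 _ [btrfs-transacti]\n"
)

if __name__ == "__main__":
    for name, pid, cpu in busy_btrfs_threads(SAMPLE):
        print(f"{name} (pid {pid}) is using {cpu}% CPU")
```

On a live system, the input could come from `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout` instead of the sample string.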

stgraber · April 11, 2019, 2:31pm

Sounds like a kernel bug. You may want to try running a btrfs scrub but given the stuck transaction, I’m not sure that even that will succeed.

gpatel-fr · April 11, 2019, 2:37pm

I don’t think this is an lxd problem as such; it looks more like a btrfs one. Btrfs tools such as btrfs scrub and btrfs check seem to be the way to go.
btrfs scrub can be run online, but with the disk activity you are seeing, it does not seem realistic to expect it to finish in a reasonable amount of time.
I’d say that, given your problem, you will need to go straight to btrfs check.
Unfortunately, I don’t know of any way to unmount a btrfs filesystem mounted for lxd without stopping lxd. Ideally, if a storage pool was bad, you could stop all containers running off it and somehow deactivate the storage pool, but that does not seem possible for now (and if you have only one storage pool, it doesn’t apply anyway, of course).
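
The online/offline distinction above can be sketched as a small helper that picks the command matching the filesystem's current state. This is only an illustration: the helper name `pick_btrfs_tool` is made up, `/media/btrfs` and `/dev/sdc` are taken from the first post, and the commands need root and are only built here, not executed.

```python
import os


def pick_btrfs_tool(mountpoint, device, is_mounted=os.path.ismount):
    """Return the btrfs command (as an argument list) that fits the mount state."""
    if is_mounted(mountpoint):
        # Online: scrub runs against a mounted filesystem.
        # -B keeps it in the foreground and prints statistics when it finishes.
        return ["btrfs", "scrub", "start", "-B", mountpoint]
    # Offline: check expects the filesystem to be unmounted.
    # --readonly inspects without writing any repairs.
    return ["btrfs", "check", "--readonly", device]


if __name__ == "__main__":
    # Paths assumed from the original post; adjust for your own system.
    print(" ".join(pick_btrfs_tool("/media/btrfs", "/dev/sdc")))
```

Printing the command rather than running it is deliberate: on a filesystem with a stuck transaction, either tool may hang, so it seems safer to review the command first.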

JC-Mac · April 21, 2019, 12:02pm

Thank you both for the insights. It seems to be a btrfs issue, since it persisted across reboots with lxd stopped, and it even managed to lock up live instances running from USB sticks.
Interestingly, on an Ubuntu live instance, the thread maxing out the CPU was not btrfs-transacti but btrfs-cleaner. This prompted me to read a bug report about quotas more carefully and eventually, in the 30-second window after reboot before the process started, to run “btrfs quota disable”. I think this eventually solved the problem. I’m not 100% sure, because I also “scrubbed”, added a 3rd disk to the btrfs “array”, and tried to wipe out the excessively large (10 GB) Home-Assistant db file.
Thanks Again!
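
For readers facing the same short window after boot, here is a rough sketch of how the quota step could be automated: wait for the mount point to appear, then run `btrfs quota disable` on it straight away. The function names, timeout and polling interval are hypothetical, `/media/btrfs` is the mount point mentioned in the first post, and the script would need to run as root. As noted above, it is not certain that disabling quotas alone was the fix.

```python
import os
import subprocess
import time


def quota_disable_cmd(mountpoint):
    """Build the argument list for `btrfs quota disable <mountpoint>`."""
    return ["btrfs", "quota", "disable", mountpoint]


def disable_quota_when_mounted(mountpoint, timeout=60.0, poll=0.5,
                               run=subprocess.run, is_mounted=os.path.ismount):
    """Poll until mountpoint is mounted, then disable quotas once.

    Returns True if the command was issued, False if the timeout expired first.
    `run` and `is_mounted` are injectable so the logic can be tried without root.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_mounted(mountpoint):
            run(quota_disable_cmd(mountpoint), check=True)
            return True
        time.sleep(poll)
    return False


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    print(" ".join(quota_disable_cmd("/media/btrfs")))
```

Calling `disable_quota_when_mounted("/media/btrfs")` as root early in boot would issue the command as soon as the filesystem is mounted.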

mikesi · April 22, 2019, 5:48pm

I had a few issues with BTRFS and the machine locking up. Try upgrading the kernel to the latest version you can.