Failed initializing storage pool

Hi,

I had a power outage that damaged the motherboard and RAID controller of my server. Both have been replaced, but I now have filesystem issues that prevent me from starting LXD.

EROR[09-02|13:52:08] Failed to start the daemon: Failed initializing storage pool “default”: Failed to mount “/dev/loop10” on “/var/snap/lxd/common/lxd/storage-pools/default” using “btrfs”: input/output error

I ran a btrfs check and repair:

user@lxd:~$ sudo btrfs check --force --readonly --progress /dev/loop10
Opening filesystem to check…
parent transid verify failed on 680919613440 wanted 1279753 found 1279731
parent transid verify failed on 680919613440 wanted 1279753 found 1279731
parent transid verify failed on 680919613440 wanted 1279753 found 1279731
Ignoring transid failure
Checking filesystem on /dev/loop10
UUID: 2e657720-4b74-4d85-8aa8-82e0d397fd2d
[1/7] checking root items (0:01:20 elapsed, 38005611 items checked)
[2/7] checking extents (0:06:18 elapsed, 2379598 items checked)
[3/7] checking free space cache (0:00:14 elapsed, 2981 items checked)
root 34451 inode 72602 errors 100, file extent discount
Found file extent holes:
start: 32552779776, len: 720896
start: 32560250880, len: 2293760
start: 32592232448, len: 65536
start: 32598851584, len: 65536
start: 32600621056, len: 196608
start: 32606650368, len: 196608
start: 32609796096, len: 131072
start: 32622182400, len: 5308416
start: 32627687424, len: 4390912
start: 32649445376, len: 131072
start: 32650493952, len: 65536
[4/7] checking fs roots (0:16:45 elapsed, 1935698 items checked)
ERROR: errors found in fs roots
found 2882307887865 bytes used, error(s) found
total csum bytes: 2757097344
total tree bytes: 38886555648
total fs tree bytes: 31900598272
total extent tree bytes: 3243638784
btree space waste bytes: 7146346875
file data blocks allocated: 71490025865216
referenced 4766790057984

user@lxd:~$ sudo btrfs check --force --readonly --progress --clear-space-cache v1 /dev/loop10
[sudo] password for user:
Opening filesystem to check…
parent transid verify failed on 680919613440 wanted 1279753 found 1279731
parent transid verify failed on 680919613440 wanted 1279753 found 1279731
parent transid verify failed on 680919613440 wanted 1279753 found 1279731
Ignoring transid failure
Checking filesystem on /dev/loop10
UUID: 2e657720-4b74-4d85-8aa8-82e0d397fd2d
Free space cache cleared

user@lxd:~$ sudo btrfs check --force --repair --progress /dev/loop10
enabling repair mode
Opening filesystem to check…
parent transid verify failed on 680919613440 wanted 1284109 found 1279731
parent transid verify failed on 680919613440 wanted 1284109 found 1279731
parent transid verify failed on 680919613440 wanted 1284109 found 1279731
Ignoring transid failure
Checking filesystem on /dev/loop10
UUID: 2e657720-4b74-4d85-8aa8-82e0d397fd2d
repair mode will force to clear out log tree, are you sure? [y/N]: y
[1/7] checking root items (0:00:08 elapsed, 38003965 items checked)
Fixed 0 roots.
No device size related problem found (0:03:34 elapsed, 2379555 items checked)
[2/7] checking extents (0:03:35 elapsed, 2379555 items checked)
[3/7] checking free space cache (0:00:00 elapsed, 4356 items checked)
Fixed discount file extents for inode: 72602 in root: 34451
warning line 3805 roots (0:21:59 elapsed, 2694819 items checked)
[4/7] checking fs roots (0:21:59 elapsed, 2697111 items checked)
[5/7] checking csums (without verifying data) (0:00:55 elapsed, 8717591 items checked)
[6/7] checking root refs (0:00:00 elapsed, 6242 items checked)
Recowing metadata block 680919613440
ERROR: fails to fix transid errors
[7/7] checking quota groups skipped (not enabled on this FS)
found 2881524105985 bytes used, error(s) found
total csum bytes: 2757097344
total tree bytes: 38885851136
total fs tree bytes: 31900598272
total extent tree bytes: 3243753472
btree space waste bytes: 7146550540
file data blocks allocated: 71489245200384
referenced 4766009393152
extent buffer leak: start 680919613440 len 16384

Then I tried starting the service again and got a different error:

Error: Failed to start dqlite server: raft_start(): io: closed segment 0000000000039600-0000000000039684 is past last snapshot snapshot-1-38912-2073756252

I renamed the file 0000000000039600-0000000000039684 and then got this error:

Error: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000039600-0000000000039732: entries batch 28 starting at byte 268680: entries count in preamble is zero

So I also renamed 0000000000039600-0000000000039732:

root@lxd:/var/snap/lxd/common/lxd/database/global/

1532 -rw------- 1 root root 1568768 Sep 2 15:00 0000000000039600-0000000000039684.moved

1532 -rw------- 1 root root 1564968 Sep 2 15:00 0000000000039600-0000000000039732.moved

Now I get:

Error: Failed initializing storage pool “default”: Failed to mount “/dev/loop10” on “/var/snap/lxd/common/lxd/storage-pools/default” using “btrfs”: invalid argument

I checked the storage image file:
sudo btrfs check /var/snap/lxd/common/lxd/disks/default.img
Opening filesystem to check…
parent transid verify failed on 369479548928 wanted 1279751 found 1284111
parent transid verify failed on 369479548928 wanted 1279751 found 1284111
parent transid verify failed on 369479548928 wanted 1279751 found 1284111
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=184647680 item=55 parent level=1 child level=1
ERROR: failed to read block groups: Input/output error
ERROR: cannot open file system

Now I am stuck.
Any help is appreciated.

I’d recommend making a copy of default.img in case you haven’t already.
It may also be a good idea to check if you’re getting any SMART errors from the drive as you may be dealing with a hardware issue here.
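
A minimal sketch of those two precautions. The image path is the one from the error above; the backup directory /mnt/backup and the device /dev/sda are placeholders I made up, so substitute your own. The run helper and the EXECUTE variable are also just part of this sketch: by default it only prints the commands, and EXECUTE=1 makes it run them. smartctl comes from the smartmontools package.

```shell
#!/bin/sh
# Sketch only: the backup directory and device name are assumptions.
IMG="${IMG:-/var/snap/lxd/common/lxd/disks/default.img}"
BACKUP_DIR="${BACKUP_DIR:-/mnt/backup}"   # hypothetical destination with enough free space
DISK="${DISK:-/dev/sda}"                  # hypothetical disk that holds the image

# Print each command by default; set EXECUTE=1 to really run it.
run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

backup_and_check() {
    # Copy the pool image, keeping sparse regions sparse (GNU cp).
    run cp --sparse=always "$IMG" "$BACKUP_DIR/default.img.bak"
    # Overall SMART health verdict, then the full attribute and error report.
    run smartctl -H "$DISK"
    run smartctl -a "$DISK"
}

backup_and_check
```

Once the printed commands look right for your system, run it again as root with EXECUTE=1.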

Ideally, what I’d do is transfer default.img to another system, then over there, run btrfs check to scan and repair any damage. Then manually mount it and look for damage. If things look mostly good, I’d use btrfs send and btrfs receive to transfer the subvolumes to a new clean btrfs filesystem.

Doing it that way should let you bypass any remaining hardware issue and, should it succeed, will give you a new btrfs filesystem that you can trust to be consistent.
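
Roughly, that workflow on the second machine could look like the sketch below. All paths and the subvolume name containers/c1 are made up for illustration, and the new filesystem is assumed to be already created and mounted at /mnt/new-pool. Note that btrfs send only accepts read-only subvolumes, so the sketch takes a read-only snapshot first, which needs the old pool mounted writable. As a safety net it only prints the commands unless EXECUTE=1 is set (the run helper and EXECUTE are part of this sketch, not of btrfs).

```shell
#!/bin/sh
# Sketch only: every path and the subvolume name are assumptions.
IMG="${IMG:-/srv/recovery/default.img}"   # hypothetical copy of the damaged image
OLD="${OLD:-/mnt/old-pool}"               # hypothetical mount point for the old pool
NEW="${NEW:-/mnt/new-pool}"               # hypothetical mount point of a fresh btrfs filesystem
SUBVOL="${SUBVOL:-containers/c1}"         # hypothetical subvolume to transfer

# Print each command by default; set EXECUTE=1 to really run it.
run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

inspect_pool() {
    # Check without modifying anything first.
    run btrfs check --readonly "$IMG"
    # Mount through a loop device and list what is in there.
    run mount -o loop "$IMG" "$OLD"
    run btrfs subvolume list "$OLD"
}

send_subvolume() {
    snap="$OLD/$SUBVOL.ro"
    # btrfs send requires a read-only subvolume, so snapshot it read-only.
    run btrfs subvolume snapshot -r "$OLD/$SUBVOL" "$snap"
    # Stream the snapshot into the new filesystem; it arrives under its
    # snapshot name (here c1.ro) and is read-only on the receiving side.
    if [ "${EXECUTE:-0}" = "1" ]; then
        btrfs send "$snap" | btrfs receive "$NEW/"
    else
        printf '+ %s\n' "btrfs send $snap | btrfs receive $NEW/"
    fi
}

inspect_pool
send_subvolume
```

Repeat send_subvolume for each subvolume that still reads cleanly; a subvolume with damaged data will typically make btrfs send fail partway, which tells you which ones to skip.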

I had to use btrfsck --init-extent-tree default.img to be able to access the data again. I will now try to transfer the data. Thanks!

I reinstalled my OS and LXD and have a clean btrfs pool. I also have my old pool mounted so I can access most containers (some are corrupt). How can I reimport them into the new installation/db?

(I copied a container folder to the new system and tried lxd recover, but this doesn’t seem to be the right tool.)

So do you have a new default.img with the recovered data in it?

I just copied one container folder into it as a test. It takes a very long time to copy all the containers, so I need to be sure it can be done.

Ok, then you should be able to put your new default.img at /var/snap/lxd/common/lxd/disks/default.img, then run lxd recover, select the option to define a new storage pool, enter default as name, btrfs as type and /var/snap/lxd/common/lxd/disks/default.img as the source.

This should be enough for the recovery logic to load the pool and detect its content.
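
As a rough sketch of those steps: the location of the recovered image (/srv/recovery/default.img) is a placeholder, and the run helper and EXECUTE variable are just part of this sketch so that nothing is executed unless you set EXECUTE=1. lxd recover itself is interactive, so the answers from above are listed as comments rather than scripted; the exact prompt wording may differ between LXD versions.

```shell
#!/bin/sh
# Sketch only: the source path of the recovered image is an assumption.
RECOVERED_IMG="${RECOVERED_IMG:-/srv/recovery/default.img}"   # hypothetical
POOL_IMG="${POOL_IMG:-/var/snap/lxd/common/lxd/disks/default.img}"

# Print each command by default; set EXECUTE=1 to really run it.
run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

recover_pool() {
    # Put the recovered image where LXD expects the pool to live.
    run cp --sparse=always "$RECOVERED_IMG" "$POOL_IMG"
    # Interactive: choose to define a new storage pool, then answer
    #   name:   default
    #   type:   btrfs
    #   source: /var/snap/lxd/common/lxd/disks/default.img
    run lxd recover
    # Afterwards, see which instances were picked up.
    run lxc list
}

recover_pool
```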