BTRFS issues: storage-pools/btrfs empty and btrfs quota 100% while inside the VM only 48% utilized

Hello,

My setup consists of 3 nodes running the Ubuntu 20.04 HWE kernel, snap LXD 4.17 and a Ceph 15 cluster. I have had this setup for ~1.5 years now. Very happy with LXD 4.x :slight_smile:
All VMs/containers run Ubuntu 20.04 HWE with the exception of 2 which are out of scope for this issue.

Some VMs have their storage on Ceph; clustered VMs have their storage on a local btrfs disk mounted at /btrfs.

One of my VMs was having trouble: / was mounted read-only and there were I/O errors in the kernel logs, so I shut it down. Now it is unable to start.


root @ node2 # lxc start kubew2
Error: Failed to create file "/var/snap/lxd/common/lxd/virtual-machines/kubew2/backup.yaml": open /var/snap/lxd/common/lxd/virtual-machines/kubew2/backup.yaml: disk quota exceeded
Try `lxc info --show-log kubew2` for more info


root @ node2 # lxc info --show-log kubew2
Name: kubew2
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Location: node2
Created: 2021/05/24 13:53 UTC
Last Used: 2021/08/13 11:40 UTC
Error: open /var/snap/lxd/common/lxd/logs/kubew2/qemu.log: no such file or directory

It tries to write backup.yaml in /var/snap/lxd/common/lxd/virtual-machines/kubew2/ but that directory does not exist.

ls: cannot access '/var/snap/lxd/common/lxd/virtual-machines/kubew2/': No such file or directory

All symlinks in /var/snap/lxd/common/lxd/virtual-machines that lead to the btrfs storage pool are dead on all three of my nodes:

root @ node2 # ls /var/snap/lxd/common/lxd/virtual-machines -l
total 20
lrwxrwxrwx 1 root root 67 May 24 16:07 kube2 -> /var/snap/lxd/common/lxd/storage-pools/btrfs/virtual-machines/kube2
lrwxrwxrwx 1 root root 68 May 24 16:07 kubew2 -> /var/snap/lxd/common/lxd/storage-pools/btrfs/virtual-machines/kubew2
lrwxrwxrwx 1 root root 68 Jun 10 13:49 kubew5 -> /var/snap/lxd/common/lxd/storage-pools/btrfs/virtual-machines/kubew5
lrwxrwxrwx 1 root root 66 May 12 09:16 plex -> /var/snap/lxd/common/lxd/storage-pools/ceph/virtual-machines/plex
lrwxrwxrwx 1 root root 59 Jan 25  2021 smb1 -> /var/snap/lxd/common/lxd/storage-pools/ceph/containers/smb1
lrwxrwxrwx 1 root root 74 Mar  1 11:47 transmission2 -> /var/snap/lxd/common/lxd/storage-pools/ceph/virtual-machines/transmission2

They point to /var/snap/lxd/common/lxd/storage-pools/btrfs, which is empty on all 3 nodes:

root @ node2 # ls /var/snap/lxd/common/lxd/storage-pools/btrfs -la
total 8
drwx--x--x 2 root root 4096 May 24 14:38 .
drwx--x--x 5 root root 4096 May 24 14:38 ..

Luckily, the VM that refuses to start is a Kubernetes worker node that I can live without.

Disk usage inside my VM is ~42%: the disk size is 30GB and only ~12GB is used.

However, btrfs thinks otherwise. If I’m not mistaken, it believes the full 28.03GiB has been used.


root @ node2 # btrfs subvolume show /btrfs/virtual-machines/kubew2
virtual-machines/kubew2
        Name:                   kubew2
        UUID:                   a880e030-233b-d84b-9c0d-9723ce7ba096
        Parent UUID:            -
        Received UUID:          -
        Creation time:          2021-05-24 15:53:45 +0200
        Subvolume ID:           305
        Generation:             299360
        Gen at creation:        147
        Parent ID:              5
        Top level ID:           5
        Flags:                  -
        Snapshot(s):
        Quota group:            0/305
          Limit referenced:     28.03GiB
          Limit exclusive:      -
          Usage referenced:     28.03GiB
          Usage exclusive:      28.03GiB


root @ node2 # btrfs subvolume show /btrfs/virtual-machines/kubew5
virtual-machines/kubew5
        Name:                   kubew5
        UUID:                   3e718300-7a51-9546-9477-074dce34eb7d
        Parent UUID:            8c39712e-7bc6-4548-ad8b-718fe3f165e6
        Received UUID:          -
        Creation time:          2021-06-10 13:49:41 +0200
        Subvolume ID:           347
        Generation:             302163
        Gen at creation:        50161
        Parent ID:              5
        Top level ID:           5
        Flags:                  -
        Snapshot(s):
        Quota group:            0/347
          Limit referenced:     28.03GiB
          Limit exclusive:      -
          Usage referenced:     10.48GiB
          Usage exclusive:      10.48GiB

Both /btrfs/virtual-machines/kubew2/root.img and /btrfs/virtual-machines/kubew5/root.img have a size of 30000005120 bytes.

I could perhaps manually increase the quota. Other workers like kubew1 and kubew3 were created at the same time as kubew2; all three have ~13GB used when checked with df -h /. Yet their btrfs subvolume usage is reported as 22.36GiB and 22.08GiB.

root @ node1 # btrfs subvolume show /btrfs/virtual-machines/kubew1 | tail -n 4
          Limit referenced:     28.03GiB
          Limit exclusive:      -
          Usage referenced:     22.36GiB
          Usage exclusive:      22.36GiB

kubew1 ❯ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        27G   13G   15G  47% /

root @ node3 # btrfs subvolume show /btrfs/virtual-machines/kubew3 | tail -n 4
          Limit referenced:     28.03GiB
          Limit exclusive:      -
          Usage referenced:     22.08GiB
          Usage exclusive:      22.08GiB

kubew3 ❯ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        27G   15G   13G  53% /

In the end I have two questions/issues:

Why is /var/snap/lxd/common/lxd/storage-pools/btrfs empty (leaving dead symlinks elsewhere), and why is the storage usage on the btrfs storage backend way higher than the storage actually in use?

For the latter, I’m going to guess it’s the thin-provisioned QEMU disk?

I’m going to need to increase the quotas on my other VMs before they all run into this issue. I can live without 1-3 k8s workers, but if one dies the others get a higher load, download more images and fill up their storage, until it all comes down like dominoes…
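A manual bump would presumably be something along these lines (0/305 is the qgroup ID from the subvolume show output above; 40G is just an example target, not a recommendation):

root @ node2 # btrfs qgroup limit 40G 0/305 /btrfs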

I’ll not “fix” my broken kubew2 by manually increasing the quota, in case anyone wants me to do some debugging/tests.

That’s because of the use of mount namespaces. The path only looks empty from where you’re looking at it. From LXD’s point of view, it’s not empty at all.

You can look at /var/snap/lxd/common/mntns/var/snap/lxd/storage-pools/btrfs to get an idea of what LXD sees.

And yes, all virtual machines use thin provisioning (in this case a sparse file) rather than pre-allocate space.
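For example, you can compare how much space such a sparse file actually allocates against its apparent (maximum) size with du; the path here is only an illustration:

du -h /btrfs/virtual-machines/kubew2/root.img                 # space actually allocated on the btrfs filesystem
du -h --apparent-size /btrfs/virtual-machines/kubew2/root.img # full sparse file size (the VM's maximum disk size)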

That’s because of the use of mount namespaces. The path only looks empty from where you’re looking at it. From LXD’s point of view, it’s not empty at all.

Cool :slight_smile: TIL. I knew it still worked somehow because all my other VMs were able to start just fine. That dir does not exist but I found it in /var/snap/lxd/common/mntns/var/snap/lxd/common/lxd/storage-pools/btrfs/

And yes, all virtual machines use thin provisioning (in this case a sparse file) rather than pre-allocate space.

As I expected. A known “issue” with thin provisioned disks.

But right now I still have a VM that is not able to start because the btrfs quota has been hit and lxd cannot write a backup.yaml file.

root @ node2 # echo "foo" > /btrfs/virtual-machines/kubew2/bar
zsh: disk quota exceeded: /btrfs/virtual-machines/kubew2/bar

Is the subvolume quota identical to the size of the VM? There are more files in the directory besides the root.img file itself, so if that is the case you will hit the quota before you hit the “max” of your .img file.

To try and recreate the problem at hand I just created a new VM using my btrfs pool as storage. These steps are reproducible.

lxc launch images:ubuntu/focal/cloud -p default -p cloud-init --vm --storage btrfs --target node3

I noticed that when not specifying a storage size, there is no subvolume quota:

root @ node3 # lxc config show hip-bat -e
<snip>
devices:
  eth0:
    nictype: bridged
    parent: br121
    type: nic
  root:
    path: /
    pool: btrfs
    type: disk

root @ node3 # btrfs subvolume show /btrfs/virtual-machines/hip-bat | tail -n 5
        Quota group:            0/440
          Limit referenced:     -
          Limit exclusive:      -
          Usage referenced:     10.57GiB
          Usage exclusive:      8.29GiB

VM gets a standard 10GiB disk:

root@hip-bat:~# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       8.9G  954M  8.0G  11% /

Both of which are fine, since I did not specify a size.

Once I override the size to 15GB, the following happens:

root @ node3 # lxc config device set hip-bat root size=15GB
root @ node3 # lxc stop hip-bat
root @ node3 # lxc start hip-bat
root @ node3 # btrfs subvolume show /btrfs/virtual-machines/hip-bat | tail -n 5
        Quota group:            0/440
          Limit referenced:     14.06GiB
          Limit exclusive:      -
          Usage referenced:     10.57GiB
          Usage exclusive:      8.29GiB

I downloaded two 10GB.bin files in the VM; the second one caused the disk to fill up. The VM panicked and remounted / as read-only. The disk quota on btrfs has been hit and the VM is unable to start.

root @ node1 # enter hip-bat
root@hip-bat:~#
root@hip-bat:~# wget https://speed.hetzner.de/10GB.bin
--2021-08-17 07:27:39--  https://speed.hetzner.de/10GB.bin
Resolving speed.hetzner.de (speed.hetzner.de)... 2a01:4f8:0:59ed::2, 88.198.248.254
Connecting to speed.hetzner.de (speed.hetzner.de)|2a01:4f8:0:59ed::2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10485760000 (9.8G) [application/octet-stream]
Saving to: ‘10GB.bin’

10GB.bin                           100%[===============================================================>]   9.77G   152MB/s    in 71s

2021-08-17 07:28:52 (140 MB/s) - ‘10GB.bin’ saved [10485760000/10485760000]

root@hip-bat:~#
root@hip-bat:~#
root@hip-bat:~# wget https://speed.hetzner.de/10GB.bin -O foo.bin
--2021-08-17 07:29:34--  https://speed.hetzner.de/10GB.bin
Resolving speed.hetzner.de (speed.hetzner.de)... 2a01:4f8:0:59ed::2, 88.198.248.254
Connecting to speed.hetzner.de (speed.hetzner.de)|2a01:4f8:0:59ed::2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10485760000 (9.8G) [application/octet-stream]
Saving to: ‘foo.bin’

foo.bin                             11%[======>                                                         ]   1.14G  47.1MB/s    in 13s


Cannot write to ‘foo.bin’ (Success).
root@hip-bat:~# journalctl -fk
-- Logs begin at Tue 2021-08-17 07:14:00 UTC. --
Aug 17 07:29:46 hip-bat kernel: EXT4-fs warning (device sda2): ext4_end_bio:309: I/O error 10 writing to inode 39839 (offset 1094713344 size 4194304 starting block 3370240)
Aug 17 07:29:47 hip-bat kernel: JBD2: Detected IO errors while flushing file data on sda2-8
Aug 17 07:29:48 hip-bat kernel: Aborting journal on device sda2-8.
Aug 17 07:29:48 hip-bat kernel: EXT4-fs error (device sda2): ext4_journal_check_start:61: Detected aborted journal
Aug 17 07:29:48 hip-bat kernel: EXT4-fs (sda2): Remounting filesystem read-only
Aug 17 07:29:48 hip-bat kernel: EXT4-fs error (device sda2): ext4_journal_check_start:61: Detected aborted journal
Aug 17 07:29:48 hip-bat kernel: EXT4-fs error (device sda2): ext4_journal_check_start:61: Detected aborted journal
Aug 17 07:29:48 hip-bat kernel: EXT4-fs error (device sda2): ext4_journal_check_start:61: Detected aborted journal
Aug 17 07:29:48 hip-bat kernel: EXT4-fs (sda2): ext4_writepages: jbd2_start: 0 pages, ino 39839; err -30
Aug 17 07:29:48 hip-bat kernel: JBD2: Detected IO errors while flushing file data on sda2-8
^C
root@hip-bat:~# mount | grep ext4
/dev/sda2 on / type ext4 (ro,relatime)
root@hip-bat:~# exit

root @ node2 # lxc stop hip-bat
root @ node2 # lxc start hip-bat
Error: Failed to create file "/var/snap/lxd/common/lxd/virtual-machines/hip-bat/backup.yaml": open /var/snap/lxd/common/lxd/virtual-machines/hip-bat/backup.yaml: disk quota exceeded

To me it seems the btrfs quota is not necessary for VMs. The disk size limit is already set on the .img file. The btrfs quota here is causing breaking issues.

I forgot to include a df -h / output in my example to clarify that the VM disk isn’t full and that the btrfs quota is the problem, so I created a new VM:

root@cunning-jay:~# wget https://speed.hetzner.de/10GB.bin -O foo.bin
..
Length: 10485760000 (9.8G) [application/octet-stream]
Saving to: ‘foo.bin’

foo.bin                            100%[===============================================================>]   9.77G   156MB/s    in 64s

2021-08-17 08:24:15 (157 MB/s) - ‘foo.bin’ saved [10485760000/10485760000]

root@cunning-jay:~# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        14G   11G  2.7G  81% /
root@cunning-jay:~# rm foo.bin
root@cunning-jay:~# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        14G  956M   13G   7% /
root@cunning-jay:~# wget https://speed.hetzner.de/10GB.bin -O bar.bin
...
Length: 10485760000 (9.8G) [application/octet-stream]
Saving to: ‘bar.bin’

bar.bin                             17%[==========>                                                     ]   1.71G  57.5MB/s    in 13s


Cannot write to ‘bar.bin’ (Success).
root@cunning-jay:~# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        14G  2.7G   11G  20% /

The VM disk itself isn’t full but the btrfs quota is. I believe this is not intended behavior. The btrfs quota should be there for containers but not for VMs.

I manually deleted the quota limit and now my VM can freely move between 100% and 10% disk usage without issues. The thin image will not grow beyond its max size (15GiB). The subvolume shows slightly higher usage than the originally configured 15GB limit because of the other files, the thin image metadata, etc.
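For anyone curious, removing the limit was roughly a one-liner (the qgroup ID matches the one in the qgroup show output below):

root @ node3 # btrfs qgroup limit none 0/471 /btrfs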

root @ node3 # btrfs qgroup show /btrfs/virtual-machines
qgroupid         rfer         excl
--------         ----         ----
<snip>
0/471        15.23GiB     12.95GiB

Hi @vosdev

I’ve been looking into this and I believe the quota is being set correctly, as we have accounted for the VM disk file in the quota.

Here’s an example.

First, let’s create a fresh BTRFS pool and create a VM in it (with no quota):

lxc storage create btrfs btrfs
lxc init images:ubuntu/focal v1 --vm -s btrfs

Now let’s check that this is correct according to BTRFS:

btrfs subvolume show /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/
virtual-machines/v1
	Name: 			v1
	UUID: 			bacbd6e9-1c5a-c54d-bd27-0c12a43abcab
	Parent UUID: 		78fef1a0-71bb-884b-aa42-bbe1e3badd8a
	Received UUID: 		-
	Creation time: 		2021-08-17 11:31:01 +0100
	Subvolume ID: 		266
	Generation: 		34
	Gen at creation: 	34
	Parent ID: 		5
	Top level ID: 		5
	Flags: 			-
	Snapshot(s):
	Quota group:		0/266
	  Limit referenced:	-
	  Limit exclusive:	-
	  Usage referenced:	4.00GiB
	  Usage exclusive:	16.00KiB

OK, so no quota limit is set yet. Good.

Let’s have a look at the VM’s directory before it has been started.

du -m /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/
1	 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/templates
4097 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/

So we have about 4GB of usage from the disk image file, which tallies with the BTRFS “Usage referenced” value above. Notice that because this has been created from a BTRFS snapshot, the actual (exclusive) usage is a lot less at the moment.

Now start the VM, and recheck:

lxc start v1
du -m /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/
1	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/templates
1	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1//config/cloud-init
1	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/config/systemd
1	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/config/udev
1	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/config/files
17	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/config
1	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/config.mount/cloud-init
1	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/config.mount/systemd
1	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/config.mount/udev
1	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/config.mount/files
17	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/config.mount
4130	/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/

We can see the usage has increased by about 33MB, and that config and config.mount both use 17MB of that (which accounts for basically the entire growth). This is the lxd-agent being copied into the “config” directory; config.mount is a read-only mount of it and does not actually take up additional space.

Let’s also check the BTRFS usage:

btrfs subvolume show /var/lib/lxd/storage-pools//btrfs/virtual-machines/v1/
virtual-machines/v1
	Name: 			v1
	UUID: 			bacbd6e9-1c5a-c54d-bd27-0c12a43abcab
	Parent UUID: 		78fef1a0-71bb-884b-aa42-bbe1e3badd8a
	Received UUID: 		-
	Creation time: 		2021-08-17 11:31:01 +0100
	Subvolume ID: 		266
	Generation: 		41
	Gen at creation: 	34
	Parent ID: 		5
	Top level ID: 		5
	Flags: 			-
	Snapshot(s):
	Quota group:		0/266
	  Limit referenced:	-
	  Limit exclusive:	-
	  Usage referenced:	4.10GiB
	  Usage exclusive:	100.84MiB

The exclusive usage has now increased to about 100MB; this includes the lxd-agent and changes to the root disk file now that the VM has been booted.

I then stopped and started the VM several times to check that these 2 figures weren’t growing significantly on every boot, and they weren’t. But you would expect some growth as the filesystem inside the VM allocates blocks from the disk image file.

After 3 reboots it went to 105MB exclusive usage.

Quota

Now let’s also look at the size of the disk image file (note this is the max size, not the actual used size):

ls -la /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img 
-rw-r--r-- 1 root root 10000007168 Aug 17 11:41 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

OK, so it’s the default size of 10GB. Good.

Now let’s set a quota:

lxc stop v1
lxc config device set v1 root size=15GB

Check the disk image has been resized:

ls -la /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img 
-rw-r--r-- 1 root root 15000002560 Aug 17 11:44 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

15GB, OK good.

Let’s now check the BTRFS quota; this time I’m using the --raw flag as I want to see the quota in bytes.

btrfs subvolume show --raw /var/lib/lxd/storage-pools//btrfs/virtual-machines/v1/
virtual-machines/v1
	Name: 			v1
	UUID: 			b3a373ce-4759-8e42-99ab-62d204aeb555
	Parent UUID: 		78fef1a0-71bb-884b-aa42-bbe1e3badd8a
	Received UUID: 		-
	Creation time: 		2021-08-17 11:44:06 +0100
	Subvolume ID: 		267
	Generation: 		54
	Gen at creation: 	51
	Parent ID: 		5
	Top level ID: 		5
	Flags: 			-
	Snapshot(s):
	Quota group:		0/267
	  Limit referenced:	15100002560
	  Limit exclusive:	-
	  Usage referenced:	4319490048
	  Usage exclusive:	24391680

There is now a BTRFS quota applied; however, the limit is larger than the 15000000000 bytes (15GB) we set. It is 100002560 bytes larger.

What’s actually happening is that LXD is attempting to set the BTRFS filesystem quota for the config volume to 100MB (this is the default size for the VM config volume and can be controlled by specifying the size.state property on the LXD VM volume).
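For example, something like the following should bump the config volume size (the value is chosen arbitrarily, and treat the exact invocation as an assumption):

lxc config device set v1 root size.state=500MiB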

Under the hood, LXD’s BTRFS driver notices that this is a VM volume, is aware that the VM’s block disk image will also be stored in the config volume, and therefore adds the disk image’s maximum size on to the quota.

So above we can see the disk image’s size of 15000002560 bytes, plus 100000000 bytes for the 100MB default config volume quota, takes us to the expected BTRFS quota of 15100002560 bytes.
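As a quick sanity check of that arithmetic in bytes:

echo $((15000002560 + 100000000))
15100002560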

You can see the code for this here:

So now we see that in a basic scenario things seem to be working as expected.

On to your specific issue. The first thing I would want to check is whether you have any snapshots.

Snapshots consume disk usage from the quota, which effectively reduces the quota available to the main volume.

Actually I see you have no snapshots, I’ll try and recreate the issue…

Hey Thomas,

Thanks for the information and for looking into this! It’s nice to see that the metadata has been taken into account for the quota.

I do not have any snapshots, and I could reproduce this with brand new VMs by simply attempting to write more data than the thin volume’s size to get the thin volume to grow to a full 100%.

Snapshots consume disk usage from the quota, which effectively reduces the quota available to the main volume.

Aren’t these put in a separate quota group?

If there is any information you need just ask. I can post my whole config here if you’d like.

I’ve recreated this issue, but I am not sure what the solution is.

As an aside, I noticed that it was not possible to remove a BTRFS quota set on the VM’s config filesystem subvolume when doing something like lxc config device unset v1 root size, which for a container would remove the quota.

This is fixed in:

https://github.com/lxc/lxd/pull/9120

Although the last specified size is still maintained for the block disk file as it cannot have “no quota”.

As for the main problem, I can see that filling up the disk image file does indeed cause the BTRFS subvolume’s referenced data size to reach the quota (i.e. it goes over the default 100MB we add beyond the disk image size itself).

The disk image file itself hasn’t grown beyond its original size; it’s just that BTRFS considers the total referenced data of the subvolume to have exceeded the quota.

Using “exclusive” data quotas solves the issue, but I am not sure if that is the correct thing to do.
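For illustration, this is the kind of exclusive limit I mean, set manually with the qgroup ID and size from the output above (a sketch of the idea, not necessarily what the driver should do):

btrfs qgroup limit -e 15100002560 0/267 /var/lib/lxd/storage-pools/btrfs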

Each qgroup primarily tracks two numbers, the amount of total referenced space and the amount of exclusively referenced space.

referenced space is the amount of data that can be reached from any of the subvolumes contained in the qgroup, while exclusive is the amount of data where all references to this data can be reached from within this qgroup.

https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-quota

It certainly fixes the problem, but I am unsure whether this would allow the volume to consume too much storage.

Also, I am unclear how that impacts the snapshot quota; from what I see, the snapshots get their own qgroup whichever quota limit type is used.

Any thoughts on this @stgraber ?

I’ve opened an issue for this:


Thank you! It’s amazing how well you guys respond to issues here, with well written-out explanations.
I subscribed to the issue because I’m also curious to see the solution.

I have removed the quota from the subvolumes for the VMs that were “at risk” so I can start all my VMs again. :slight_smile:


I jumped on this issue hard. Had a call at 6 in the morning :slight_smile:

My situation

  • Windows VM: increased the disk size from 200GB to 300GB (there is an EFI partition at the end of the disk, so disk.img is allocated fully)
  • created a btrfs snapshot, just in case the resize wouldn’t work too well
  • a btrbk cron job additionally created a btrfs snapshot outside the LXD ecosystem (I have been using btrbk alongside LXD for many years now)

And the VM got frozen at some point when I didn’t look at it… Currently I’m just disabling btrfs quotas via a cron job, since VM size is no concern.
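That cron job is essentially just the following (assuming the btrfs filesystem is mounted at /btrfs; adjust the mount point for your setup):

btrfs quota disable /btrfs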

Yeah, it’s not great the way BTRFS quotas work.

We added some documentation around the use of VMs with BTRFS quotas:

“Our recommendation is to not use VMs with btrfs storage pools…”