My setup consists of 3 nodes running Ubuntu 20.04 with the HWE kernel, snap LXD 4.17, and a Ceph 15 cluster. I have had this setup for ~1.5 years now and am very happy with LXD 4.x.
All VMs/containers run Ubuntu 20.04 HWE with the exception of 2 which are out of scope for this issue.
Some VMs have their storage on Ceph; clustered VMs have their storage on a local btrfs disk mounted at /btrfs.
One of my VMs was having trouble: / was mounted read-only with I/O errors in the kernel logs, so I shut it down. Now it is unable to start:
```
root @ node2 # lxc start kubew2
Error: Failed to create file "/var/snap/lxd/common/lxd/virtual-machines/kubew2/backup.yaml": open /var/snap/lxd/common/lxd/virtual-machines/kubew2/backup.yaml: disk quota exceeded
Try `lxc info --show-log kubew2` for more info
root @ node2 # lxc info --show-log kubew2
Name: kubew2
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Location: node2
Created: 2021/05/24 13:53 UTC
Last Used: 2021/08/13 11:40 UTC
Error: open /var/snap/lxd/common/lxd/logs/kubew2/qemu.log: no such file or directory
```
It tries to write to /var/snap/lxd/common/lxd/virtual-machines/kubew2/, but that directory does not exist:

```
ls: cannot access '/var/snap/lxd/common/lxd/virtual-machines/kubew2/': No such file or directory
```
All symlinks in /var/snap/lxd/common/lxd/virtual-machines that lead to the btrfs storage pool are dead symlinks, on all three of my nodes:

```
root @ node2 # ls /var/snap/lxd/common/lxd/virtual-machines -l
total 20
lrwxrwxrwx 1 root root 67 May 24 16:07 kube2 -> /var/snap/lxd/common/lxd/storage-pools/btrfs/virtual-machines/kube2
lrwxrwxrwx 1 root root 68 May 24 16:07 kubew2 -> /var/snap/lxd/common/lxd/storage-pools/btrfs/virtual-machines/kubew2
lrwxrwxrwx 1 root root 68 Jun 10 13:49 kubew5 -> /var/snap/lxd/common/lxd/storage-pools/btrfs/virtual-machines/kubew5
lrwxrwxrwx 1 root root 66 May 12 09:16 plex -> /var/snap/lxd/common/lxd/storage-pools/ceph/virtual-machines/plex
lrwxrwxrwx 1 root root 59 Jan 25  2021 smb1 -> /var/snap/lxd/common/lxd/storage-pools/ceph/containers/smb1
lrwxrwxrwx 1 root root 74 Mar  1 11:47 transmission2 -> /var/snap/lxd/common/lxd/storage-pools/ceph/virtual-machines/transmission2
```
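The same check can be scripted: `find`'s `-xtype l` test matches symlinks whose target cannot be resolved, so it lists exactly the dead links (the path below is my snap LXD layout; adjust for other installs):

```shell
# List symlinks under the LXD virtual-machines directory whose targets
# no longer exist. -xtype l means "still a symlink after dereferencing",
# i.e. a dangling link.
find /var/snap/lxd/common/lxd/virtual-machines -maxdepth 1 -xtype l
```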
They are dead because they point to /var/snap/lxd/common/lxd/storage-pools/btrfs, which is empty on all 3 of my nodes:

```
root @ node2 # ls /var/snap/lxd/common/lxd/storage-pools/btrfs -la
total 8
drwx--x--x 2 root root 4096 May 24 14:38 .
drwx--x--x 5 root root 4096 May 24 14:38 ..
```
Luckily the VM that refuses to start is a Kubernetes worker node that I can live without.
Disk usage inside the VM is ~42%: the disk is 30GB and only ~12GB is in use. btrfs, however, thinks otherwise; if I'm not mistaken, it considers the full 28.03GiB used:
```
root @ node2 # btrfs subvolume show /btrfs/virtual-machines/kubew2
virtual-machines/kubew2
	Name:              kubew2
	UUID:              a880e030-233b-d84b-9c0d-9723ce7ba096
	Parent UUID:       -
	Received UUID:     -
	Creation time:     2021-05-24 15:53:45 +0200
	Subvolume ID:      305
	Generation:        299360
	Gen at creation:   147
	Parent ID:         5
	Top level ID:      5
	Flags:             -
	Snapshot(s):
	Quota group:       0/305
	  Limit referenced:  28.03GiB
	  Limit exclusive:   -
	  Usage referenced:  28.03GiB
	  Usage exclusive:   28.03GiB
root @ node2 # btrfs subvolume show /btrfs/virtual-machines/kubew5
virtual-machines/kubew5
	Name:              kubew5
	UUID:              3e718300-7a51-9546-9477-074dce34eb7d
	Parent UUID:       8c39712e-7bc6-4548-ad8b-718fe3f165e6
	Received UUID:     -
	Creation time:     2021-06-10 13:49:41 +0200
	Subvolume ID:      347
	Generation:        302163
	Gen at creation:   50161
	Parent ID:         5
	Top level ID:      5
	Flags:             -
	Snapshot(s):
	Quota group:       0/347
	  Limit referenced:  28.03GiB
	  Limit exclusive:   -
	  Usage referenced:  10.48GiB
	  Usage exclusive:   10.48GiB
```
/btrfs/virtual-machines/kubew5/root.img has a size of
I could perhaps manually increase the quota. Other workers like kubew1 and kubew3 were created at the same time as kubew2; all three show ~13GB used when checked with `df -h /`, yet their btrfs subvolume quotas report 22.36GiB and 22.08GiB used:
```
root @ node1 # btrfs subvolume show /btrfs/virtual-machines/kubew1 | tail -n 4
	  Limit referenced:  28.03GiB
	  Limit exclusive:   -
	  Usage referenced:  22.36GiB
	  Usage exclusive:   22.36GiB

kubew1 ❯ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        27G   13G   15G  47% /

root @ node3 # btrfs subvolume show /btrfs/virtual-machines/kubew3 | tail -n 4
	  Limit referenced:  28.03GiB
	  Limit exclusive:   -
	  Usage referenced:  22.08GiB
	  Usage exclusive:   22.08GiB

kubew3 ❯ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        27G   15G   13G  53% /
```
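For comparing all workers at once instead of per subvolume, `btrfs qgroup show` can print every quota group's usage next to its limits (run against the pool mount point; this is just how I'd inspect it, requiring root and an actual btrfs mount):

```shell
# Show all quota groups on the pool with referenced/exclusive usage
# and their limits (-r = show max referenced, -e = show max exclusive).
btrfs qgroup show -re /btrfs
```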
In the end I have two questions/issues:

1. Why is /var/snap/lxd/common/lxd/storage-pools/btrfs empty (leaving dead symlinks elsewhere)?
2. Why is the storage usage reported by the btrfs storage backend far higher than the storage actually in use?
For the latter, my guess is the thin-provisioned QEMU disk?
I'm going to need to increase the quotas on my other VMs before they all run into this issue. I can live without 1-3 k8s workers, but if one dies, the others take on a higher load, pull more images, and fill up their own storage, until it all comes down like dominoes…
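A sketch of how I'd raise a worker's limit through LXD (so the database and backup.yaml stay in sync, rather than touching the qgroup directly) and then reclaim unused thin-provisioned blocks from inside the guest. This assumes the root disk device is inherited from the default profile and that the VM's virtual disk supports discard; I have not verified either on my setup:

```shell
# Copy the profile-inherited root disk device onto this instance and
# grow its size limit (the guest filesystem may still need resizing).
lxc config device override kubew1 root size=40GiB

# Trim unused blocks inside the guest so the host-side image (and the
# btrfs qgroup usage) can shrink back toward the real usage.
lxc exec kubew1 -- fstrim -av
```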
I'll hold off on "fixing" my broken kubew2 by manually increasing the quota, in case anyone wants me to do some debugging/tests.