Hello,
My setup consists of 3 nodes running Ubuntu 20.04 with the HWE kernel, snap LXD 4.17 and a Ceph 15 cluster. I have had this setup for ~1.5 years now and am very happy with LXD 4.x.
All VMs/containers run Ubuntu 20.04 HWE, with the exception of 2 which are out of scope for this issue.
Some VMs have their storage on Ceph; clustered VMs have their storage on a local btrfs disk mounted at /btrfs.
One of my VMs was having trouble: / was mounted read-only and there were I/O errors in the kernel logs, so I shut it down. Now it is unable to start.
root @ node2 # lxc start kubew2
Error: Failed to create file "/var/snap/lxd/common/lxd/virtual-machines/kubew2/backup.yaml": open /var/snap/lxd/common/lxd/virtual-machines/kubew2/backup.yaml: disk quota exceeded
Try `lxc info --show-log kubew2` for more info
root @ node2 # lxc info --show-log kubew2
Name: kubew2
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Location: node2
Created: 2021/05/24 13:53 UTC
Last Used: 2021/08/13 11:40 UTC
Error: open /var/snap/lxd/common/lxd/logs/kubew2/qemu.log: no such file or directory
It tries to write backup.yaml to /var/snap/lxd/common/lxd/virtual-machines/kubew2/, but that directory does not exist:
ls: cannot access '/var/snap/lxd/common/lxd/virtual-machines/kubew2/': No such file or directory
All symlinks in /var/snap/lxd/common/lxd/virtual-machines that lead to the btrfs storage pool are dead on all three of my nodes:
root @ node2 # ls /var/snap/lxd/common/lxd/virtual-machines -l
total 20
lrwxrwxrwx 1 root root 67 May 24 16:07 kube2 -> /var/snap/lxd/common/lxd/storage-pools/btrfs/virtual-machines/kube2
lrwxrwxrwx 1 root root 68 May 24 16:07 kubew2 -> /var/snap/lxd/common/lxd/storage-pools/btrfs/virtual-machines/kubew2
lrwxrwxrwx 1 root root 68 Jun 10 13:49 kubew5 -> /var/snap/lxd/common/lxd/storage-pools/btrfs/virtual-machines/kubew5
lrwxrwxrwx 1 root root 66 May 12 09:16 plex -> /var/snap/lxd/common/lxd/storage-pools/ceph/virtual-machines/plex
lrwxrwxrwx 1 root root 59 Jan 25 2021 smb1 -> /var/snap/lxd/common/lxd/storage-pools/ceph/containers/smb1
lrwxrwxrwx 1 root root 74 Mar 1 11:47 transmission2 -> /var/snap/lxd/common/lxd/storage-pools/ceph/virtual-machines/transmission2
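For reference, a quick way to list only the dangling links on each node (GNU find's -xtype l matches symlinks whose target is missing):

# list only broken symlinks in the instances directory
find /var/snap/lxd/common/lxd/virtual-machines -maxdepth 1 -xtype l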
They are dead because they point to /var/snap/lxd/common/lxd/storage-pools/btrfs, which is empty on all 3 of my nodes:
root @ node2 # ls /var/snap/lxd/common/lxd/storage-pools/btrfs -la
total 8
drwx--x--x 2 root root 4096 May 24 14:38 .
drwx--x--x 5 root root 4096 May 24 14:38 ..
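One thing I should probably rule out: the snap runs LXD in its own mount namespace, so the pool could be mounted there while the host path looks empty. A quick check, assuming the standard snapd namespace file at /run/snapd/ns/lxd.mnt:

# peek inside the LXD snap's mount namespace
nsenter --mount=/run/snapd/ns/lxd.mnt ls -la /var/snap/lxd/common/lxd/storage-pools/btrfs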
Luckily, the VM that refuses to start is a Kubernetes worker node that I can live without.
Disk usage inside the VM is ~42%: the disk is 30GB and only ~12GB of it is used.
However, btrfs thinks otherwise. If I'm not mistaken, it thinks the full 28.03GiB has been used.
root @ node2 # btrfs subvolume show /btrfs/virtual-machines/kubew2
virtual-machines/kubew2
Name: kubew2
UUID: a880e030-233b-d84b-9c0d-9723ce7ba096
Parent UUID: -
Received UUID: -
Creation time: 2021-05-24 15:53:45 +0200
Subvolume ID: 305
Generation: 299360
Gen at creation: 147
Parent ID: 5
Top level ID: 5
Flags: -
Snapshot(s):
Quota group: 0/305
Limit referenced: 28.03GiB
Limit exclusive: -
Usage referenced: 28.03GiB
Usage exclusive: 28.03GiB
root @ node2 # btrfs subvolume show /btrfs/virtual-machines/kubew5
virtual-machines/kubew5
Name: kubew5
UUID: 3e718300-7a51-9546-9477-074dce34eb7d
Parent UUID: 8c39712e-7bc6-4548-ad8b-718fe3f165e6
Received UUID: -
Creation time: 2021-06-10 13:49:41 +0200
Subvolume ID: 347
Generation: 302163
Gen at creation: 50161
Parent ID: 5
Top level ID: 5
Flags: -
Snapshot(s):
Quota group: 0/347
Limit referenced: 28.03GiB
Limit exclusive: -
Usage referenced: 10.48GiB
Usage exclusive: 10.48GiB
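To compare all the subvolumes at once, the qgroup overview shows the same numbers in one table (btrfs qgroup show's -r/-e flags add the limit columns):

# list all qgroups with referenced/exclusive usage plus their limits
btrfs qgroup show -re /btrfs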
Both /btrfs/virtual-machines/kubew2/root.img and /btrfs/virtual-machines/kubew5/root.img have a size of 30000005120 bytes.
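Assuming the images are sparse files (which the thin provisioning suggests), the apparent size and the allocated size should differ; comparing the two shows how much of an image is really allocated:

# apparent size (what the guest sees) vs blocks actually allocated on disk
du -h --apparent-size /btrfs/virtual-machines/kubew2/root.img
du -h /btrfs/virtual-machines/kubew2/root.img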
I could perhaps manually increase the quota. Other workers like kubew1 and kubew3 were created at the same time as kubew2; all three show ~13GB used when checked with df -h /. Yet their btrfs subvolume quotas report 22.36GiB and 22.08GiB used:
root @ node1 # btrfs subvolume show /btrfs/virtual-machines/kubew1 | tail -n 4
Limit referenced: 28.03GiB
Limit exclusive: -
Usage referenced: 22.36GiB
Usage exclusive: 22.36GiB
kubew1 ❯ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 27G 13G 15G 47% /
root @ node3 # btrfs subvolume show /btrfs/virtual-machines/kubew3 | tail -n 4
Limit referenced: 28.03GiB
Limit exclusive: -
Usage referenced: 22.08GiB
Usage exclusive: 22.08GiB
kubew3 ❯ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 27G 15G 13G 53% /
In the end I have two questions/issues:
1. Why is /var/snap/lxd/common/lxd/storage-pools/btrfs empty (leaving dead symlinks elsewhere)?
2. Why is the storage usage on the btrfs storage backend way higher than the storage that is actually in use? For the latter, I'm going to guess the thin-provisioned QEMU disk?
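If that guess is right, blocks freed inside the guest never get returned to the backing image. Running fstrim inside the VM might reclaim them, assuming the virtual disk exposes discard support:

# inside the VM: discard unused blocks on all mounted filesystems
# (only effective if the virtual disk advertises discard/unmap)
sudo fstrim -av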
I'm going to need to increase the quotas on my other VMs before they all run into this issue. I can live without 1-3 k8s workers, but if one dies, the others get a higher load, download more images and fill up their storage, so before it all comes down like dominoes…
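A sketch of what I have in mind, using kubew5's qgroup ID (0/347) from the output above; the 35G figure is just an example, and LXD may reapply its own limit, in which case growing the volume through LXD itself would be cleaner:

# raise the referenced-space limit on the subvolume's qgroup directly
btrfs qgroup limit 35G 0/347 /btrfs
# or, if the root disk comes from a profile, let LXD grow it instead
# (VM stopped; block volumes can only grow, not shrink)
lxc config device override kubew5 root size=35GiB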
I'll not "fix" my broken kubew2 by manually increasing the quota, in case anyone wants me to do some debugging/tests.