Can't start virtual machine, disk quota exceeded. Can't grow disk quota or attach new storage pool either

Hello,

I’m a long-time reader, first-time poster. I got some valuable help from this article today, so I’ve decided to pay the good deed forward.

As it turns out, I ran into the same issue the OP experienced today. Here are my findings.

Like the OP, I was attempting to use BTRFS in production (I also didn’t know that ZFS is the default, so long as it’s installed, with BTRFS only a runner-up) and had also used this command to stop a VM a few times:

lxc stop -f <instancename>

What I didn’t realise was that running that command effectively yanks the VM’s virtual power cord. That, coupled with the fact that the file system in the VM had already gone read-only (due to the aforementioned BTRFS issue), had left the VM’s file system corrupt. In other words, it couldn’t start.
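In hindsight, a gentler pattern is to ask for a clean shutdown first and only pull the virtual power cord if the guest doesn't respond. A rough sketch, and the 120 second timeout is just an example (check lxc stop --help for the exact flag on your version):

lxc stop <instancename> --timeout 120

and only if that times out:

lxc stop -f <instancename>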

I had already applied this command:

lxc storage set default btrfs.mount_options=compress-force

Where ‘default’ is the name of my storage pool (hey, no judging).
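You can confirm the option actually landed on the pool with a quick sanity check (as far as I know, this should print compress-force):

lxc storage get default btrfs.mount_options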

I had also applied this command:

lxc config device set <instancename> root size.state=50GiB

Where 50GiB is twice the original volume size of 25GiB.

(As an aside, this is the command needed to rescue the VM. You can apply both settings, but this one helps on a per-instance basis without the performance impact of forcing compression across the entire BTRFS volume; we don’t all have M.2 drives or SSDs.)

lxc config device set <instancename> root size.state=50GiB

(As a further aside, if you still want to push forward with BTRFS, I’d suggest starting fresh and applying the following command before deploying any VMs.)

lxc storage set default btrfs.mount_options=compress-force

I didn’t stick around (I’m sorry) to determine whether the command retroactively fixes existing virtual machines. Perhaps someone else can chime in on this? Also, containers don’t seem to have these issues, just VMs.
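If you want to double-check what's currently applied before pushing on, this should show the expanded instance config, including the root device and any size.state override (a hedged suggestion, your device names may differ):

lxc config show <instancename> --expanded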

Back on track…

Still, when I started the VM, it ‘started’ but never became available.

lxc start <instancename>

+----------------+---------+-------------+-------------+-----------------+---+
| <instancename> | RUNNING | <blankIPv4> | <blankIPv6> | Virtual Machine | 0 |
+----------------+---------+-------------+-------------+-----------------+---+

Where <blankIPv4> is the undetected IPv4 address
Where <blankIPv6> is the undetected IPv6 address
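For a bit more detail than that list row gives you, the following is worth a look; with the agent down it mostly confirms that the VM process itself is alive (hedged, the output varies by version):

lxc info <instancename>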

lxc shell <instancename>

As per the OP, I got this message:

Error: LXD VM agent isn't currently running

The LXD VM agent isn’t running. It can’t run because the OS can’t start. The OS can’t start because the file system is corrupt. The file system is corrupt because it was shut down uncleanly. It was shut down uncleanly because the disk had gone read-only. The disk was read-only because the underlying disk quota had been exceeded by the metadata!
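If you want to see the metadata situation for yourself on the host, something like this should show it. The mount path below is an assumption based on a snap install and a pool named 'default'; adjust it to wherever your pool is mounted:

sudo btrfs filesystem df /var/snap/lxd/common/lxd/storage-pools/default

The Metadata line is the one to watch.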

The rest, as they say, is history, well documented here: Linux Containers - LXD - Has been moved to Canonical

(read the fourth bullet point!)

This post also provided some valuable insight:

https://github.com/lxc/lxd/issues/9124

I made my discovery when I used the following command to get a virtual serial console on my VM:

lxc console <instancename>

I pressed the enter key 1-2 times and discovered the VM was at the following boot prompt:

(initramfs)
(initramfs)

At this point I had already used the following command to take a backup of my VM from the host:

lxc export <instancename> instancename-backup.tar.gz
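If the instance has snapshots you don't need in the backup, I believe the --instance-only flag keeps the export smaller (check lxc export --help on your version):

lxc export <instancename> instancename-backup.tar.gz --instance-only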

Without much more to lose, I tried to exit the prompt with:

(initramfs) exit

(I was expecting it to reboot and return to the same prompt, though I was hoping :pray: it would boot!) I got the following output:

rootfs contains a file system with errors, check forced.
rootfs:
Inodes that were part of a corrupted orphan linked list found.

rootfs: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
fsck exited with status code 4
The root filesystem on /dev/sda2 requires a manual fsck

BusyBox v1.30.1 (Ubuntu 1:1.30.1-7ubuntu3) built-in shell (ash)
Enter 'help' for a list of built-in commands.

I know this one all too well. I followed the instructions:

(initramfs) fsck /dev/sda2

I pressed the ‘y’ key a few times:

fsck from util-linux 2.37.2
e2fsck 1.46.5 (30-Dec-2021)
rootfs contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 18643 was part of the orphaned inode list. FIXED.
Inode 18666 was part of the orphaned inode list. FIXED.
Inode 18675 was part of the orphaned inode list. FIXED.
Inode 18704 was part of the orphaned inode list. FIXED.
Inode 266152 extent tree (at level 1) could be narrower. Optimize<y>? yes
Pass 1E: Optimizing extent trees
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (1705636, counted=1128276).
Fix<y>? yes
Inode bitmap differences: -18632 -18643 -18666 -18675 -18704
Fix<y>? yes
Free inodes count wrong for group #1 (2036, counted=2041).
Fix<y>? yes
Free inodes count wrong (2749440, counted=2744590).
Fix<y>? yes

rootfs: ***** FILE SYSTEM WAS MODIFIED *****
rootfs: 228434/2973024 files (0.5% non-contiguous), 4949379/6077655 blocks
(initramfs)
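In hindsight I could have saved my 'y' key; as far as I know, the following answers yes to every prompt, at the cost of not reviewing each fix:

fsck -y /dev/sda2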

I attempted another exit:

(initramfs) exit

The VM started and came back online!

rootfs: clean, 228434/2973024 files, 4949379/6077655 blocks

Ubuntu 22.04.1 LTS <instancename> ttyS0

<instancename> login:
Password:

From here I was able to take a full backup of my data (I’m using Virtualmin) and start again. I wanted a fresh start on ZFS, so after backing up all the data in my VMs and exporting them onto the host, I removed LXD:

sudo snap remove lxd

2022-11-03T12:50:11+11:00 INFO Waiting for "snap.lxd.daemon.service" to stop.
Save data of snap "lxd" in automatic snapshot set #3
lxd removed

This took a little while. Then I listed the saved snapshot (I didn’t want a rerun of what I had just experienced when I reinstalled LXD).

sudo snap saved

Set  Snap  Age    Version      Rev    Size    Notes
3    lxd   27.1m  5.7-c62733b  23889  13.9GB  auto

I noted the number of the snapshot (3 in my case) and issued the following command:

sudo snap forget 3

Snapshot #3 forgotten.

I checked that it was really gone:

sudo snap saved

No snapshots found.
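If you know up front that you don't want the automatic snapshot at all, I believe this skips creating it in the first place, which would have saved me the saved/forget dance above:

sudo snap remove --purge lxd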

I’m using Pop!_OS (:yum:), so I needed to install ZFS and reboot.

sudo apt install zfsutils-linux zfs-dkms

The zfs-dkms package is especially important. Many articles on the web only mention zfsutils-linux when you search for how to install ZFS on Pop!_OS.

You will receive a notice regarding the licensing of ZFS versus the kernel. Be mindful and thoughtful, then continue.
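Before rebooting, it doesn't hurt to confirm the DKMS module actually built and loads (a quick check; the version subcommand assumes a reasonably recent OpenZFS):

sudo modprobe zfs

zfs version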

Once the installation was complete, I rebooted and installed LXD again.

sudo snap install lxd --channel=latest/stable

lxd 5.7-c62733b from Canonical✓ installed

I rebooted once more and then ran the lxd init utility.

lxd init

ZFS was available! Yay!

Would you like to use LXD clustering? (yes/no) [default=no]: no
Do you want to configure a new storage pool? (yes/no) [default=yes]: yes
Name of the new storage pool [default=default]: default

Name of the storage backend to use (ceph, cephobject, dir, lvm, zfs, btrfs) [default=zfs]: zfs
Create a new ZFS pool? (yes/no) [default=yes]: yes

Would you like to use an existing empty block device (e.g. a disk or partition)? (yes/no) [default=no]: no
Size in GiB of the new loop device (1GiB minimum) [default=30GiB]: 450GiB
Would you like to connect to a MAAS server? (yes/no) [default=no]: no
Would you like to create a new local network bridge? (yes/no) [default=yes]: yes
What should the new bridge be called? [default=lxdbr0]: lxdbr0
What IPv4 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]: 1.2.3.4/24
Would you like LXD to NAT IPv4 traffic on your bridge? [default=yes]: yes
What IPv6 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]: none
Would you like the LXD server to be available over the network? (yes/no) [default=no]: no
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]: yes
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]: yes
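That printed preseed is worth saving. As far as I know, you can replay the exact same answers non-interactively next time, where lxd-preseed.yaml is simply wherever you stored the printed YAML:

cat lxd-preseed.yaml | lxd init --preseed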

Finally, I imported my VMs:

lxc import instancename-backup.tar.gz

Importing instance: 100% (141.10MB/s)

These took a while (spinning rust, I know, right?).
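If you exported several instances, a loop along these lines imports the lot (a rough sketch, untested as written), and a plain lxc list confirms they're back:

for f in *-backup.tar.gz; do lxc import "$f"; done

lxc list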

I needed to recreate my networks and a few other things afterwards, but you get the idea.

What did I learn today?

Read the fine manual. Read it well.

P.S. This is great too:
