BTRFS Issue: Failed preparing container for start

I have set up an Ubuntu Server LTS guest inside Parallels Desktop Preview for M1, using a BTRFS partition mounted at /btrfs, which is set as the default storage pool.

I am running a unit test that creates a container; it works fine on the ZFS server. No matter how many times I run it here, I get the same error, just with a different file each time:

{
    "type": "sync",
    "status": "Success",
    "status_code": 200,
    "operation": "",
    "error_code": 0,
    "error": "",
    "metadata": {
        "id": "a6538030-6f28-4e19-8269-81984d4b91a7",
        "class": "task",
        "description": "Starting instance",
        "created_at": "2021-02-12T16:37:25.422106169Z",
        "updated_at": "2021-02-12T16:37:25.424997113Z",
        "status": "Failure",
        "status_code": 400,
        "resources": {
            "instances": [
                "/1.0/instances/itest"
            ]
        },
        "metadata": {
            "container_progress": "Remapping container filesystem"
        },
        "may_cancel": false,
        "err": "Failed preparing container for start: Failed to change ownership of: /var/snap/lxd/common/lxd/storage-pools/default/containers/itest/rootfs/usr/lib/apt/apt.systemd.daily",
        "location": "none"
    }
}

This issue appears constantly during the unit tests, but not in normal usage, so I am guessing it is the equivalent of ZFS' busy errors, as I am hammering it.

Any thoughts?

Note: this morning when I booted up, I was getting "Error occurred when starting proxy device: Error: Failed to receive fd from listener process" errors; in the end I had to restart to make them go away.

Can you check the output of journalctl -u snap.lxd.daemon? It may have more specific errors printed.

You are spot on; I found the log full of these errors:
Feb 12 16:20:10 ubuntu1 lxd.daemon[999]: Failed chown: Disk quota exceeded

So I went to check: in the unit tests I am setting a quota of 1GB, and this causes the containers to fail. I ran du -h inside the Ubuntu container and it reports 443MB. The same thing happens with a quota of 2GB, despite the quota not being reached or anywhere near it. It only starts to work if I set 3GB. I'm not sure why it's being reported as full, and 2.5GB seems too large a buffer.

I create the container, then send a PATCH request to set the quota:

{
    "architecture": "aarch64",
    "config": {
        "image.architecture": "arm64",
        "image.description": "Ubuntu focal arm64 (20210210_07:42)",
        "image.os": "Ubuntu",
        "image.release": "focal",
        "image.serial": "20210210_07:42",
        "image.type": "squashfs",
        "limits.cpu": "1",
        "limits.memory": "1GB",
        "volatile.apply_template": "create",
        "volatile.base_image": "e4f0f8445d85549cd4eb7be4afdc4f360334fe928c3f60eb909857d3d4dbe1ed",
        "volatile.eth0.hwaddr": "00:16:3e:ff:9f:b7",
        "volatile.idmap.base": "0",
        "volatile.idmap.next": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
        "volatile.last_state.idmap": "[]"
    },
    "devices": {
        "root": {
            "path": "/",
            "pool": "default",
            "type": "disk",
            "size": "1GB"
        }
    },
    "ephemeral": false,
    "profiles": [
        "custom-default",
        "custom-nat"
    ],
    "stateful": false,
    "description": "",
    "created_at": "2021-02-12T17:13:29.035500338Z",
    "expanded_config": {
        "image.architecture": "arm64",
        "image.description": "Ubuntu focal arm64 (20210210_07:42)",
        "image.os": "Ubuntu",
        "image.release": "focal",
        "image.serial": "20210210_07:42",
        "image.type": "squashfs",
        "limits.cpu": "1",
        "limits.memory": "1GB",
        "volatile.apply_template": "create",
        "volatile.base_image": "e4f0f8445d85549cd4eb7be4afdc4f360334fe928c3f60eb909857d3d4dbe1ed",
        "volatile.eth0.hwaddr": "00:16:3e:ff:9f:b7",
        "volatile.idmap.base": "0",
        "volatile.idmap.next": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
        "volatile.last_state.idmap": "[]"
    },
    "expanded_devices": {
        "eth0": {
            "name": "eth0",
            "nictype": "bridged",
            "parent": "custombr0",
            "type": "nic"
        },
        "root": {
            "path": "/",
            "pool": "default",
            "type": "disk"
        }
    },
    "name": "itest",
    "status": "Stopped",
    "status_code": 102,
    "last_used_at": "1970-01-01T00:00:00Z",
    "location": "none",
    "type": "container"
} 
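For reference, the relevant part of that PATCH request boils down to the root device entry; a minimal sketch of building the request body (instance name itest and all values are taken from the config above):

```python
import json

# Minimal sketch: the PATCH body sent to /1.0/instances/itest to apply
# a root disk quota. Keys mirror the "devices" section shown above.
patch_body = {
    "devices": {
        "root": {
            "path": "/",
            "pool": "default",
            "type": "disk",
            "size": "1GB",  # the quota that later fails with "Disk quota exceeded"
        }
    }
}
print(json.dumps(patch_body, indent=4))
```

The same change should also be achievable from the CLI with lxc config device set itest root size 1GB.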

We’ve had a similar report earlier this week; maybe something changed in the way btrfs quotas work, or maybe there’s a bug in btrfs’ quota enforcement.

Normally I’d expect the quota to account for the delta between the base image and the current instance, or between the original instance and its copy if using an instance copy.
But it could be that the btrfs qgroup now applies to the entire set of referenced data instead?

We’ve also had some weird cases where btrfs will tell you that you ran out of disk space when in fact you ran out of space for the metadata. The fact that metadata and data are tracked separately can cause a bunch of interesting issues at times. I don’t know if it’s related to what you’re seeing here, though.

Ah, one last thing that may be at play: if you have snapshots, they are usually counted against the quota of the parent instance. So if your instance has changed a lot and is using snapshots, you could be using far more space than it seems.

What does lxc info report as far as disk usage? That value should come from btrfs, so it may be more reliable than du.

No snapshots; this is happening during the container creation process itself.

$ sudo btrfs fi show 
Label: none  uuid: 916f6d6a-c243-477c-b43b-20d12eb15518
	Total devices 1 FS bytes used 1.97GiB
	devid    1 size 53.50GiB used 3.02GiB path /dev/sda3

Now that I have more containers, I can’t create a container with 3GB; I have to change it to 4GB…

$ lxc info ubuntu
 Disk usage:
    root: 15.29MB

I think the quota is being checked against the whole disk's usage. Maybe I set it up wrong or something: when I set the quota on the container, it seems to be checked against the size of the storage pool. Maybe this is why 3GB no longer works.

Note: when installing Ubuntu, I formatted the second partition as BTRFS and mounted it at /btrfs; then, when I ran lxd init, I did this:

Create a new BTRFS pool? (yes/no) [default=yes]: no
Name of the existing BTRFS pool or dataset: /btrfs

Did I set it up wrong? I did not use mkfs.btrfs.

Nope, that’s perfectly fine.

In that case, it appears that the quota limit set on containers is being checked against the entire pool size, since I constantly have to increase the size to create a container, and the size must be higher than the total usage.

What does btrfs qgroup show -pcref /btrfs show you?

$ sudo btrfs qgroup show -pcref /btrfs
qgroupid         rfer         excl     max_rfer     max_excl parent  child 
--------         ----         ----     --------     -------- ------  ----- 
0/5          16.00KiB     16.00KiB         none         none ---     ---  

Ok, so far nothing weird. Can you run btrfs qgroup show -pcreF /btrfs/... for each of:

  • /btrfs
  • /btrfs/containers
  • /btrfs/containers/NAME
$ sudo btrfs qgroup show -pcreF /btrfs
qgroupid         rfer         excl     max_rfer     max_excl parent  child 
--------         ----         ----     --------     -------- ------  ----- 
0/5          16.00KiB     16.00KiB         none         none ---     ---  
$ sudo btrfs qgroup show -pcreF /btrfs/containers
qgroupid         rfer         excl     max_rfer     max_excl parent  child 
--------         ----         ----     --------     -------- ------  ----- 
0/5          16.00KiB     16.00KiB         none         none ---     ---  
$ sudo btrfs qgroup show -pcreF /btrfs/containers/ubuntu
qgroupid         rfer         excl     max_rfer     max_excl parent  child 
--------         ----         ----     --------     -------- ------  ----- 
0/527       428.08MiB     26.37MiB      2.79GiB         none ---     ---  

Ok, so far so good; that suggests a 3GB limit applied with 428MB of usage.
What does it show for another container?

I created a new one with a 5GB limit

$ sudo btrfs qgroup show -pcreF /btrfs/containers/ubuntu3
qgroupid         rfer         excl     max_rfer     max_excl parent  child 
--------         ----         ----     --------     -------- ------  ----- 
0/529       424.29MiB     22.58MiB      4.66GiB         none ---     ---  

Ok, that still looks correct: it’s assigned a separate quota group with the 5GB limit applied.
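A side note on the numbers: LXD parses sizes like "3GB" and "5GB" as decimal bytes, while btrfs prints max_rfer in binary GiB, which is why the limits show up as 2.79GiB and 4.66GiB. A quick sanity check (plain arithmetic, nothing LXD-specific):

```python
# LXD treats "NGB" as N * 10**9 bytes; btrfs displays max_rfer in GiB (2**30 bytes).
for limit_gb in (3, 5):
    gib = limit_gb * 10**9 / 2**30
    print(f"{limit_gb}GB -> {gib:.2f}GiB")  # 3GB -> 2.79GiB, 5GB -> 4.66GiB
```

So the max_rfer values above match the configured limits exactly; the quota itself is not being shrunk.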

Can you try the qgroup show on, say, rootfs/etc inside the container, to see if that’s properly linked to the qgroup?

How do I install that command?

I meant: run sudo btrfs qgroup show -pcreF /btrfs/containers/ubuntu3/rootfs/etc, for example.

$ sudo btrfs qgroup show -pcreF /btrfs/containers/ubuntu3/rootfs/etc
qgroupid         rfer         excl     max_rfer     max_excl parent  child 
--------         ----         ----     --------     -------- ------  ----- 
0/529       498.18MiB    115.23MiB      4.66GiB         none ---     ---  

Ok, so that all looks correct so far.