Container size keeps growing and eating disk space every day

I use ZFS as the storage backend and have 4 Linux containers running on it. The OS is Ubuntu 20.04 and the LXD version is 4.06.
My biggest concern is that the 4 running containers are growing in size every day, and the largest one is growing by 4GB per day. I will soon run out of disk space. Here are some outputs from zfs list for the last three days.

The output format is
NAME USED AVAIL REFER MOUNTPOINT
Day 1:
default/containers/repo 49.0G 165G 27.1G /var/snap/lxd/common/lxd/storage-pools/default/containers/repo

Day 2:
default/containers/repo 53.0G 161G 26.7G /var/snap/lxd/common/lxd/storage-pools/default/containers/repo

Day 3:
default/containers/repo 57.1G 157G 26.5G /var/snap/lxd/common/lxd/storage-pools/default/containers/repo

I have no clue what's going on here. Please give some advice and instructions on how I can resolve this issue.
BTW, two more things I noticed:

  1. The stopped container doesn't grow in size; only the running ones do.
  2. I noticed size differences between the USED and REFER columns in the zfs list outputs above. When I copied a snapshot of the running container to the backup server, the size of the backup container matched the REFER column, not the USED column. I am not clear on what that means.
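(If relevant, I can also post the output of zfs list -o space default/containers/repo; I believe that view breaks USED down into space held by snapshots versus space referenced by the dataset itself.)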

//////////////////////////////////////////
Some background, if it helps:
I started using LXD several years ago, with version 2.21 on Ubuntu 16.04. It has been running with no big problems, so I was never tempted to upgrade to a newer version.

Starting about 1 to 2 weeks ago, all the containers suddenly became unresponsive. Then I noticed that the AVAIL disk space was running out; it literally hit 0. I deleted some snapshots to free up some disk space, several hundred MB. However, all the free space was gone the next day. In fact, I noticed the free space was shrinking by several MB every few minutes.

So I built a new server with a much larger SSD and the latest Ubuntu and LXD. I copied the containers from the old server to the new one and started them again. Then I noticed the issue mentioned at the beginning of this post. Please advise, and thank you!

hmm, do you maybe have snapshots accumulating on those instances?

On my old server, the size kept growing even after I stopped the auto-snapshot and backup process. On the new server, I am currently doing the process manually. I think maybe I panicked and drew the conclusion too quickly. Below is the possible scenario:
A snapshot doesn't increase the size immediately,
but it will increase the size eventually.
That time lag tricked me into thinking the size increase was not related to the snapshots.
I panicked because of my bad experience on the old server, and I may need to cool down a little bit. Please give me a couple of days to do further observation. Thank you!

Yes, by definition a snapshot is free at the time it’s taken but will then hold on to its state as the instance starts diverging, causing a size increase as that happens.
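If it helps, something like the following should show how much of that USED space each snapshot is pinning (assuming the dataset name from your zfs list output; note that a snapshot's USED only counts space unique to that snapshot, so space shared between several snapshots isn't attributed to any single one):
zfs list -r -t snapshot -o name,used,refer default/containers/repo
zfs get usedbysnapshots,usedbydataset default/containers/repo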

Hello @stgraber, it seems my worries were justified. Without taking any snapshot yesterday, the container size grew again; it's still 4GB per day. It does not seem to be related to the snapshots.
Day 4:
default/containers/repo 61.1G 153G 26.3G /var/snap/lxd/common/lxd/storage-pools/default/containers/repo
Do you have any advice on how I can tackle this issue?

Does restarting the container clear up the space?

No, stopping/starting the containers doesn't clean up space.

Ok, so it’s not a deleted inode taking up the space.
Is du -sch --one-file-system / in the container getting you something close to what zfs reports?
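(For completeness, a quick check for deleted-but-still-open files inside the container, assuming lsof is installed there: lsof +L1 lists open files whose link count has dropped below one.)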

I removed the sc options to get a report on each directory, as follows:
du -h --one-file-system /
The most relevant information I can get from this command is the total size of LXD:
314G /var/snap/lxd/common
It is somewhat larger than what zfs list reports, shown below:
NAME USED AVAIL REFER MOUNTPOINT
default 297G 153G 24K none

Your container has nested containers?

No, just a simple container.

So I'm a bit confused why you'd have a /var/snap/lxd/common directory inside your container then.

I have ZFS as the storage backend, and I kept the storage pool's name as default during lxd init, as shown below:
/var/snap/lxd/common/lxd/storage-pools/default/
All the containers are under
/var/snap/lxd/common/lxd/storage-pools/default/containers/
For example, the container repo has its path in ZFS as shown below
/var/snap/lxd/common/lxd/storage-pools/default/containers/repo

Right, but I asked for du -sch --one-file-system / run INSIDE the container, not on the host.
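For example, from the host, something like this runs it inside the container (assuming the container is named repo as in your zfs output):
lxc exec repo -- du -sch --one-file-system /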

Actually, I did run the command du -sch --one-file-system / inside the container as well. Below is the result; it is the same as the REFER size of the container.
27G /
27G total

Can you show zfs list -t all?

Below are the results from zfs list -t all

NAME USED AVAIL REFER MOUNTPOINT
default/containers/repo 61.1G 153G 26.2G /var/snap/lxd/common/lxd/storage-pools/default/containers/repo
default/containers/repo@snapshot-repo-bck-20210507021418 4.35G - 35.5G -
default/containers/repo@snapshot-repo-bck-20210508021416 219M - 35.5G -
default/containers/repo@snapshot-repo-bck-20210510140352 4.15G - 39.4G -
default/containers/repo@snapshot-repo-bck-20210512114000 265M - 27.1G -
default/containers/repo@snapshot-repo-bck-20210514012200 260M - 26.7G -

And I suspect you care about those snapshots, so you can't just blow them away to see if they're the ones holding on to the state that's slowly diverging, causing the increase you're seeing?

I did remove snapshots when I experienced the issue on the old server; it didn't stop the size from increasing. On the other hand, I didn't have enough time to observe every detail, since the service crashed every night and my focus was on moving it to a new server first.

So yes, I can do it again, but this experiment will take some days and I will need to do it carefully.

BTW, I am not sure when or how this 4GB size increase happens, whether slowly or all of a sudden. Every morning when I check the size, the container has become 4GB bigger than it was the night before, at some point as late as around 2am. It stays the same size for the entire day until I go to bed at night. Then it repeats…

Yeah, what makes me think the snapshots could have something to do with holding onto that 4G of data is that you have a couple of snapshots listed above which suspiciously report a USED size just around 4G larger than the other, smaller snapshots.

So whatever that 4G change is in your container, it looks like it sometimes gets caught in the snapshots.
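If you want to test that theory, I'd remove the snapshots through LXD rather than with zfs destroy directly, so LXD's database stays consistent. A rough sketch, assuming the LXD snapshot names are the part after snapshot- in your zfs output:
lxc info repo   (to confirm the snapshot names LXD knows about)
lxc delete repo/repo-bck-20210507021418
Then keep an eye on zfs list over the next day or two to see whether USED stops climbing.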