Disk space on LXD containers with a ZFS pool keeps increasing and does not decrease

I have a problem with disk space on an LXD host server. Two containers are running on this server: one is a proxy server, the other hosts sites. Both containers use the default ZFS pool, which has a total capacity of 15GB. Each container has its own disk limit: proxy 3GB, sites 12GB. Snapshots and tar archives are taken of these two containers every day, and each snapshot has an expiration of 7 days. In the evening, copies of the containers are transferred from the production server to a backup server with lxc copy.
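Roughly, the kind of daily job described above can be expressed with LXD's built-in snapshot settings plus a copy to the remote (a sketch only; the actual script is not shown in this thread, and the office: remote name is taken from later posts):

lxc config set sites snapshots.expiry 7d
lxc config set proxy snapshots.expiry 7d
lxc snapshot sites
lxc copy --mode=push --refresh sites office:sites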

My problem is that the containers on the production server are running out of disk space.

Default storage pool space:
space used: 11.93GiB
total space: 14.41GiB

sites root disk usage: 10.23GiB
proxy root disk usage: 1.68GiB

Name: sites
Status: RUNNING
Type: container
Architecture: x86_64
PID: 15278
Created: 2022/09/05 13:59 UTC
Last Used: 2022/12/01 19:04 UTC

Resources:
Processes: 120
Disk usage:
root: 10.24GiB
CPU usage:
CPU usage (in seconds): 100563
Memory usage:
Memory (current): 384.51MiB
Memory (peak): 1.40GiB
Network usage:
eth0:
Type: broadcast
State: UP
Host interface: veth08317273
MAC address: 00:16:3e:79:76:39
MTU: 1500
Bytes received: 5.56GB
Bytes sent: 14.49GB
Packets received: 34152548
Packets sent: 29411480
IP addresses:
inet: 10.148.154.110/24 (global)
inet6: fd42:dc6b:5124:630c:216:3eff:fe79:7639/64 (global)
inet6: fe80::216:3eff:fe79:7639/64 (link)
lo:
Type: loopback
State: UP
MTU: 65536
Bytes received: 158.58MB
Bytes sent: 158.58MB
Packets received: 1527285
Packets sent: 1527285
IP addresses:
inet: 127.0.0.1/8 (local)
inet6: ::1/128 (local)

Snapshots:
+-----------------------+----------------------+----------------------+----------+
|         NAME          |       TAKEN AT       |      EXPIRES AT      | STATEFUL |
+-----------------------+----------------------+----------------------+----------+
| snapshot-1111111023-0 | 2023/01/11 01:19 UTC | 2023/01/18 01:19 UTC | NO       |
+-----------------------+----------------------+----------------------+----------+
| snapshot-1212121023-0 | 2023/01/12 01:19 UTC | 2023/01/19 01:19 UTC | NO       |
+-----------------------+----------------------+----------------------+----------+
| snapshot-1313131023-0 | 2023/01/13 01:19 UTC | 2023/01/20 01:19 UTC | NO       |
+-----------------------+----------------------+----------------------+----------+
| snapshot-1414141023-0 | 2023/01/14 01:19 UTC | 2023/01/21 01:19 UTC | NO       |
+-----------------------+----------------------+----------------------+----------+
| snapshot-1515151023-0 | 2023/01/15 01:19 UTC | 2023/01/22 01:19 UTC | NO       |
+-----------------------+----------------------+----------------------+----------+
| snapshot-1616161023-0 | 2023/01/16 01:19 UTC | 2023/01/23 01:19 UTC | NO       |
+-----------------------+----------------------+----------------------+----------+

NAME USED AVAIL REFER MOUNTPOINT
default/containers/sites@migration-93f59a63-f099-422d-9b5d-a3c207729411 296M - 2.50G -
default/containers/sites@migration-e2d3e821-8dde-45ad-bd01-4b01699598af 82.6M - 2.64G -
default/containers/sites@migration-61aa56db-39d2-4e03-81b4-3c4eea95ee83 7.88M - 2.38G -
default/containers/sites@migration-27946a23-f10b-4e96-966b-710a5df1108b 7.96M - 2.51G -
default/containers/sites@migration-16d62f87-8483-42cd-8d97-4b63fede8a1a 77.5M - 2.64G -
default/containers/sites@migration-e5edfa48-1429-4f01-af09-7e0ffad6db08 76.6M - 2.77G -
default/containers/sites@migration-fcae50bb-cbe1-43b4-9592-aa0cc46a7dae 73.9M - 2.90G -
default/containers/sites@migration-10ac5b99-39e8-4a1e-9894-d24cbf9a6a55 74.8M - 3.03G -
default/containers/sites@migration-6ad928f3-05dc-4bf0-8513-4699585be238 73.9M - 3.16G -
default/containers/sites@migration-f96c2d63-8bbc-4e24-9b15-02a4ca7e8b1b 68.2M - 3.16G -
default/containers/sites@migration-ff54f0b7-63e6-4f8d-8719-9e2b05bc8ed2 67.6M - 3.29G -
default/containers/sites@migration-95a473e1-b3ac-433b-af39-716d8f5ea567 205M - 3.42G -
default/containers/sites@snapshot-snapshot-1111111023-0 79.6M - 5.27G -
default/containers/sites@snapshot-snapshot-1212121023-0 77.4M - 5.40G -
default/containers/sites@snapshot-snapshot-1313131023-0 82.1M - 5.53G -
default/containers/sites@snapshot-snapshot-1414141023-0 72.1M - 5.67G -
default/containers/sites@snapshot-snapshot-1515151023-0 13.0M - 4.99G -
default/containers/sites@snapshot-snapshot-1616161023-0 11.2M - 5.12G -
default/containers/proxy@migration-b52819ad-22db-4a3f-880e-40df55f79ab9 62.6M - 528M -
default/containers/proxy@migration-21bb0807-3c9c-45b4-b692-d93969fffcf2 62.0M - 528M -
default/containers/proxy@migration-34d47244-ee34-469f-a59a-b51d49a5108c 63.9M - 528M -
default/containers/proxy@migration-9f4d3ff8-f55d-4d15-aa11-d84dea659bb3 58.3M - 528M -
default/containers/proxy@migration-d48226e2-3c87-4dcd-99ae-5b5dbd561430 58.7M - 528M -
default/containers/proxy@migration-9f5ac97f-b523-4b51-80de-1bec3ad7f702 2.62M - 529M -
default/containers/proxy@migration-c1f77c64-988b-43ee-8186-61694c0ba571 2.65M - 529M -
default/containers/proxy@migration-90c4bf05-a086-4db2-af10-63572afd9386 73.0M - 530M -
default/containers/proxy@migration-c4124144-b354-4cee-a2f2-05ab97e86109 70.9M - 536M -
default/containers/proxy@migration-7c8f440b-a58a-4850-8657-ac9fded3f126 66.3M - 531M -
default/containers/proxy@migration-55dc2f94-f770-4fbe-a1d0-f9d65deb7b5a 56.1M - 529M -
default/containers/proxy@migration-4ff8cebb-6ed2-4183-9a6c-d7c8b27d6e92 56.6M - 529M -
default/containers/proxy@snapshot-snapshot-1111111023-0 74.6M - 530M -
default/containers/proxy@snapshot-snapshot-1212121023-0 2.14M - 531M -
default/containers/proxy@snapshot-snapshot-1313131023-0 2.51M - 533M -
default/containers/proxy@snapshot-snapshot-1414141023-0 75.0M - 537M -
default/containers/proxy@snapshot-snapshot-1515151023-0 57.2M - 539M -
default/containers/proxy@snapshot-snapshot-1616161023-0 57.1M - 537M -

When I deleted all the snapshots, the disk space did not decrease. The problem is not inside the container: I have of course limited the log levels and have repeatedly searched for problems inside, but I do not find anything that is consuming the disk space.

The only thing I noticed is these entries that are created in the ZFS file system:

default/containers/proxy@migration-55dc2f94-f770-4fbe-a1d0-f9d65deb7b5a

default/containers/sites@migration-93f59a63-f099-422d-9b5d-a3c207729411 296M - 2.50G -
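A listing limited to these leftover entries can presumably be produced with something along these lines (assuming the pool is named default, as in the output above):

zfs list -t snapshot -o name,used,refer -r default | grep '@migration-'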

Please help and advise: what are these entries, and how should I proceed? Can I delete them with zfs destroy, and how do I know which ones to delete?

There was a bug in LXD that left behind those optimized volume @migration- snapshots when doing a migration.

So they can be manually deleted.

What LXD version are you using?


Thanks for the fast answer, @tomp. I want to ask how to fix this bug, or if that is not possible, how to determine which @migration files to delete. I guess I can do it with zfs destroy.

$ sudo lxd --version
[sudo] password for sites:
5.10

You can delete all of the @migration ones. The current LXD shouldn't be leaving any more of them behind.


How do I do that?

The ZFS and BTRFS storage drivers were already not doing optimised transfers in the final stage of multi-sync mode (just returning nil), which caused the ZFS temporary snapshots to be left behind. So make this invocation type an error, and detect the use of non-optimised transfer mode earlier to avoid using MultiSync=true in the first place.

This then resolves the issue of the temporary snapshots not being cleaned up on the source.
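For the manual cleanup itself, something along these lines should work (a sketch; list the snapshots first, check the names, then destroy them, and adjust the pool name if yours is not default):

zfs list -H -t snapshot -o name -r default | grep '@migration-'
zfs list -H -t snapshot -o name -r default | grep '@migration-' | xargs -n1 sudo zfs destroy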

When I deleted all the @migration files, I could not copy the container to the backend server:

$ lxc copy --mode=push --refresh --stateless easyliving office:easyliving --verbose
Error: Failed creating instance record: Unknown configuration key: volatile.last_state.ready
What can I do now?

I think that is an unrelated error; it suggests the target server is older than the source, see:

It really was an older version of LXD, 5.0.2. The front-end server is LXD 5.10 on Ubuntu 18.04.6 LTS and the backup server is LXD 5.0.2 on Ubuntu 22.04.1 LTS.

I successfully transferred one container, but the other is giving an error and I no longer know what the problem is.

lxc copy proxy office:proxy-t

Error: Failed instance creation: Error transferring instance data: migration dump failed
(00.210335) Error (criu/sk-netlink.c:77): netlink: The socket has data to read
(00.210371) Error (criu/cr-dump.c:1635): Dump files (pid: 3372) failed with -1
(00.233758) Error (criu/cr-dump.c:2053): Dumping FAILED.

or

$ lxc copy --mode=push --refresh --stateless proxy office:proxy --verbose

Error: User signaled us three times, exiting. The remote operation will keep running
$

Hi, it's doing nothing.

Live migration for containers doesn’t work currently.

You should disable CRIU on both systems using:

sudo snap unset lxd criu.enable
sudo systemctl reload snap.lxd.daemon

Separately it looks like --stateless and --refresh aren’t working when combined.
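To confirm the setting afterwards, querying the snap option should help (once the key has been unset, snap get will report that the option is not set):

sudo snap get lxd criu.enable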

I executed the command lxc copy --mode=push --refresh --stateless easyliving office:easyliving --verbose but it does not copy the container to the remote server. It still does nothing, only a blank screen, so I stopped it and executed

lxc copy --mode=push proxy office:proxy --verbose

This time I see it creates a container with the same name on the remote server (the previous command did not even create the name), but apparently the command does not finish for some reason. I stopped the command after 30 minutes of waiting. I logged into the remote server where the container should be copied and tried to start the container, but it would not start. I deleted the newly created container from the lxc copy --mode=push proxy office:proxy --verbose attempt.

Then I ran the command
lxc copy --mode=push --refresh --stateless proxy office:proxy --verbose
which completed successfully.

Does it work from an LXD 5.0.2 server to an LXD 5.0.2 backup server?

We don’t generally support migrating to an older server.

Copying from a newer to an older server worked perfectly (from 5.10 to 5.0.2), but the problem was those @migration files.
Now I have updated the backup server and brought the versions in line, but after deleting the migration files I could not copy from the primary to the secondary server. When I typed the command
lxc copy --mode=push --refresh --stateless proxy office:proxy --verbose
it did not do anything; it did not even create the name on the backend server. I removed --refresh --stateless from the command:
lxc copy --mode=push proxy office:proxy --verbose
That created the container name on the backend server, but it did not copy the container. I waited about 30 minutes to see whether the command would complete. After seeing that there was no result, I stopped the command, deleted the name of the container from the backend server, and restarted the copy with
lxc copy --mode=push --refresh --stateless proxy office:proxy --verbose
The container was then copied successfully.
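For reference, the sequence that ended up working here was roughly the following (the removal of the half-created target is shown as a hypothetical remote-side command; adjust names to your own setup):

lxc delete office:proxy
lxc copy --mode=push --refresh --stateless proxy office:proxy --verbose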