An import task in LXD is taking too long

Yesterday I exported a container on one machine and then started importing it on an identical machine. The export only took a few minutes, but the import is still running. Moreover, the import reaches 100% in a few minutes, but then does not finish.

$ lxc import backup.tar.gz
Importing instance: 100% (61.53MB/s)

I see that the system load average is around 4, but the CPU usage is low:

%Cpu(s):  0.1 us,  0.1 sy,  0.0 ni, 95.1 id,  4.7 wa,  0.0 hi,  0.0 si,  0.0 st

On the same machine where I am importing the container, I have another running container that is not doing much work. The issue with that container is that I cannot exec commands in it. I get the following message:

$ lxc exec container bash
Error: failed to add "Executing command" Operation 18f30ef9-e3bf-4381-94cd-c1b8c911de49 to database: no more rows available

Maybe this problem is related to the fact that the import does not finish.

I am using LXD 4.11 installed via snap on an Ubuntu 18.04.5 machine.

Because it was taking too long, I decided to stop the import by pressing CTRL+C in the terminal that launched the task. The issue is that this did not stop the importing process.

$ lxc import sparql-prov-tpch.tar.gz
Importing instance: 100% (61.53MB/s)
Error: User signaled us three times, exiting. The remote operation will keep running

How can I stop the remote operation? I see some processes that could be related to it, but I guess there must be one process I can kill to stop the import and end up without a broken LXD system.
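
From what I can see in the LXD documentation, these background tasks are tracked as operations, so something like the following might let me cancel the stuck import (the UUID is a placeholder for whatever the list reports, I have not tried this yet):

$ lxc operation list
$ lxc operation delete <uuid-of-the-import-operation>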

By the way, I still have an inaccessible container. I tried to stop it with:

$ lxc stop container
Error: failed to add "Stopping instance" Operation 76a57a6b-77dc-4206-a90d-f15d407d8dff to database: no more rows available

What does “no more rows available” mean?

Sounds like your system isn’t doing super well, are you running out of disk space or memory?

I have no control over how the system is configured, and I cannot modify it. Instead, I create containers and nested containers to install software and run jobs according to my needs. The first issue I had is that some partitions are too small: /var and /tmp have only a few gigabytes, but I am creating containers that hold hundreds of gigabytes. I only have one big partition of seven terabytes, formatted as ext4.

First, I tried the dir storage driver on the big partition, but I noticed that some operations, such as copying containers, were very slow. Then I created a 3 TB loop disk, formatted it as btrfs, and mounted a directory from this loop disk on /var/snap. Now copying containers is faster and I can take snapshots, but I have had issues with other operations.
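
For reference, the setup was roughly like this (the backing file path and intermediate mount point are examples, not my exact paths):

$ truncate -s 3T /bigdisk/btrfs.img
$ mkfs.btrfs -f /bigdisk/btrfs.img
$ mkdir -p /mnt/btrfs
$ mount -o loop /bigdisk/btrfs.img /mnt/btrfs
$ mkdir -p /mnt/btrfs/snap
$ mount --bind /mnt/btrfs/snap /var/snap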

One is the aforementioned issue of importing a container. I have two machines: machine A and machine B. On machine A, I have a container Ca with several nested containers. I exported container Ca, and then tried to import the backup of Ca on machine B. The backup of container Ca is only 200 GB, so I expected it would not take as long as I described in the first post. Then I tried a workaround: I exported every single nested container in Ca, created an empty container Cb on machine B, and then imported each backup of Ca's nested containers into Cb, as sketched below. I had no problem importing these containers. So the workaround works, but it is less elegant and requires more work on my side.
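
Roughly, the per-container workaround looks like this (nested1 is an example name for one of the nested containers):

On machine A, inside container Ca:
$ lxc export nested1 nested1.tar.gz

On machine B, inside container Cb, after transferring the tarball:
$ lxc import nested1.tar.gz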

Another issue appeared when copying the nested containers from Ca to Cb. I copied the files from container Ca to machine B using scp, and then tried to use lxc file push to copy them from machine B into container Cb. The problem is that at this step I got an error because the /tmp partition filled up. Why is /tmp used for this operation? As I already mentioned, the /tmp partition has only a few gigabytes. So I followed another workaround: I simply copied the backup files to /var/snap/lxd/common/lxd/containers/Cb/rootfs/home/ubuntu/backups and then changed the owner of the directories. Next time I will simply mount the directory holding the backups into the parent container Cb (as described in https://www.cyberciti.biz/faq/how-to-add-or-mount-directory-in-lxd-linux-container/).
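
Following that article, mounting the host directory with the backups into Cb should be something like this (the source path is an example):

$ lxc config device add Cb backups disk source=/path/to/backups path=/home/ubuntu/backups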

I have discovered something about the loop disk that may be related to the long time the container import takes. I ran the following test on the same machine:

Copy 1 GB of zeros to a file on the ext4 partition:

$ time dd if=/dev/zero of=path1/test.img bs=1024k count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.64811 s, 124 MB/s

real    0m8.764s
user    0m0.000s
sys     0m0.836s

Copy 1 GB of zeros to the btrfs loop disk whose backing file lives on the ext4 partition of the previous example:

$ time dd if=/dev/zero of=path2/test.img bs=1024k count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.719496 s, 1.5 GB/s

real    0m0.860s
user    0m0.000s
sys     0m0.859s

Contrary to my expectation, the second operation was faster. However, I monitored the I/O with iotop and, in the second experiment, the I/O continued for some seconds after the command finished. So I guess the system reports that the data has been written to disk when it is actually still in the cache. This may be why the import shows 100% and then takes several hours to actually complete.

My second guess is that during the import some data is spilled to the /tmp partition (or another small partition), which then runs out of space. When I import the backup of a nested container, the data is spilled to the /tmp partition of the parent container, which lives on the big disk partition, instead of to the small partitions of the host system.

Try adding conv=fdatasync to your dd runs, otherwise you’re pretty much just measuring how fast your kernel can cache data :wink:

Thanks @stgraber! I repeated the experiment with more data (10 GB).
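
The commands were roughly the following (same dd invocation as before, with the larger count and the suggested flag):

$ time dd if=/dev/zero of=path1/test.img bs=1024k count=10K conv=fdatasync
$ time dd if=/dev/zero of=path2/test.img bs=1024k count=10K conv=fdatasync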

On the ext4 partition:
70.3668 s, 153 MB/s
68.8876 s, 156 MB/s

On the btrfs loop disk:
96.2231 s, 112 MB/s
100.217 s, 107 MB/s

This thus gives an estimate of the performance penalty of using a loop disk.
