Why does LXD kill this import process, what does "Error: Post hook: In use" even mean, and how do I solve it?

Hi,

I need additional recommendations. I’m now running into the following import issue, and I haven’t found anything meaningful in the logs (below) or elsewhere.

Does anyone know what’s causing these import failures and how to solve them?

When using byobu, and only when using it, if I detach (i.e., F6) from my session, LXD runs the process for a bit, and then kills it:

$ lxc import --verbose /mnt/tank/arma@wednesday.tar.gz arma --storage lxd
Importing instance: 41% (32.58MB/s)[1]    3032775 terminated  lxc import --verbose /mnt/tank/arma@wednesday.tar.gz arma --storage lxd

There’s no obvious reason for LXD to kill the process, and LXD doesn’t seem to log why it did so anywhere. I’ve never had this issue with non-LXD commands, either.

OTOH, if I remain in the session, LXD always gets to 100%, but then fails[1] with this error message:

$ lxc import --verbose /mnt/tank/arma@wednesday.tar.gz arma --storage lxd
Error: Post hook: In use

What does that even mean? I’ve seen this multiple times already and this has cost me several days/nights now… I’ve not found anything meaningful about it and I’m growing increasingly frustrated with LXD as a whole…

For reference, this is what /var/snap/lxd/common/lxd/logs/lxd.log has:

<snip>
t=2021-05-28T16:34:30-0600 lvl=info msg="Pruning expired instance backups"
t=2021-05-28T16:34:30-0600 lvl=info msg="Done pruning expired instance backups"
t=2021-05-28T16:34:30-0600 lvl=info msg="Updating images"
t=2021-05-28T16:34:30-0600 lvl=info msg="Done updating images"
t=2021-05-28T16:47:39-0600 lvl=info msg="Creating container" ephemeral=false instance=arma instanceType=container project=default
t=2021-05-28T16:47:39-0600 lvl=info msg="Created container" ephemeral=false instance=arma instanceType=container project=default
t=2021-05-28T16:47:39-0600 lvl=info msg="Deleting container" created=2020-10-17T15:51:33-0600 ephemeral=false instance=arma instanceType=container project=default used=2021-04-02T11:04:19-0600
t=2021-05-28T16:48:16-0600 lvl=info msg="Deleted container" created=2020-10-17T15:51:33-0600 ephemeral=false instance=arma instanceType=container project=default used=2021-04-02T11:04:19-0600

That seems to just be logging what LXD did after everything failed, but nothing else. I’m re-running the import again as I write this, with the --debug option to see if anything meaningful shows up. In the meantime, I hope someone has a better idea about this than I do.
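That is, the same command as above with the global --debug flag added:

$ lxc import --debug --verbose /mnt/tank/arma@wednesday.tar.gz arma --storage lxd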

Thanks.

[1] A prior failure was that LXD waited until the import reached 100% before checking whether requirements had been met (e.g., required profiles). Those checks should happen up front, instead of wasting who knows how much of the user’s time only to fail the process at 100%. When things must fail, they should fail quickly.

My guess would be that your system is running out of memory and random things are getting killed, causing those symptoms.

dmesg may have some interesting output
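If it is the OOM killer, something along these lines should show it (the exact grep patterns are just a suggestion):

$ dmesg -T | grep -iE 'out of memory|oom|killed process'
$ journalctl -k | grep -i oom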

I doubt it, for at least two reasons:

  1. I’ve had htop running and memory usage hasn’t been observed to go over 75%.
  2. As I mentioned, whenever I remain in my session, the import progress always reaches 100% and then fails with the Error: Post hook: In use message.

I’d expect that if it were a memory usage issue, I wouldn’t be observing the different behaviors I just mentioned.

Not really… only entries like these (piped to less -N, so ignore the numbers):

939 [239268.274666] audit: type=1400 audit(1622079039.021:53): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.hook.install" pid=436258 comm="apparmor_parser"
940 [239268.289587] audit: type=1400 audit(1622079039.037:54): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.activate" pid=436252 comm="apparmor_parser"
941 [239268.294904] audit: type=1400 audit(1622079039.041:55): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.buginfo" pid=436254 comm="apparmor_parser"
942 [239268.297261] audit: type=1400 audit(1622079039.045:56): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.benchmark" pid=436253 comm="apparmor_parser"
943 [239268.297703] audit: type=1400 audit(1622079039.045:57): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.hook.remove" pid=436259 comm="apparmor_parser"
944 [239268.298168] audit: type=1400 audit(1622079039.045:58): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.check-kernel" pid=436255 comm="apparmor_parser"
945 [239268.326366] audit: type=1400 audit(1622079039.073:59): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.hook.configure" pid=436257 comm="apparmor_parser"
946 [239268.332682] audit: type=1400 audit(1622079039.081:60): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.daemon" pid=436256 comm="apparmor_parser"

Not sure if that tells you anything, but, to me, it doesn’t…

Thanks in advance.

Hmm, that’s odd; the terminated is usually a clear sign of some kind of resource exhaustion.

The In use, I believe, comes from the fact that there are leftovers from the previous attempt.

What storage driver is being used here?

LXD is using a ZFS pool:

$ lxc storage list
+------+--------+----------+-------------+---------+
| NAME | DRIVER |  SOURCE  | DESCRIPTION | USED BY |
+------+--------+----------+-------------+---------+
| lxd  | zfs    | tank/lxd |             | 3       |
+------+--------+----------+-------------+---------+

And the list:

$ zfs list
NAME                                USED  AVAIL     REFER  MOUNTPOINT
tank                               60.7G   389G     60.7G  /mnt/tank
tank/lxd                            272K   389G       28K  /mnt/tank/lxd
tank/lxd/containers                  24K   389G       24K  /mnt/tank/lxd/containers
tank/lxd/custom                      48K   389G       24K  /mnt/tank/lxd/custom
tank/lxd/custom/default_backups      24K   168G       24K  /var/snap/lxd/common/lxd/storage-pools/lxd/custom/default_backups
tank/lxd/deleted                    124K   389G       28K  /mnt/tank/lxd/deleted
tank/lxd/deleted/containers          24K   389G       24K  /mnt/tank/lxd/deleted/containers
tank/lxd/deleted/custom              24K   389G       24K  /mnt/tank/lxd/deleted/custom
tank/lxd/deleted/images              24K   389G       24K  /mnt/tank/lxd/deleted/images
tank/lxd/deleted/virtual-machines    24K   389G       24K  /mnt/tank/lxd/deleted/virtual-machines
tank/lxd/images                      24K   389G       24K  /mnt/tank/lxd/images
tank/lxd/virtual-machines            24K   389G       24K  /mnt/tank/lxd/virtual-machines

Assuming that’s the case, how can those leftovers be completely removed from the picture so that they stop interfering?

Was the backup made with --optimized-storage?

Yes. The original container it was exported from also lived on a ZFS pool on the same system.

Okay, do you have any leftover zfs receive or zfs recv processes running?

Is there a way to track this? Does LXD actually log a reason for something like this somewhere? I’d like to avoid having to stay in front of the screen for 1hr+ just to see if the resource consumption tries to tick above some sort of limit.

On a side note, hypothetically, why would an import run out of resources on the same system on which the container had been running for over a year?

Nothing shows up in the process list:

$ ps -ef | grep zfs
ray       253974   55644  0 22:38 pts/1    00:00:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn zfs

The way restore works is:

  • CLI opens the file and sends it over the REST API to LXD
  • LXD streams it through a decompressor and tarball parser
  • tarball parser locates the binary blob for zfs and streams that to zfs receive
  • zfs receive finally writes the data back onto the dataset

So there are many places where a tiny mistake could introduce buffering, causing the file to be loaded entirely into memory, which can very quickly exhaust RAM and make the kernel start killing processes. A rough manual equivalent of that pipeline is sketched below.
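For illustration only, the equivalent hand-built pipeline would look roughly like this (the backup/container.bin member name is an assumption about the tarball layout, not something confirmed here; the zfs receive arguments are the ones seen later in htop):

$ gzip -dc /mnt/tank/arma@wednesday.tar.gz \
    | tar -xOf - backup/container.bin \
    | sudo zfs receive -x mountpoint -F tank/lxd/containers/arma

Each stage streams into the next, so in principle nothing should need to hold the whole file in memory.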

What LXD version is that?

$ lxc --version
4.14
$ lxd --version
4.14

This is from the snap:

$ snap list
Name    Version   Rev    Tracking       Publisher   Notes
core18  20210507  2066   latest/stable  canonical✓  base
lxd     4.14      20450  latest/stable  canonical✓  -
snapd   2.50      11841  latest/stable  canonical✓  snapd

I’m running a new import attempt, but looking at resource usage more closely. For reference, the compressed image being imported is ~61G and the system has 8G of RAM. I’ll report back as soon as I have something, but so far, memory usage is <= 5.1G or ~64% usage.
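To avoid sitting in front of htop the whole time, I’m also logging usage in the background with a quick-and-dirty loop along these lines (the log path is just an arbitrary choice):

$ while true; do date; free -m; zfs list tank; sleep 60; done >> /tmp/import-usage.log 2>&1 &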

As a quick note, I had to import a backup earlier this year, after I woke up to a corrupted SQLite database. That was a few months ago and I never ran into this particular issue.

The only differences that I can note this time around:

  1. I uninstalled and re-installed snapd, lxd, etc. (snap remove --purge lxd commands got stuck for several days and snap abort ID never aborted the task; I couldn’t do anything, so I opted to nuke snap itself…);
  2. I also destroyed and re-created the datasets within the ZFS pool (not the pool itself, just to make sure I wasn’t dealing with some leftover stuff);
  3. I explicitly used the commands below to set mount points (in the previous pool, even after telling LXD to use tank/lxd, most mount points would show up as None);
  4. Re-ran lxd init.

The commands to set the mountpoints:

$ sudo zfs set mountpoint=/mnt/tank/lxd tank/lxd
$ for dataset in containers custom deleted images virtual-machines; do
    sudo zfs set mountpoint=/mnt/tank/lxd/${dataset} tank/lxd/${dataset}
done
$ for dataset in containers custom images virtual-machines; do
    sudo zfs set mountpoint=/mnt/tank/lxd/deleted/${dataset} tank/lxd/deleted/${dataset}
done

The commands were built based on the datasets that LXD itself created after being told to use tank/lxd during the lxd init phase.
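To double-check the result, the mount points can be listed recursively with plain zfs list:

$ zfs list -r -o name,mountpoint tank/lxd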

The In use error means that LXD thinks there is still an outstanding internal use of that volume, which is preventing it from being unmounted by the post-import hook (likely to resize the volume based on the metadata inside the backup).

This was likely caused by an earlier, different failure of the import that left a reference behind.

If you reload LXD, that should clear the reference and allow a retry. Did you have any earlier failed imports, perhaps due to running out of space? That might indicate a bug that isn’t cleaning up the mount reference.

So, after another 1.5hrs, it failed again with Error: Post hook: In use.

Observations:

  1. The largest amount of memory observed being used was 5.22G (from htop);
  2. The last relevant commands I noticed running were gzip -d and zfs receive -x mountpoint -F tank/lxd/containers/arma (with the htop TIME+ column at roughly 20:15 for gzip and 1:37 for zfs receive);
  3. The ZFS pool usage was <= 75%.

No more than ~65% of the total RAM was used, so this shouldn’t be the problem. The ZFS pool and datasets are below, shortly before the import failure:

NAME                                USED  AVAIL     REFER  MOUNTPOINT
tank                                186G   263G     60.7G  /mnt/tank
tank/lxd                            126G   263G       28K  /mnt/tank/lxd
tank/lxd/containers                64.8G   263G       24K  /mnt/tank/lxd/containers
tank/lxd/containers/arma           64.8G   263G     64.8G  /mnt/tank/lxd/containers/arma
tank/lxd/custom                    60.7G   263G       24K  /mnt/tank/lxd/custom
tank/lxd/custom/default_backups    60.7G   107G     60.7G  /var/snap/lxd/common/lxd/storage-pools/lxd/custom/default_backups
<snip>

EDIT: Appending to this post because this system now decided that it won’t let me post any more unless I wait 16 hours :face_with_symbols_over_mouth:

It’d be nice if there were better error messages, i.e. messages that are more informative and useful in general.

What do you mean by “reload lxd”? Are you talking about snap restart lxd or do you mean something else?

Yes, I had earlier failed imports, but not due to the ZFS pool/dataset running out of space. You can see the relevant output from zfs list in my previous post.

EDIT: @tomp: I don’t seem to have the option to PM you.

Was that after a reload of lxd?

You can do systemctl reload snap.lxd.daemon
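For example, to clear any stale reference and then retry the import (reusing the command from earlier):

$ sudo systemctl reload snap.lxd.daemon
$ lxc import --verbose /mnt/tank/arma@wednesday.tar.gz arma --storage lxd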