Why does LXD kill this import process, what does "Error: Post hook: In use" even mean, and how do I solve it?

Okay, do you have any leftover zfs receive or zfs recv processes running?

Is there a way to track this? Does LXD actually log a reason for something like this somewhere? I’d like to avoid having to stay in front of the screen for 1hr+ just to see if the resource consumption tries to tick above some sort of limit.

On a side note, hypothetically, why would an import run out of resources on the same system that the container had been running for over a year?

Nothing shows up in the process list:

$ ps -ef | grep zfs
ray       253974   55644  0 22:38 pts/1    00:00:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn zfs

The way restore works is:

  • CLI opens the file and sends it over the REST API to LXD
  • LXD streams it through a decompressor and tarball parser
  • tarball parser locates the binary blob for zfs and streams that to zfs receive
  • zfs receive finally writes the data back onto the dataset

So there are many places where a tiny mistake could introduce buffering, loading the entire file into memory; that can exhaust RAM very quickly and cause the kernel's OOM killer to start killing processes.
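
Very roughly, the streaming path looks something like the Go sketch below. This is illustrative only, not LXD's actual code; the blob name matched inside the tarball and the helper names are assumptions, while the zfs receive flags are the ones observed in this thread.

package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"os/exec"
	"strings"
)

// restore streams a compressed backup into `zfs receive` without ever
// holding the whole file in memory.
func restore(backup io.Reader, dataset string) error {
	gz, err := gzip.NewReader(backup) // decompress as a stream
	if err != nil {
		return err
	}
	defer gz.Close()

	tr := tar.NewReader(gz) // walk the tarball entry by entry
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return fmt.Errorf("no ZFS send stream found in backup")
		}
		if err != nil {
			return err
		}
		if strings.HasSuffix(hdr.Name, ".bin") { // assumed name of the optimized ZFS blob
			cmd := exec.Command("zfs", "receive", "-x", "mountpoint", "-F", dataset)
			cmd.Stdin = tr // pipe the blob straight into zfs receive
			cmd.Stderr = os.Stderr
			return cmd.Run()
		}
	}
}

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := restore(f, os.Args[2]); err != nil {
		panic(err)
	}
}

If any stage of such a pipeline reads its input fully before writing downstream, the whole ~61G backup ends up in RAM, which is where the OOM killer comes in.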

What LXD version is that?

$ lxc --version
4.14
$ lxd --version
4.14

This is from the snap:

$ snap list
Name    Version   Rev    Tracking       Publisher   Notes
core18  20210507  2066   latest/stable  canonical✓  base
lxd     4.14      20450  latest/stable  canonical✓  -
snapd   2.50      11841  latest/stable  canonical✓  snapd

I’m running a new import attempt, but looking at resource usage more closely. For reference, the compressed image being imported is ~61G and the system has 8G of RAM. I’ll report back as soon as I have something, but so far, memory usage is <= 5.1G or ~64% usage.

As a quick note, I had to import a backup earlier this year, after I woke up to a corrupted SQLite database. That was a few months ago and I never ran into this particular issue.

The only differences that I can note this time around:

  1. I uninstalled and re-installed snapd, lxd, etc. (snap remove --purge lxd commands got stuck for several days and snap abort ID never aborted the task; I couldn’t do anything, so I opted to nuke snap itself…);
  2. I also destroyed and re-created the datasets within the ZFS pool (not the pool itself, just to make sure I wasn’t dealing with some leftover stuff);
  3. I explicitly used the commands below to set mount points (in the previous pool, even after telling LXD to use tank/lxd, most mount points would show up as None);
  4. Re-ran lxd init.

The commands to set the mountpoints:

$ sudo zfs set mountpoint=/mnt/tank/lxd tank/lxd
$ for dataset in containers custom deleted images virtual-machines; do
    sudo zfs set mountpoint=/mnt/tank/lxd/${dataset} tank/lxd/${dataset}
done
$ for dataset in containers custom images virtual-machines; do
    sudo zfs set mountpoint=/mnt/tank/lxd/deleted/${dataset} tank/lxd/deleted/${dataset}
done

The commands were built based on the datasets that LXD itself created after being told to use tank/lxd during the lxd init phase.

The In Use error means that LXD thinks there is still an outstanding internal use of that volume and is preventing it from being unmounted by the post-import hook (which likely wants to resize the volume based on the metadata inside the backup).

This was likely caused by an earlier, different failure of the import that left a stale reference behind.

If you reload LXD, that should clear the reference and allow a retry. Did you have any earlier failed imports, perhaps due to running out of space? That might indicate a bug that's not cleaning up the mount reference.

So, after another 1.5hrs, it failed again with Error: Post hook: In use.

Observations:

  1. The largest amount of memory observed being used was 5.22G (from htop);
  2. The last relevant commands I noticed running were gzip -d and zfs receive -x mountpoint -F tank/lxd/containers/arma (the htop TIME+ column read roughly 20:15 for gzip and 1:37 for zfs receive);
  3. The ZFS pool was at <= 75% usage;

No more than ~65% of the total RAM was used, so this shouldn't be the problem. The state of the ZFS pool and datasets shortly before the import failure is shown below:

NAME                                USED  AVAIL     REFER  MOUNTPOINT
tank                                186G   263G     60.7G  /mnt/tank
tank/lxd                            126G   263G       28K  /mnt/tank/lxd
tank/lxd/containers                64.8G   263G       24K  /mnt/tank/lxd/containers
tank/lxd/containers/arma           64.8G   263G     64.8G  /mnt/tank/lxd/containers/arma
tank/lxd/custom                    60.7G   263G       24K  /mnt/tank/lxd/custom
tank/lxd/custom/default_backups    60.7G   107G     60.7G  /var/snap/lxd/common/lxd/storage-pools/lxd/custom/default_backups
<snip>

EDIT: Appending to this post because this system now decided that it doesn't want to let me post any more unless I wait 16 hours :face_with_symbols_over_mouth:

It’d be nice if there were better error messages, i.e. messages that are more informative and useful in general.

What do you mean by “reload lxd”? Are you talking about snap restart lxd or do you mean something else?

Yes, I had earlier failed imports, but not due to the ZFS pool/dataset running out of space. You can see the relevant output from zfs list in my previous post.

EDIT: @tomp: I don’t seem to have the option to PM you.

Was that after a reload of lxd?

You can do systemctl reload snap.lxd.daemon

Do you recall what error you got before the in use errors? I think you can private message me whilst you're within your new user daily limit.

I notice you were the author of "How can I control the file system path used by lxc import commands to avoid failures?", suggesting you originally ran out of disk space. This gives me a useful hint as to what left the orphaned internal reference counter which is now preventing import completion.

The systemctl reload snap.lxd.daemon command allowed the import process to complete successfully, so whatever LXD was failing to clean up got cleaned up by reloading the process. That seems like a bug. (Also, more useful/informative error messages would be helpful.)

I’m answering the rest of the questions just for reference.

The only errors I had observed follow:

  1. Error about running out of disk space. This was before my first topic on how to get LXD to use the ZFS pool for imports. (Note that my previous comment specified that I hadn’t run out of space while using the ZFS pool. They’re also on different mountpoints.)

  2. Error about missing profile. (I manually created the profile and tried again. Would be nice if this could be checked very early in the process; I ran into this at least 15 minutes after launch.)

  3. The terminated error in the OP, which shows up whenever I detach from my byobu session.

  4. The Error: Post hook: In use.

I’ve not observed any other error messages besides these 4. Errors #1 and #2 are solved already, #3 has no explanation but is very easy to reproduce, and #4 is the one I was stuck on.

Yes, but that was when LXD was using an unexpected location in the root file system instead of the ZFS pool. Prior to using the ZFS pool/dataset, it’d get to ~14% before running the USB drive out of space.

Yes, I suspect the bug was caused when the import process was killed when you disconnected from the terminal. I'll try to recreate that using screen or tmux too. I suspect whatever is killing it is somehow preventing the cleanup of resources created during the transfer from taking place.

We could change the ‘in use’ error to ‘volume in use’ which is all LXD really knows about that error.

Glad you got it resolved though.

Thanks. I marked your command as the solution to the original issue.

Also, I noticed that if I enter a container, i.e. lxc exec <container> -- zsh, as soon as I press F6 to detach from my byobu session, something happens, and when I get back, the terminal I had left inside the container is all messed up. I basically have no choice but to exit and then open a new PTY via F2. I had seen this before (months ago), when it was somewhat rarer, but now it's almost predictable.

I’ve just tried using byobu to start a local terminal and then creating a container and installing zsh inside it.

I then ran lxc exec <container> -- zsh inside the byobu session and pressed F6 to detach.

When I reattached using the byobu command, the zsh shell inside the container was still running and usable. I repeated this a few times without being able to see any issues.

I also tried exporting and importing a ZFS-backed container using both optimized and unoptimized storage, and while it was importing I pressed F6 to detach and left it detached until the LXD debug logs showed the import had succeeded. I then reattached and the container had imported fine.

I’ll look into whether it is possible that a failed import can leave an orphaned volume in-use counter around, but I’ve not been able to recreate the issues with byobu.

What OS and version are you running? I'm wondering if you have a version of systemd that is killing your running commands when you detach from the shell.

This bug came out of a discussion on the #lxc channel recently, and I thought it could maybe be partially related: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989317

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

$ systemd --version
systemd 245 (245.4-4ubuntu3.6)
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid

I’ve recorded a demo of the problem. The unlisted video can be seen here: https://youtu.be/dJcaTYpZ4PQ

I tried one where I reconnected immediately after detaching and was surprised to not see the problem. For the demo clip, I waited ~23 seconds before reconnecting and the problem was observed. (I edited out the wait time in-between commands to save time, but no other edits were made.)

I believe the profile error you got occurred during the import stage where the instance's config is restored into the LXD database. This stage occurs after the instance has been unpacked to the storage pool, because in cases where the backup uses the optimized storage format we cannot access the instance's backup.yaml any earlier: it is stored inside the optimized binary blob from the storage layer.

This is why that error cannot be reported earlier in the process currently.

I also believe that is where the bug occurred that prevented future imports.
In order to prevent volumes from being incorrectly unmounted while there are ongoing operations occurring on them, we maintain an internal mount counter. It increments when an operation requests that a volume be mounted (even if it is already mounted) and decrements when the operation is done with the volume. If at that point the counter is zero, we also unmount the volume, as there are no other ongoing operations happening on it.
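
As a rough illustration, a reference-counted mount could look something like this Go sketch. This is a simplified model of the behaviour described above, not LXD's actual code; the names, error text, and the stale counter set in main are illustrative only.

package main

import (
	"errors"
	"fmt"
	"sync"
)

type refVolume struct {
	mu    sync.Mutex
	users int // ongoing operations that asked for the volume to be mounted
}

func (v *refVolume) Mount() {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.users == 0 {
		fmt.Println("actually mounting") // first user triggers the real mount
	}
	v.users++
}

var errInUse = errors.New("In use")

func (v *refVolume) Unmount() error {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.users > 0 {
		v.users--
	}
	if v.users > 0 {
		// Another operation (or a stale reference) still appears to need
		// the mount, so refuse to unmount.
		return errInUse
	}
	fmt.Println("actually unmounting") // last user is done
	return nil
}

func main() {
	v := &refVolume{}
	v.users = 1              // stale reference left behind by the earlier failed import
	v.Mount()                // new import mounts the volume (counter is now 2)
	fmt.Println(v.Unmount()) // post hook tries to unmount and gets "In use"
}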

In this case there is a bug in the ZFS driver that is fixed with this PR:

In this case the following steps were happening:

  1. Import unpacked the backup into new ZFS volumes.
  2. Volume was mounted ready for accessing the backup.yaml (incrementing the mount counter).
  3. Restoration process completes (from a storage driver perspective) and a revert hook function is returned that can be called to clean up the imported ZFS volumes if a future error occurs.
  4. DB import process begins by accessing the backup.yaml on the mounted volume and tries to recreate the DB records.
  5. DB import process fails due to missing profile.
  6. Revert hook is run removing the ZFS volumes (and by implication the actual ZFS mount is cleared).
  7. However, because LXD wasn't aware of this implicit unmount, the mount counter was left at 1.
  8. Subsequent import succeeds, and after DB import phase, we try to unmount the volume in order to resize it to the size specified in the newly created DB records.
  9. This fails with the In Use error due to the old mount counter.

The fix in the PR above is to include the UnmountVolume call that was part of the local reverter in CreateVolumeFromBackup in the returned revert hook function as well. This way, if the DB import stage fails, the volume is unmounted by LXD before it is removed, so LXD can decrement the mount counter.
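
To illustrate the shape of the fix, here is a minimal Go sketch of a revert hook that unmounts before deleting. The helper names are hypothetical and this is not the actual LXD ZFS driver code; it only shows how the returned hook keeps the mount counter consistent.

package main

import "fmt"

type volume struct{ name string }

// unmountVolume stands in for LXD's own unmount path, which decrements the
// internal mount counter; deleteVolume stands in for the ZFS destroy.
func unmountVolume(v *volume) { fmt.Println("unmount", v.name) }
func deleteVolume(v *volume)  { fmt.Println("destroy", v.name) }

// createVolumeFromBackup unpacks the backup and returns a revert hook that
// the caller runs if a later stage (e.g. the DB import) fails.
func createVolumeFromBackup(name string) (*volume, func()) {
	v := &volume{name: name}
	revertHook := func() {
		// The fix: go through LXD's unmount first so the in-use counter
		// is decremented, then remove the volume.
		unmountVolume(v)
		deleteVolume(v)
	}
	return v, revertHook
}

func main() {
	_, revert := createVolumeFromBackup("tank/lxd/containers/arma")
	// Pretend the DB import stage failed, e.g. due to the missing profile:
	revert()
}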