Creating a new container results in zfs rename error!?

After creating a second zfs pool and re-creating my containers on the new pool, everything seemed to go smoothly. I then exported/re-imported my images and deleted the old pool. So far so good.
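
For reference, the export/re-import was done roughly along these lines (the paths and alias here are placeholders rather than the exact commands I ran, and the tarball name may differ depending on how the image was packed):

# export each image to a tarball, then import it again under an alias
lxc image export <fingerprint> /tmp/exported-image
lxc image import /tmp/exported-image.tar.gz --alias imported-image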

Now, when I try to create a new container from the imported image, the operation fails due to a zfs error, as shown below:

t=2019-02-11T16:25:23+0000 lvl=info msg="Creating container" ephemeral=false name=new_container
t=2019-02-11T16:25:24+0000 lvl=info msg="Created container" ephemeral=false name=new_container
t=2019-02-11T16:25:57+0000 lvl=eror msg="zfs rename failed: umount: /var/lib/lxd/storage-pools/pool00/containers/random_container: target is busy\n        (In some cases useful info about processes that\n         use the device is found by lsof(8) or fuser(1).)\ncannot unmount '/var/lib/lxd/storage-pools/pool00/containers/random_container': umount failed\n"
t=2019-02-11T16:25:57+0000 lvl=info msg="Deleting container" created=2019-02-11T16:25:23+0000 ephemeral=false name=new_container used=1970-01-01T00:00:00+0000
t=2019-02-11T16:25:57+0000 lvl=info msg="Deleted container" created=2019-02-11T16:25:23+0000 ephemeral=false name=new_container used=1970-01-01T00:00:00+0000

If I shut down the referenced “random_container”, another one is reported.

I am baffled as to why zfs would be trying to rename an existing dataset, let alone why it fails.

Any ideas?

What kernel are you running?

ZFS does very weird things when an image is renamed or moved, effectively causing all existing containers to get unmounted and remounted…

We did kernel work a while back to make this less of an issue, not sure if you’ve got a recent enough kernel for that though.
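
To illustrate (a hedged sketch; <fingerprint> stands in for the real dataset name on your pool): the move LXD performs behind the scenes is essentially a zfs rename, and on older kernels completing it can force the other mounted datasets on the pool, i.e. the containers, to be unmounted and remounted, so a busy container mount makes the whole operation fail with “target is busy”.

# roughly the kind of dataset move LXD performs when an image is renamed or moved
zfs rename pool00/images/<fingerprint> pool00/deleted/images/<fingerprint>
# on older kernels this can force other mounted datasets (the containers) to be
# unmounted and remounted; any busy container mount then fails the rename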

I just tried to create a container from an Ubuntu image and it worked fine, so I got suspicious about my image import. I found two things:

  1. It does not show up under my new_pool/images/

  2. If I try to edit the image, lxc monitor spits out this:

     metadata:
       context:
         ip: '@'
         method: PUT
         url: /1.0/images/db29f6f1afde60a2d86ff554712baa0a92fbc4bed2f2ff2910213e4e8bde4bde
       level: dbug
       message: handling
     timestamp: "2019-02-11T17:07:08.788745941Z"
     type: logging
    
    
     metadata:
       context: {}
       level: dbug
       message: 'Database error: &errors.errorString{s:"sql: no rows in result set"}'
     timestamp: "2019-02-11T17:07:08.795259536Z"
     type: logging
    

Kernel is 4.4.0-138-generic (Ubuntu 16.04.5 running LXD 3.0.3 from xenial-backports)

OK, you’d indeed need the 4.15 kernel from bionic (the HWE kernel) to have the kernel change that helps with this.
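
If you go that route, something like this should pull it in on 16.04 (assuming the standard HWE meta-package name):

# install the 16.04 HWE stack, which brings in the 4.15 kernel, then reboot
sudo apt update
sudo apt install --install-recommends linux-generic-hwe-16.04
sudo reboot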

Can you show zfs list -t all?

Your symptoms sound like you have the right image in deleted/images and that LXD is attempting to move it back to images, hitting that annoying ZFS behavior…
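
For instance, something along these lines should show where the image dataset currently lives (the grep pattern is just the image fingerprint from your logs):

# list all datasets and snapshots, filtering on the image fingerprint
zfs list -t all | grep db29f6f1afde60a2d86ff554712baa0a92fbc4bed2f2ff2910213e4e8bde4bde
# if it shows up under pool00/deleted/images/ rather than pool00/images/,
# that matches the rename LXD keeps attempting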

If that’s the case, you pretty much have two solutions:

  • Stop all containers, launch the new container, then restart all containers; that should take care of that image for good (a rough sketch of the sequence follows this list)
  • Upgrade to the HWE kernel (4.15) and reboot, which should then help ZFS deal with that weird behavior
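
A rough sketch of that first sequence (container and image names are placeholders; adjust to your setup, and note the final loop will also try to start containers that were already running or deliberately stopped):

# stop every container so nothing keeps the pool's mounts busy
for c in $(lxc list -c n --format csv); do lxc stop "$c"; done
# launch a container from the imported image; with nothing busy, LXD can
# move the image dataset back into place
lxc launch <image-alias> <new-container-name>
# bring the containers back up (starting the already-running one just errors harmlessly)
for c in $(lxc list -c n --format csv); do lxc start "$c"; done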

Spot on! My image shows up under pool00/deleted/images.

I was hoping to avoid a reboot/service interruption, but it looks like it’s the only way.

So, bad news. I restarted on kernel 4.15.0-45 and still get the same behavior.

If I try to destroy the dataset after deleting the image, I get:

zfs destroy pool00/deleted/images/db29f6f1afde60a2d86ff554712baa0a92fbc4bed2f2ff2910213e4e8bde4bde
cannot destroy 'pool00/deleted/images/db29f6f1afde60a2d86ff554712baa0a92fbc4bed2f2ff2910213e4e8bde4bde': filesystem has children
use '-r' to destroy the following datasets:
pool00/deleted/images/db29f6f1afde60a2d86ff554712baa0a92fbc4bed2f2ff2910213e4e8bde4bde@readonly
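
I assume the recursive form the error suggests would be the following, though I haven’t run it yet since I’m not sure whether anything else still depends on that snapshot:

# destroy the dataset together with its @readonly snapshot, as the error suggests
zfs destroy -r pool00/deleted/images/db29f6f1afde60a2d86ff554712baa0a92fbc4bed2f2ff2910213e4e8bde4bde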

Any thoughts?

This actually fixed it. Unfortunately the kernel on its own didn’t do it.