After upgrade and reboot I can't start containers running on encrypted ZFS

I upgraded my LXD host from Ubuntu 20.04.x to 22.04.2 and then rebooted it. Since then, my two containers that run on an encrypted ZFS dataset won’t start.

The error I get is:

# lxc start mycontainer
Error: Failed to mount "zpool0/enc/lxd-storage/containers/mycontainer" on "/var/snap/lxd/common/lxd/storage-pools/enc/containers/mycontainer" using "zfs": input/output error
Try `lxc info --show-log mycontainer` for more info

The log command doesn’t show anything useful:

# lxc info --show-log mycontainer
Name: mycontainer
Type: container
Architecture: x86_64
Created: 2021/12/06 00:31 AEDT
Last Used: 2023/02/18 15:39 AEDT


Note that after a reboot I have to mount the encrypted ZFS dataset manually by running zfs mount -l -a and entering the passphrase. I know this worked because I can see data in the dataset. However, when I did this I got errors for some backup datasets I had made in the past:

# zfs mount -l -a
cannot mount 'zpool0/containerbackups/mycontainer': Input/output error
cannot mount 'zpool0/containerbackups/myothercontainer': Input/output error

I don’t know whether that’s related, or even the cause.
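One way to double-check that the keys really are loaded is to query the keystatus and mounted properties for every filesystem under the encrypted dataset. This is a minimal sketch, assuming the dataset path zpool0/enc from the question; find_unready is a hypothetical helper name, not part of ZFS or LXD. Datasets with mountpoint=legacy, such as LXD’s container datasets, are mounted by LXD itself, so “not mounted” is expected for them while the container is stopped.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS or LXD). Reads the tab-separated output of
#   zfs get -H -t filesystem -r -o name,property,value keystatus,mounted <dataset>
# on stdin and prints datasets whose encryption key is not loaded or which are
# not mounted. Unencrypted datasets report keystatus "-" and are ignored.
find_unready() {
  awk -F '\t' '
    $2 == "keystatus" && $3 == "unavailable" { print $1 ": key not loaded" }
    $2 == "mounted"   && $3 == "no"          { print $1 ": not mounted" }
  '
}

# Usage on the host (dataset path taken from the question):
#   zfs get -H -t filesystem -r -o name,property,value keystatus,mounted zpool0/enc | find_unready
```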

LXD version:

# snap list lxd
Name  Version       Rev    Tracking       Publisher   Notes
lxd   5.11-ad0b61e  24483  latest/stable  canonical✓  -

The dataset is there:

# zfs list | grep mycontainer
zpool0/containerbackups/mycontainer                                                                             494M   484G     4.02T  /zpool0/containerbackups/mycontainer
zpool0/enc/lxd-storage/containers/mycontainer                                                                  5.80T   484G     5.24T  legacy
zpool0/enc/tmp/mycontainer-pristine                                                                            27.9M   484G      633M  none

The snap’s mount namespace has mounts for my other (non-encrypted) running containers but none for the encrypted ones, which makes sense since they’re not running:

# nsenter --mount=/run/snapd/ns/lxd.mnt -- cat /proc/mounts | grep lx
... (snip) ...
zpool0/main/lxd-storage/containers/othercontainer /var/snap/lxd/common/lxd/storage-pools/default/containers/othercontainer zfs rw,relatime,xattr,posixacl 0 0
... (snip) ...

Once the encrypted dataset was mounted, I also tried restarting LXD:

systemctl restart snap.lxd.daemon

This restarted all the running (non-encrypted) containers, but I still couldn’t start the encrypted ones.

What can I do to get this working, please?

Anything useful in dmesg?

Nothing obvious. There are some audit messages about binaries inside the containers that do run, and some network events such as the lxdbr0 interface going up and down.

It might be worth running a zpool scrub to see whether anything else is wrong.

Yeah, doing that now, thanks.
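For reference, a scrub can be run and checked roughly as follows. This sketch assumes OpenZFS 2.x as shipped with Ubuntu 22.04, where zpool scrub -w blocks until the scrub has finished, and the pool name zpool0 from the question; scrub_is_clean is a hypothetical helper that only looks for the “No known data errors” line in zpool status output.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS). Succeeds only if the `zpool status`
# output on stdin contains the line reporting no known data errors.
scrub_is_clean() {
  grep -q '^ *errors: No known data errors'
}

# Usage on the host (pool name taken from the question):
#   zpool scrub -w zpool0
#   if zpool status -v zpool0 | scrub_is_clean; then
#     echo "scrub found no data errors"
#   else
#     zpool status -v zpool0   # -v lists the files affected by permanent errors
#   fi
```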

Are there any zpool ‘features’ that LXD relies on? My last reboot was more than a year ago, so if LXD started using a feature since then, the fact that I might not have it enabled could be the problem.
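To see which features the pool has and hasn’t enabled, the feature@ pool properties can be listed. This is a minimal sketch, assuming the pool name zpool0 from the question; disabled_features is a hypothetical helper, and it only shows features the installed ZFS supports but the pool has not enabled, not which features LXD actually needs. Running zpool upgrade with no arguments also lists pools that don’t have all supported features enabled.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS). Reads the tab-separated output of
#   zpool get -H -o property,value all <pool>
# on stdin and prints the names of features whose state is "disabled".
# Feature states are "disabled", "enabled" (not yet used) or "active" (in use).
disabled_features() {
  awk -F '\t' '$1 ~ /^feature@/ && $2 == "disabled" { sub(/^feature@/, "", $1); print $1 }'
}

# Usage on the host (pool name taken from the question):
#   zpool get -H -o property,value all zpool0 | disabled_features
```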

It turned out to be a ZFS issue: some metadata was missing after the upgrade. See openzfs#13709.