Error: Error transferring instance data: Failed getting instance storage pool name: Instance storage pool not found

Thanks Tom.

OK, I have the same error as described in the first post. The last successful backup was on 5 April 22.
The storage pool is ZFS, but the problem is also present locally, not only when copying a container to another host. With lxc monitor I can see that snapshots don’t get deleted, with the error:

description: Cleaning up expired instance snapshots
  err: 'Failed to delete expired instance snapshot "www-service/snapshot-536" in project
    "default": Instance storage pool not found'

I think @monstermunchkin is working on a fix for ZFS; only BTRFS has been fixed thus far.

Hi @tomp ,
I’m not sure I have understood your statement. From which version of LXD 5.0 onwards will the patch be included? I’m running snap version 5.0.0-b0287c1 - is the patch supposed to be in it already?
Thanks to the whole dev team!

OK, let’s have a look (figuring out what is in the snap is a bit convoluted because we don’t provide upstream minor point releases, so fixes have to be cherry-picked commit by commit directly into the snap package, which means the snap channel versions can diverge from the release version source git tags/tarballs), so I’ll show you my process:

First, the BTRFS fix was landed in this PR:

Then the ZFS fix was landed in this PR:

So let’s take one of the commits from each of the PRs:

Now let’s check the latest snap revision in latest/stable:

snap info lxd | grep 'latest/stable'
  latest/stable:    5.0.0-b0287c1 2022-04-20 (22923) 83MB -

So 5.0.0-b0287c1, and that b0287c1 at the end is the short commit hash from the latest-candidate git branch (latest-candidate is used to build the latest/candidate snap, which is then pushed to latest/stable outside of git).
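You can also confirm what is installed locally (version string and snap revision) with:

snap list lxd

which should show the same 5.0.0-b0287c1 / 22923 pair as above.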

So looking for that commit we get:

Now we just need to check whether the commits we want are in that snapcraft.yaml file at that commit point.

There’s the BTRFS one:

But I can’t see the ZFS one 06558ff0e0fa652e25c4ff7af561d02d57a1c5a7.
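If you want to repeat this check yourself, something along these lines should work (this assumes the cherry-picks are listed in snapcraft.yaml at the root of the lxd-pkg-snap repo; replace <commit> with the snap commit hash you are checking):

curl -s https://raw.githubusercontent.com/lxc/lxd-pkg-snap/<commit>/snapcraft.yaml | grep 06558ff0e0fa652e25c4ff7af561d02d57a1c5a7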

Now, looking at the date of the latest-candidate commit lxd: Cherry-pick upstream bugfixes · lxc/lxd-pkg-snap@b0287c1 · GitHub, it’s from 6 days ago.

And we can see at the bottom of the ZFS fix PR (lxd/storage/drivers/zfs: Fix optimized refresh in migration by monstermunchkin · Pull Request #10234 · lxc/lxd · GitHub) that it was merged 5 days ago.

So it looks like @stgraber has not done a cherry-pick sweep in the last week to pick up the ZFS fix yet.
I’m not sure if he is going to be doing another sweep this week or whether he is going to wait until 5.1 for the latest/stable channel (although I would expect these to be cherry-picked into the 5.0/stable LTS channel).
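(If you want to follow the LTS series once the fix is cherry-picked there, switching channel should just be a matter of snap refresh lxd --channel=5.0/stable, but do check which channel suits your deployment first.)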

Thank you very much, @tomp, for the detailed answer and the time you spent on it. So I should no longer have the issue (both source and target systems are on BTRFS), but that’s not the case.
I wonder if my issue is exactly the one in this thread, as I face other weird things (that were not present before LXD 5):

  • If I try locally on the target system to remove any snapshot, it gets deleted, but with the message Error: Instance storage pool not found.
  • lxc storage volume list default (cf. below) returns an empty list. Is that possible?
  • lxc storage show default (cf. below) shows no instance attached to it (whereas the profiles of my CTs show that they use the default storage pool).

I’m a little lost. Any help would be appreciated. If needed, I may open another thread.
Thank you!

lxc storage list
+-------------+--------+------------------------------------------------+-------------+---------+---------+
|    NAME     | DRIVER |                     SOURCE                     | DESCRIPTION | USED BY |  STATE  |
+-------------+--------+------------------------------------------------+-------------+---------+---------+
| default     | btrfs  | /var/snap/lxd/common/lxd/storage-pools/default |             | 4       | CREATED |
+-------------+--------+------------------------------------------------+-------------+---------+---------+
| device1To   | btrfs  | /disk/hd1to                                    |             | 1       | CREATED |
+-------------+--------+------------------------------------------------+-------------+---------+---------+
| device500Go | btrfs  | /disk/hd500go/                                 |             | 1       | CREATED |
+-------------+--------+------------------------------------------------+-------------+---------+---------+

lxc storage info default
info:
  description: ""
  driver: btrfs
  name: default
  space used: 1.21TiB
  total space: 3.63TiB
used by:
  profiles:
  - backupDevice4To
  - default
  - privateCT
  - publicCT

lxc storage volume list default
+------+------+-------------+--------------+---------+
| TYPE | NAME | DESCRIPTION | CONTENT-TYPE | USED BY |
+------+------+-------------+--------------+---------+

lxc storage show default
config:
  size: 100GB
  source: /var/snap/lxd/common/lxd/storage-pools/default
  volatile.initial_source: /var/snap/lxd/common/lxd/storage-pools/default
description: ""
name: default
driver: btrfs
used_by:
- /1.0/profiles/backupDevice4To
- /1.0/profiles/default
- /1.0/profiles/privateCT
- /1.0/profiles/publicCT
status: Created
locations:
- none

lxc config show gitrepo
[…]
profiles:
- publicCT
- backupDevice4To

lxc profile show backupDevice4To
config:
  user.comment: Stockage sur disque 4To
description: ""
devices:
  root:
    path: /
    pool: default
    type: disk
name: backupDevice4To
used_by:
- /1.0/instances/gitrepo
[…]

It’s possible that the original bug has left the target’s storage volume records inconsistent/missing.

Can you show me the exact command steps that are generating the errors please?
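If you can, add the client’s global --debug flag to the failing command (e.g. lxc copy <container> <remote>: --refresh --debug - names here are only placeholders), as that prints the underlying API requests and responses.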

I tried 3 commands on two CTs:

  • On target system: lxc delete CT1/snapshot-name. The snapshot is deleted, but with the message Error: Instance storage pool not found.

  • On source system: lxc copy CT2 targethost: --stateless --refresh. It returns Error: Failed instance creation: Error transferring instance data: Failed getting instance storage pool name: Instance storage pool not found.

  • On source system: lxc copy CT2 targethost: --stateless --refresh --mode=push, which was working with LXD 4.x. It returns Error: Failed instance migration: websocket: close 1000 (normal).

Adding -vv to the command does not generate any additional info.

I had these types of issues. I had to do a lot of manual cleanup, which involved manually deleting the BTRFS subvolumes, deleting any remaining folders, then using lxd sql to remove the database entries and finally deleting the containers.

The way I did it was like this (replace items in <> as appropriate)…
(Please be careful)

Get a list of all the subvols related to the container…
btrfs subvol list /srv | grep <container>
Then delete as appropriate with…
btrfs subvol delete <subvol>

Then go into the file system and look inside the lxd folder for containers and containers-snapshots and manually delete your container inside these folders as appropriate
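With the snap, those folders live under the storage pool path shown earlier in this thread, so roughly (paths are an assumption here, double-check on your own system):

ls /var/snap/lxd/common/lxd/storage-pools/<pool>/containers/ | grep <container>
ls /var/snap/lxd/common/lxd/storage-pools/<pool>/containers-snapshots/ | grep <container>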

Then check for any lxc database entries…
lxd sql global "SELECT * FROM storage_volumes WHERE name LIKE '<container>/%';"
Then delete as appropriate with…
lxd sql global "DELETE FROM storage_volumes WHERE name = '<storage_volume>'"

You can then finally delete the container using the usual command…
lxc delete <container>

Once it’s deleted, make a fresh copy from the source and then you can sync it again.

Although, having said that, attempting to refresh now gives the following:
Error: Error transferring instance data: Failed setting subvolume writable "/var/snap/lxd/common/lxd/storage-pools/default/containers/container": Failed to run:
btrfs property set -ts /var/snap/lxd/common/lxd/storage-pools/default/containers/container ro false: ERROR: Could not get subvolume flags: Invalid argument

@monstermunchkin are you able to help with this please?

Could you please post the exact steps which lead to this error?

First copy the container to the remote offsite server:
lxc stop container
lxc copy container server:container --storage=storage --mode=relay
lxc start container

Then attempt to do a refresh:
lxc copy container server:container --storage=storage --mode=relay --refresh

My guess is this is related to the fact that the containers have snapshots, which appear to be read-only?
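One way to check that guess (assuming standard btrfs-progs, and with the path being just an example) is to query the subvolume’s read-only property directly:

btrfs property get -ts /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/<container>/<snapshot> ro

which should print ro=true or ro=false.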

I cannot reproduce that.

Could you please also post how the containers and snapshots are created/deleted before copying or refreshing?

I’ll do some more tests on this tonight and see if I can come up with a full script.
FYI I’m running 5.0.0-b0287c1 on both ends, on 20.04 LTS.


The plot thickens…

It appears I can’t reproduce it either with any new containers I create, so whatever was causing the issue seems to be fixed now.

However, my existing 15 containers that were affected by the original BTRFS problem all have the same error, even though I freshly copied them across. (BTW it appears to be unrelated to the --storage option, as even the ones that are in the default pool on both ends have the same issue.)

Ahhh here we go…
I can’t delete them from the target either; I get the same BTRFS property set error.

OK let me completely delete one of them, clean up and try a fresh copy from scratch again and see what happens.

OK I just did a fresh copy of one of my existing containers to the target.
Then went to delete it on the target and got the exact same BTRFS error as when I do a refresh copy.

lxc delete container
Error: Error deleting storage volume: Failed setting subvolume writable "/var/snap/lxd/common/lxd/storage-pools/default/containers/container": Failed to run: btrfs property set -ts /var/snap/lxd/common/lxd/storage-pools/default/containers/container ro false: ERROR: Could not get subvolume flags: Invalid argument
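As a sanity check (just a guess on my part, path as in the error above), this should confirm whether that path is still a real subvolume or has ended up as a plain directory, which might explain the "Could not get subvolume flags: Invalid argument" part:

btrfs subvolume show /var/snap/lxd/common/lxd/storage-pools/default/containers/container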

So it is 100% related to this topic:

I will jump over and follow up on that topic so this one can be closed.

Up until now, I have always been very pleased with LXD, but I must say that such a bug, compromising the integrity of a whole server and killing all containers at once, is very disturbing. Fortunately, the affected server is a backup server, but still…

I’m going to do the same as @DanielBull: remove everything from the backup server and redo the copies from the source, i.e. the live server.

Yes, we are sorry about this; it has certainly exposed gaps in our automated testing, which have now been improved to hopefully catch these sorts of regressions in the future.

@monstermunchkin has some further fixes for BTRFS optimized refresh due for LXD 5.1:

I did so too. After upgrading to version 5.1 I deleted all containers on the target backup server, and afterwards I also removed the leftover ZFS datasets belonging to the containers that ‘lxc delete’ had left behind.

After that I was able to do one successful ‘copy --refresh’. When I try to do another ‘copy --refresh’ from the source to the target machine I now get the following error:

/snap/bin/lxc copy --mode push --refresh --stateless --storage default --config boot.autostart=false archive virt-slave:archive
Error: Failed instance migration: websocket: close 1006 (abnormal closure): unexpected EOF