Error: Error transferring instance data: Failed getting instance storage pool name: Instance storage pool not found

Thanks Tom.

OK, I have the same error as described in the first post. The last successful backup was on 5 April 22.
The storage pool is ZFS, but the problem is also present locally, not only when copying a container to another host. With lxc monitor I can see that snapshots don’t get deleted, with the error:

description: Cleaning up expired instance snapshots
  err: 'Failed to delete expired instance snapshot "www-service/snapshot-536" in project
    "default": Instance storage pool not found'

I think @monstermunchkin is working on a fix for ZFS; only BTRFS has been fixed thus far.

Hi @tomp ,
I’m not sure I have understood your statement. From which version of LXD 5.0 onwards will the patch be included? I’m running snap version 5.0.0-b0287c1 - is the patch supposed to be in it already?
Thanks to the whole dev team!

OK, let’s have a look (figuring out what is in the snap is a bit convoluted because we don’t provide upstream minor point releases, so fixes have to be cherry-picked commit by commit directly into the snap package, which means the snap channel versions can diverge from the release version source git tags/tarballs), so I’ll show you my process:

First, the BTRFS fix was landed in this PR:

Then the ZFS fix was landed in this PR:

So let’s take one of the commits from each of the PRs:

Now let’s check the latest snap revision in latest/stable:

snap info lxd | grep 'latest/stable'
  latest/stable:    5.0.0-b0287c1 2022-04-20 (22923) 83MB -

So 5.0.0-b0287c1, and that b0287c1 at the end is the short commit hash from the latest-candidate git branch (latest-candidate is used to build the latest/candidate snap, which is then pushed to latest/stable outside of git).
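You can also confirm what is installed locally (version string and snap revision) with:

snap list lxd

which should show the same 5.0.0-b0287c1 / 22923 pair as above.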

So looking for that commit we get:

Now we just need to check whether the commits we want are in that snapcraft.yaml file at that commit point.

There’s the BTRFS one:

But I can’t see the ZFS one 06558ff0e0fa652e25c4ff7af561d02d57a1c5a7.
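If you want to repeat this check yourself, something along these lines should work (this assumes the cherry-picks are listed in snapcraft.yaml at the root of the lxd-pkg-snap repo; replace <commit> with the snap commit hash you are checking):

curl -s https://raw.githubusercontent.com/lxc/lxd-pkg-snap/<commit>/snapcraft.yaml | grep 06558ff0e0fa652e25c4ff7af561d02d57a1c5a7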

Now, looking at the date of the latest-candidate commit lxd: Cherry-pick upstream bugfixes · lxc/lxd-pkg-snap@b0287c1 · GitHub, it’s from 6 days ago.

And we can see at the bottom of the ZFS fix PR (lxd/storage/drivers/zfs: Fix optimized refresh in migration by monstermunchkin · Pull Request #10234 · lxc/lxd · GitHub) that it was merged 5 days ago.

So it looks like @stgraber has not done a cherry-pick sweep in the last week to pick up the ZFS fix yet.
I’m not sure if he is going to be doing another sweep this week or whether he is going to wait until 5.1 for the latest/stable channel (although I would expect these to be cherry-picked into the 5.0/stable LTS channel).
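(If you want to follow the LTS series once the fix is cherry-picked there, switching channel should just be a matter of snap refresh lxd --channel=5.0/stable, but do check which channel suits your deployment first.)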

Thank you very much, @tomp, for the detailed answer and the time you spent on it. So I should no longer have the issue (both source and target systems are on BTRFS), but that’s not the case.
I wonder if my issue is exactly the one in this thread, as I face other weird things (that were not present before LXD 5):

  • If I try locally on the target system to remove any snapshot, it gets deleted, but with the message Error: Instance storage pool not found.
  • lxc storage volume list default (cf. below) returns an empty list. Is that possible?
  • lxc storage show default (cf. below) shows no instance attached to it (whereas the profiles of my CTs show that they use the default storage pool).

I’m a little lost. Any help would be appreciated. If needed, I may open another thread.
Thank you!

lxc storage list
+-------------+--------+------------------------------------------------+-------------+---------+---------+
|    NAME     | DRIVER |                     SOURCE                     | DESCRIPTION | USED BY |  STATE  |
+-------------+--------+------------------------------------------------+-------------+---------+---------+
| default     | btrfs  | /var/snap/lxd/common/lxd/storage-pools/default |             | 4       | CREATED |
+-------------+--------+------------------------------------------------+-------------+---------+---------+
| device1To   | btrfs  | /disk/hd1to                                    |             | 1       | CREATED |
+-------------+--------+------------------------------------------------+-------------+---------+---------+
| device500Go | btrfs  | /disk/hd500go/                                 |             | 1       | CREATED |
+-------------+--------+------------------------------------------------+-------------+---------+---------+

lxc storage info default
info:
  description: ""
  driver: btrfs
  name: default
  space used: 1.21TiB
  total space: 3.63TiB
used by:
  profiles:
  - backupDevice4To
  - default
  - privateCT
  - publicCT

lxc storage volume list default
+------+------+-------------+--------------+---------+
| TYPE | NAME | DESCRIPTION | CONTENT-TYPE | USED BY |
+------+------+-------------+--------------+---------+

lxc storage show default
config:
  size: 100GB
  source: /var/snap/lxd/common/lxd/storage-pools/default
  volatile.initial_source: /var/snap/lxd/common/lxd/storage-pools/default
description: ""
name: default
driver: btrfs
used_by:
- /1.0/profiles/backupDevice4To
- /1.0/profiles/default
- /1.0/profiles/privateCT
- /1.0/profiles/publicCT
status: Created
locations:
- none

lxc config show gitrepo
[…]
profiles:
- publicCT
- backupDevice4To

lxc profile show backupDevice4To
config:
  user.comment: Stockage sur disque 4To
description: ""
devices:
  root:
    path: /
    pool: default
    type: disk
name: backupDevice4To
used_by:
- /1.0/instances/gitrepo
[…]

It’s possible that the original bug has left the target’s storage volume records inconsistent/missing.

Can you show me the exact command steps that are generating the errors please?
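If you can, add the client’s global --debug flag to the failing command (e.g. lxc copy <container> <remote>: --refresh --debug - names here are only placeholders), as that prints the underlying API requests and responses.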

I tried 3 commands on two CTs:

  • On target system: lxc delete CT1/snapshot-name. The snapshot is deleted, but with the message Error: Instance storage pool not found.

  • On source system: lxc copy CT2 targethost: --stateless --refresh. It returns Error: Failed instance creation: Error transferring instance data: Failed getting instance storage pool name: Instance storage pool not found.

  • On source system: lxc copy CT2 targethost: --stateless --refresh --mode=push, which was working with LXD 4.x. It returns Error: Failed instance migration: websocket: close 1000 (normal).

Adding -vv to the command does not generate any additional info.

I had these types of issues. I had to do a lot of manual cleanup, which involved manually deleting the BTRFS subvolumes, deleting any remaining folders, then using lxd sql to remove the database entries and finally deleting the containers.

The way I did it was like this (replace items in <> as appropriate)…
(Please be careful)

Get a list of all the subvols related to the container…
btrfs subvol list /srv | grep <container>
Then delete as appropriate with…
btrfs subvol delete <subvol>

Then go into the file system and look inside the lxd folder for containers and containers-snapshots and manually delete your container inside these folders as appropriate
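With the snap, those folders live under the storage pool path shown earlier in this thread, so roughly (paths are an assumption here, double-check on your own system):

ls /var/snap/lxd/common/lxd/storage-pools/<pool>/containers/ | grep <container>
ls /var/snap/lxd/common/lxd/storage-pools/<pool>/containers-snapshots/ | grep <container>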

Then check for any lxc database entries…
lxd sql global "SELECT * FROM storage_volumes WHERE name LIKE '<container>/%';"
Then delete as appropriate with…
lxd sql global "DELETE FROM storage_volumes WHERE name = '<storage_volume>'"

You can then finally delete the container using the usual command…
lxc delete <container>

Once it’s deleted, make a fresh copy from the source and then you can sync it again.

Although, having said that, attempting to refresh now gives the following:
Error: Error transferring instance data: Failed setting subvolume writable "/var/snap/lxd/common/lxd/storage-pools/default/containers/container": Failed to run:
btrfs property set -ts /var/snap/lxd/common/lxd/storage-pools/default/containers/container ro false: ERROR: Could not get subvolume flags: Invalid argument

@monstermunchkin are you able to help with this please?

Could you please post the exact steps which lead to this error?

First copy the container to the remote offsite server:
lxc stop container
lxc copy container server:container --storage=storage --mode=relay
lxc start container

Then attempt to do a refresh:
lxc copy container server:container --storage=storage --mode=relay --refresh

My guess is this is related to the fact that the containers have snapshots, which appear to be read-only?
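One way to check that guess (assuming standard btrfs-progs, and with the path being just an example) is to query the subvolume’s read-only property directly:

btrfs property get -ts /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/<container>/<snapshot> ro

which should print ro=true or ro=false.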

I cannot reproduce that.

Could you please also post how the containers and snapshots are created/deleted before copying or refreshing?

I’ll do some more tests on this tonight and see if I can come up with a full script.
FYI I’m running 5.0.0-b0287c1 on both ends, on 20.04 LTS.


The plot thickens…

It appears I can’t reproduce it either with any new containers I create, so whatever was causing the issue seems to be fixed now.

However, my existing 15 containers that were affected by the original BTRFS problem all have the same error, even though I freshly copied them across. (BTW it appears to be unrelated to the --storage option, as even the ones that are in the default pool on both ends have the same issue.)

Ahhh here we go…
I can’t delete them from the target either; I get the same BTRFS property set error.

OK let me completely delete one of them, clean up and try a fresh copy from scratch again and see what happens.

OK I just did a fresh copy of one of my existing containers to the target.
Then went to delete it on the target and got the exact same BTRFS error as when I do a refresh copy.

lxc delete container
Error: Error deleting storage volume: Failed setting subvolume writable "/var/snap/lxd/common/lxd/storage-pools/default/containers/container": Failed to run: btrfs property set -ts /var/snap/lxd/common/lxd/storage-pools/default/containers/container ro false: ERROR: Could not get subvolume flags: Invalid argument
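As a sanity check (just a guess on my part, path as in the error above), this should confirm whether that path is still a real subvolume or has ended up as a plain directory, which might explain the "Could not get subvolume flags: Invalid argument" part:

btrfs subvolume show /var/snap/lxd/common/lxd/storage-pools/default/containers/container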

So it is 100% related to this topic:

I will jump over and follow up on that topic so this one can be closed.

Up until now, I have always been very pleased with LXD, but I must say that such a bug, compromising the integrity of a whole server and killing all containers at once, is very disturbing. Fortunately, the affected server is a backup server, but still…

I’m going to do the same as @DanielBull: remove everything from the backup server and redo the copies from the source, i.e. the live server.

Yes, we are sorry about this; it has certainly exposed gaps in our automated testing, which have now been improved to hopefully catch these sorts of regressions in the future.

@monstermunchkin has some further fixes for BTRFS optimized refresh due for LXD 5.1:

I did so too. After upgrading to version 5.1 I deleted all containers on the target backup server, and afterwards I also removed the leftover ZFS datasets belonging to the containers that ‘lxc delete’ had left behind.

After that I was able to do one successful ‘copy --refresh’. When I try to do another ‘copy --refresh’ from the source to the target machine I now get the following error:

/snap/bin/lxc copy --mode push --refresh --stateless --storage default --config boot.autostart=false archive virt-slave:archive
Error: Failed instance migration: websocket: close 1006 (abnormal closure): unexpected EOF