Error: Error transferring instance data: Failed getting instance storage pool name: Instance storage pool not found

Got a strange one here…
Two servers, one on Ubuntu 20.04.3 and one on 20.04.2.
Both running the LXD snap 5.0.0-c5bcb80.
Both using BTRFS-backed storage pools.

Every night server A backs up its LXD containers to server B using:
lxc copy container serverb:container --storage=default --refresh --mode=relay
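For reference, the nightly job is basically just that command run once per container; a simplified sketch of the loop (not the exact script):

# loop over every local container by name and refresh-copy it to server B
for c in $(lxc list -c n --format csv); do
  lxc copy "$c" serverb:"$c" --storage=default --refresh --mode=relay
done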

This has been working fine for years, but last night it got part of the way through the backups (a few containers were copied successfully) and then failed.

Running the command on its own this morning returns:
Error: Error transferring instance data: Failed getting instance storage pool name: Instance storage pool not found

Both servers have plenty of disk, both servers appear to be functioning perfectly, all containers are visible on both servers.

Server A has two storage pools…

lxc storage info default
info:
  description: ""
  driver: btrfs
  name: default
  space used: 280.95GiB
  total space: 745.06GiB
used by:

lxc storage info srv
info:
  description: ""
  driver: btrfs
  name: srv
  space used: 675.60GiB
  total space: 1.82TiB
used by:
  instances:

Server B has one pool…
lxc storage info default
info:
  description: ""
  driver: btrfs
  name: default
  space used: 4.08TiB
  total space: 10.92TiB
used by:

I’m a little baffled as I can’t figure out what’s wrong. Can anyone point me in the right direction, as my Google skills aren’t coming up with an answer?

Thanks

Please can you show the output of lxc config show <instance> and lxc config show <instance> --expanded for the instance that failed to transfer?

Also please can you show the output of lxc storage volume ls <pool> on both the source and target servers?

Hi Tom,
Thanks for getting back.
FYI: I’ve just restarted server A on the off chance; it made no difference.

It fails on all containers, even the ones which were successfully synced last night.

Here is an example of one of them…

lxc config show unifi
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20200804)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20200804"
  image.type: squashfs
  image.version: "20.04"
  volatile.base_image: 97c470e427c425cf2ec4d7d55b6f1397ea55043c518b194a58fc6b9da426f540
  volatile.eth0.host_name: veth88e81b03
  volatile.eth0.hwaddr: 00:16:3e:bf:22:e1
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 1e6a0762-f949-4a9c-8a75-86390d5c9200
devices:
  root:
    path: /
    pool: srv
    type: disk
ephemeral: false
profiles:
- default
stateful: false
lxc config show unifi --expanded
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20200804)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20200804"
  image.type: squashfs
  image.version: "20.04"
  volatile.base_image: 97c470e427c425cf2ec4d7d55b6f1397ea55043c518b194a58fc6b9da426f540
  volatile.eth0.host_name: veth88e81b03
  volatile.eth0.hwaddr: 00:16:3e:bf:22:e1
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 1e6a0762-f949-4a9c-8a75-86390d5c9200
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: srv
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

OK just tried this.
On server A it listed all the containers and snapshots in both pools, as expected.

On server B it was blank. Very interesting…

lxc storage volume ls default
+------+------+-------------+--------------+---------+
| TYPE | NAME | DESCRIPTION | CONTENT-TYPE | USED BY |
+------+------+-------------+--------------+---------+
lxc storage list
+---------+--------+----------+-------------+---------+---------+
|  NAME   | DRIVER |  SOURCE  | DESCRIPTION | USED BY |  STATE  |
+---------+--------+----------+-------------+---------+---------+
| default | btrfs  | /srv/lxd |             | 1       | CREATED |
+---------+--------+----------+-------------+---------+---------+

What is also interesting is that on server B only the snapshots are listed:

lxc info unifi
Name: unifi
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2021/03/23 22:49 GMT

Snapshots:
+--------------------+----------------------+------------+----------+
|        NAME        |       TAKEN AT       | EXPIRES AT | STATEFUL |
+--------------------+----------------------+------------+----------+
| 2022-03-01-monthly | 2022/03/01 00:16 GMT |            | NO       |
+--------------------+----------------------+------------+----------+
| 2022-03-13-weekly  | 2022/03/13 00:16 GMT |            | NO       |
+--------------------+----------------------+------------+----------+
| 2022-03-20-weekly  | 2022/03/20 00:17 GMT |            | NO       |
+--------------------+----------------------+------------+----------+
| 2022-03-27-weekly  | 2022/03/27 00:17 GMT |            | NO       |
+--------------------+----------------------+------------+----------+
| 2022-03-30-daily   | 2022/03/30 00:16 BST |            | NO       |
+--------------------+----------------------+------------+----------+
| 2022-03-31-daily   | 2022/03/31 00:16 BST |            | NO       |
+--------------------+----------------------+------------+----------+
| 2022-04-01-daily   | 2022/04/01 00:17 BST |            | NO       |
+--------------------+----------------------+------------+----------+
| 2022-04-01-monthly | 2022/04/01 00:17 BST |            | NO       |
+--------------------+----------------------+------------+----------+
| 2022-04-02-daily   | 2022/04/02 00:16 BST |            | NO       |
+--------------------+----------------------+------------+----------+
| 2022-04-03-daily   | 2022/04/03 00:17 BST |            | NO       |
+--------------------+----------------------+------------+----------+
| 2022-04-03-weekly  | 2022/04/03 00:17 BST |            | NO       |
+--------------------+----------------------+------------+----------+
| 2022-04-04-daily   | 2022/04/04 00:16 BST |            | NO       |
+--------------------+----------------------+------------+----------+

Looks like the containers have somehow vanished?

I guess the solution is to delete the pool and recopy the containers and snapshots across?

Try deleting the instance on the target and then retry to see if it works. Then try it a few times to check that the refresh mechanism is working.

This may be either due to a bug in LXD 5.0, or perhaps the validation in LXD 5.0 is stricter and it’s checking that the storage volume record exists where it didn’t previously.

You shouldn’t need to delete the pool, just the instance on the target.
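For example, using the unifi container from above (adjust the name and flags to match your setup):

lxc delete serverb:unifi
lxc copy unifi serverb:unifi --storage=default --refresh --mode=relay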

OK thanks I’ll give it a go.

OK, it’s failing with:
Error: Error transferring instance data: Cannot create volume, already exists on migration target storage

I’ll use BTRFS to remove the old containers
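Something like this on server B (paths based on the /srv/lxd pool source shown above; the exact subvolume layout may differ):

# find the leftover subvolumes for the container
sudo btrfs subvolume list /srv/lxd | grep unifi
# then remove the container subvolume itself (snapshot subvolumes, if any, sit under containers-snapshots/)
sudo btrfs subvolume delete /srv/lxd/containers/unifi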

OK so that means just the storage volume DB records were missing, not the actual storage volumes.

Will be interesting to see if they get recreated once you’ve re-transferred them.
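If you want to double-check that, you can compare the DB records with what’s actually on disk, something along these lines (schema and paths may vary slightly between versions):

lxd sql global "SELECT * FROM storage_volumes WHERE name='unifi'"
sudo btrfs subvolume list /srv/lxd | grep unifi

The first should return a row for the container volume, the second should show its subvolume on disk.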

OK interesting…

I ran:
lxc copy container serverb:container --storage=default --mode=relay
It completed successfully.

I then ran:
lxc copy container serverb:container --storage=default --refresh --mode=relay
It failed with…
Error: Error transferring instance data: Got error reading source

So I ran it again:
lxc copy container serverb:container --storage=default --refresh --mode=relay
It failed with…
Error: Error transferring instance data: Failed getting instance storage pool name: Instance storage pool not found

I checked the storage volume listing during the transfer and it showed all the volumes.
When I checked after the final error, it was blank again.

Sounds like an LXD 5.0 bug to me. It’s likely to be to do with either the storage volume DB management or the optimized BTRFS refreshes. CC @monstermunchkin

Out of interest, can you create a dir storage pool on the target and try transferring into that rather than a btrfs one? That will avoid the optimized refresh and hopefully help to narrow down where the issue is.

btrfs subvol list /srv | grep container
Currently shows all the snapshots but not the container itself (which is odd, as it was there after the previous failure).

lxc list
Shows the container

lxc storage list
Shows

+---------+--------+----------+-------------+---------+---------+
|  NAME   | DRIVER |  SOURCE  | DESCRIPTION | USED BY |  STATE  |
+---------+--------+----------+-------------+---------+---------+
| default | btrfs  | /srv/lxd |             | 1       | CREATED |
+---------+--------+----------+-------------+---------+---------+

On Server B
lxc storage create test dir

On Server A

lxc copy container serverb:container --storage=test --refresh --mode=relay
lxc copy container serverb:container --storage=test --refresh --mode=relay
lxc start container
lxc copy container serverb:container --storage=test --refresh --mode=relay

No errors.
So the issue appears to be related to BTRFS on the target.

@monstermunchkin any ideas? Looks like it could be an issue with optimized BTRFS refresh?

I’ve reproduced the issue and am tracking it here:

Copying comments over from the GitHub post, for ZFS.

--refresh for ZFS appears to be broken in multiple ways as well.
If a failure occurs, you are unable to ever send the container again, even after deleting and recreating it.
Resulting error when trying to send again:

"Error: Failed instance creation: Error transferring instance data: Cannot create volume, already exists on migration target storage"

This can be easily reproduced by the following:

  1. Snapshot local container
    lxc snapshot container snapshot1
    lxc snapshot container snapshot2
  2. Copy the container to a remote
    lxc copy container remote:container
  3. Snapshot the local container again
    lxc snapshot container snapshot3
  4. Remove snapshot1 from remote container
    lxc delete remote:container/snapshot1
  5. Try to refresh the local container to the remote
    lxc copy container remote:container --refresh
    This will fail with a ZFS error about needing to remove the previous snapshots, because LXD wants to recopy the entire tree.
  6. Try to do the send again and you’ll get the migration error above, now locked to that container name.

Reasonable fix

  1. LXD should default to syncing only the missing snapshots, from the latest on the target up to the latest on the source, rather than doing a full resync, as you may want to keep the older snapshots on the target.
    An additional flag for a full resync can be added; this will need a warning about possible data deletion.
    lxc already has the ability to check which snapshots exist on the target and can do a simple lookup to see which match on the source (rough sketch below).
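A rough sketch of the kind of lookup I mean (the container/remote names are from the example above, not anything special):

diff <(lxc query /1.0/instances/container/snapshots) <(lxc query remote:/1.0/instances/container/snapshots)

Anything present locally but missing on the remote is what actually needs sending.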

Additional problems caused by failures in --refresh:

When a copy fails using ZFS, it leaves a zfs send running without an actual target.
This causes other issues when trying to clean up snapshots/containers, as you are unable to delete them while a send is running.
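When that happens, the orphaned send has to be found and stopped by hand before the datasets can be destroyed; roughly:

pgrep -af 'zfs send'    # list any sends left behind by the failed copy
pkill -f 'zfs send'     # stop them (check the list first so you don't kill a legitimate transfer)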

Improvements for zfs send.

  1. zfs send should exclude the -c flag unless it is explicitly asked for in the copy. Including -c overrides the target’s compression settings and uses the source’s instead, which can cause issues when you are not expecting the size differences.
    e.g. the source server runs lz4 compression but the target runs zstd-6 compression to save space.

  2. When sending bulk snapshots, -I can be used to group all the snapshots together instead of sending them individually.
    If the snapshots need to be sent individually, -i can be used to send only the incremental data instead (rough examples of both below).
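Rough examples of what I mean (the dataset and host names here are placeholders, not my actual layout):

# check which compression each side is using
zfs get compression tank/lxd/containers/container
# -I groups the whole snapshot range into a single incremental stream
zfs send -I tank/lxd/containers/container@snapshot1 tank/lxd/containers/container@snapshot3 | ssh remote zfs receive backup/containers/container
# -i sends just one incremental step
zfs send -i tank/lxd/containers/container@snapshot2 tank/lxd/containers/container@snapshot3 | ssh remote zfs receive backup/containers/container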

Can anyone point me in the direction of how I will know when this is fixed for BTRFS and rolled out in the snap?
It’s just that I’m having to do backups manually at the moment…

It will be in the LXD 5.1 sources.

I think it is already in the latest/stable snap channel as a cherry-pick on top of LXD 5.0.
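If you want to keep an eye on it, snap info lxd shows the version/revision each channel is currently shipping alongside what you have installed, and snap refresh pulls in the newer revision once it lands:

snap info lxd
sudo snap refresh lxd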