Images not syncing to a restored cluster member, image copy fails with Error: Failed remote image download:

One of my cluster members crashed, so I reinstalled the machine, removed it from the cluster with lxc cluster remove --force --yes, and re-added it under the same name.

So far so good, but the images did not sync. The reinstalled server only has the images that have been used to launch a container since the reinstall.

reinstalled server:

[root@lxd11 ~]# ls -al /var/snap/lxd/common/lxd/images/
total 983438
drwx------   2 root root         4 May 25 16:45 .
drwx--x--x. 17 root root        23 May 22 18:41 ..
-rw-------   1 root root 489073017 May 25 16:45 01cded055ad49de51379ab533683abed5d94ee136ad9c0c14bf4a445a6c23a48
-rw-------   1 root root 537145683 May 22 17:37 368467b7ed1a4c3f357403f5ead7e757edb09187c3160bb27cf4cb6edc29bd72

all other servers:

[root@lxd10 ~]#  ls -al /var/snap/lxd/common/lxd/images/
total 6534083
drwx------   2 root root        22 May 22 11:59 .
drwx--x--x. 17 root root        23 Apr 19 21:47 ..
-rw-------   1 root root 489073017 May 14 23:12 01cded055ad49de51379ab533683abed5d94ee136ad9c0c14bf4a445a6c23a48
-rw-------   1 root root       688 May 14 23:27 01d77987a3f384a9785fd472f0fba97bdc31e579d02eb812d01d6e3c2c17068a
-rw-------   1 root root 137826304 May 14 23:27 01d77987a3f384a9785fd472f0fba97bdc31e579d02eb812d01d6e3c2c17068a.rootfs
-rw-------   1 root root 536849020 Feb 20 10:14 0f84b95c68ce4abb65680a5f2bed2e0587610db0cea70bbc31a9f4bff4164965
-rw-------   1 root root      1176 May 14 23:27 1341a7cbb1af78fd401c04a52023ce22b15d6c3cfe3dc1027fba5e4d3731b187
-rw-------   1 root root 130514944 May 14 23:27 1341a7cbb1af78fd401c04a52023ce22b15d6c3cfe3dc1027fba5e4d3731b187.rootfs
-rw-------   1 root root 536856546 Feb 23 13:14 179a4853352bc1eddf637b4402d4528f23665b53c91224fd4cf4fc7edc9e6ed5
-rw-------   1 root root 492253197 Apr 19 11:16 1d1a70a4ccc5fd74c479cf5e193cbc1e622d0802511f718a5d5bb82e37e928ba
-rw-------   1 root root 584767767 Jul 15  2022 3508e22f237b82fe848da6e669b22ea794cae0d5f949434fa5d45301016208f9
-rw-------   1 root root 537145683 May 22 11:59 368467b7ed1a4c3f357403f5ead7e757edb09187c3160bb27cf4cb6edc29bd72
-rw-------   1 root root 537223579 Apr 19 13:06 4ed94dbd1a1a1d3661d1f7e74b27a46b2f9712c23235acdd4d836cf20969b333
-rw-------   1 root root 536857953 Feb 22 09:37 8c628e7cb40c82242c1ebfd0c098656e7f99847cf9b43c877eb092e005f4a481
-rw-------   1 root root      1164 Apr 19 08:49 9ed948f1d06f5fb2c32a81441d7dd68d984b011919ba8397167301031fcb3e50
-rw-------   1 root root 129970176 Apr 19 08:49 9ed948f1d06f5fb2c32a81441d7dd68d984b011919ba8397167301031fcb3e50.rootfs
-rw-------   1 root root      1316 Apr 19 06:24 acae80e39b0b3143d148fcd0bc5ff584c7896389c5787567223eee6311db243a
-rw-------   1 root root 157433856 Apr 19 06:24 acae80e39b0b3143d148fcd0bc5ff584c7896389c5787567223eee6311db243a.rootfs
-rw-------   1 root root 456144502 Feb 16 15:45 b437ad1fd33eb4189c280ea639aae30bfc69981af6860e6a38d127025040d7c3
-rw-------   1 root root 536120199 Feb 14 21:49 cf72052b3dcf1ccccee77f0c6be937bf503d4b81e3c39d477d46c6babbb9fbdd
-rw-------   1 root root 456832065 Feb 20 14:13 eb419917cae575e10e14c1219d8e2dd9c6c8479414213735866bb31e95e5abed
-rw-------   1 root root 585300545 Mar 29 15:46 f68f28079c82d0d80d7d15c93678641fe598e31955e913fc53f1df3f07776166
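
To see exactly which fingerprints are missing on the reinstalled member, the two listings can be compared mechanically. This is a minimal sketch, assuming each member's `ls -1` output has first been saved to a text file; the function name and file names are hypothetical.

```shell
# missing_images REF_LIST TARGET_LIST
# Both arguments are text files holding one file name per line, e.g. the
# output of `ls -1 /var/snap/lxd/common/lxd/images/` saved on each member.
# Prints the fingerprints that appear in REF_LIST but not in TARGET_LIST.
missing_images() {
    ref_sorted=$(mktemp)
    tgt_sorted=$(mktemp)
    # Split images store a metadata file plus a ".rootfs" file; strip the
    # suffix so each fingerprint is counted once.
    sed 's/\.rootfs$//' "$1" | sort -u > "$ref_sorted"
    sed 's/\.rootfs$//' "$2" | sort -u > "$tgt_sorted"
    # comm -23 keeps only the lines unique to the first (reference) file.
    comm -23 "$ref_sorted" "$tgt_sorted"
    rm -f "$ref_sorted" "$tgt_sorted"
}
```

With the listings above saved as lxd10.txt and lxd11.txt (hypothetical names), `missing_images lxd10.txt lxd11.txt` would print every fingerprint except the two that lxd11 already holds.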

I’m also getting the following errors when trying to copy one of the missing images from the reinstalled member to another server:

~$  lxc image copy lxd11:f68f28079c82d0d80d7d15c93678641fe598e31955e913fc53f1df3f07776166 alexsv:
Error: Failed remote image download: Failed to cancel operation "00ff4c99-d2f6-48fe-963e-ab9746cfab70": Failed to delete remote operation "00ff4c99-d2f6-48fe-963e-ab9746cfab70" on "lxd11:8443": Only running operations can be cancelled
Exit Code: 1
~$  lxc image copy lxd11:f68f28079c82d0d80d7d15c93678641fe598e31955e913fc53f1df3f07776166 alexsv:
Error: Failed remote image download: Failed to cancel operation "a58067b0-7941-4de2-8054-0bc6c7e52c84": Failed to delete remote operation "a58067b0-7941-4de2-8054-0bc6c7e52c84" on "lxd11:8443": Only running operations can be cancelled

It’s a different operation ID every time.

Thank you

What happens if you try and use one of the missing images as the basis for a new instance?
It should be opportunistically copied from another cluster member I believe.
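
As a stopgap, that behaviour could be used to pull a missing image onto a specific member. This is a rough sketch only: it assumes that creating an instance on the member is enough to make it fetch the image, and the function name and temporary instance name are hypothetical.

```shell
# pull_image_to_member FINGERPRINT MEMBER
# Creates a throwaway instance from the image on the given cluster member
# (without starting it), then deletes the instance again.
# Assumption: creating the instance makes the member fetch the image.
pull_image_to_member() {
    tmp_name="imgpull-$$"   # hypothetical temporary instance name
    lxc init "$1" "$tmp_name" --target "$2" || return 1
    lxc delete "$tmp_name"
}
```

For example, `pull_image_to_member f68f28079c82 lxd11` would try this for one of the images listed above; whether a shortened fingerprint is accepted here is also an assumption, so the full fingerprint is the safer choice.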

Yes, that works, but shouldn’t image copy behave the same way? Or any other image operation that has to read the image, for that matter?

Yes, it should work, I think. Please can you file a bug report?

I just experimented some more and realized I actually have a copy of all the images as storage volumes (I’m using zfs as backing storage, if that matters). The images in /var/snap/lxd/common are only used once, while importing into storage volumes, and I can actually clean up /var/snap/lxd/common/lxd/images without any ill effects. Is this correct?

If it is indeed correct, then this is also a case of image storage volumes not being recovered by lxd recover. Should they be? It’s a bit of a moot point if they are correctly downloaded on demand, but it might be easy enough to implement as a convenience.

I’m not quite following the issue, nor can I reproduce it.

Please can you show reproducer steps from a fresh LXD installation?

ok, here it is:

fresh installation

root@alexsv:~# lxc image list
+-------+-------------+--------+-------------+--------------+------+------+-------------+
| ALIAS | FINGERPRINT | PUBLIC | DESCRIPTION | ARCHITECTURE | TYPE | SIZE | UPLOAD DATE |
+-------+-------------+--------+-------------+--------------+------+------+-------------+
root@alexsv:~# ls -al /var/snap/lxd/common/lxd/images/
total 34
drwx------  2 root root  2 Mai 30 18:36 .
drwx--x--x 17 root root 24 Mai 25 11:33 ..

Now let’s initialize an instance from a public image; there is no need to start it.

root@alexsv:~# lxc image list images: architecture=x86_64 type=container ubuntu/22.04/cloud -c lF
+-----------------------------+------------------------------------------------------------------+
|            ALIAS            |                           FINGERPRINT                            |
+-----------------------------+------------------------------------------------------------------+
| ubuntu/jammy/cloud (3 more) | a28b24b8bbfbcfdfa6cd129dbf573c996f6b5e0ab3f6adc289d40df09e3585d7 |
+-----------------------------+------------------------------------------------------------------+

root@alexsv:~# lxc init images:a28b24b8bbfbcfdfa6cd129dbf573c996f6b5e0ab3f6adc289d40df09e3585d7 testcontainer
Retrieving...
unpacking...
Creating testcontainer

OK, so far so good: we have an image file in /var/snap/lxd/common/lxd/images/ and a storage volume with the same content.

root@alexsv:~# ls -al /var/snap/lxd/common/lxd/images/
total 135020
drwx------  2 root root         4 Mai 30 18:42 .
drwx--x--x 17 root root        24 Mai 25 11:33 ..
-rw-r--r--  1 root root       704 Mai 30 18:42 a28b24b8bbfbcfdfa6cd129dbf573c996f6b5e0ab3f6adc289d40df09e3585d7
-rw-r--r--  1 root root 138113024 Mai 30 18:42 a28b24b8bbfbcfdfa6cd129dbf573c996f6b5e0ab3f6adc289d40df09e3585d7.rootfs

root@alexsv:~# lxc storage volume list default name=a28b24b8bbfbcfdfa6cd129dbf573c996f6b5e0ab3f6adc289d40df09e3585d7
+-------+------------------------------------------------------------------+-------------+--------------+---------+
| TYPE  |                               NAME                               | DESCRIPTION | CONTENT-TYPE | USED BY |
+-------+------------------------------------------------------------------+-------------+--------------+---------+
| image | a28b24b8bbfbcfdfa6cd129dbf573c996f6b5e0ab3f6adc289d40df09e3585d7 |             | filesystem   | 1       |
+-------+------------------------------------------------------------------+-------------+--------------+---------+

Now let’s remove the files and see if it lets us init again:

root@alexsv:~# rm -rf  /var/snap/lxd/common/lxd/images/*
root@alexsv:~# lxc rm -f testcontainer
root@alexsv:~# lxc init local:a28b24b8bbfbcfdfa6cd129dbf573c996f6b5e0ab3f6adc289d40df09e3585d7 testcontainer
Creating testcontainer

OK, so for the actual init/launch the file in /var/snap/lxd/common/lxd/images/ is not needed.

Let’s try to copy the image somewhere else:

root@alexsv:~# lxc image copy local:a28b24b8bbfbcfdfa6cd129dbf573c996f6b5e0ab3f6adc289d40df09e3585d7 lxd:
Error: Failed remote image download: open /var/snap/lxd/common/lxd/images/a28b24b8bbfbcfdfa6cd129dbf573c996f6b5e0ab3f6adc289d40df09e3585d7: no such file or directory

OK, so for copy we need the file in /var/snap/lxd/common/lxd/images/. Why not use the storage volume, as init/launch does? That way the file could be removed from /var/snap/lxd/common/lxd/images/ as soon as it has been imported into the storage volume, saving some storage and headaches.
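
Given that behaviour, a quick pre-flight check can tell whether a copy is likely to fail before it is attempted. This is a minimal sketch with hypothetical function names; the default directory is the snap path used in this thread, and IMAGES_DIR is an override introduced only for this example.

```shell
# image_file_present IMAGES_DIR FINGERPRINT
# Succeeds if the image file for FINGERPRINT exists in IMAGES_DIR.
image_file_present() {
    [ -f "$1/$2" ]
}

# check_before_copy FINGERPRINT
# Reports whether the image file needed by `lxc image copy` is still on disk.
# IMAGES_DIR may be set to override the default snap path.
check_before_copy() {
    if image_file_present "${IMAGES_DIR:-/var/snap/lxd/common/lxd/images}" "$1"; then
        echo "ok: image file present for $1"
    else
        echo "missing: image file for $1 not found; lxc image copy is likely to fail"
        return 1
    fi
}
```

Running `check_before_copy` with the full fingerprint as root on the member that would serve the copy shows whether the file behind the "no such file or directory" error is actually there.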

What is the storage pool driver?

zfs

Right, so for instances using storage pools that support the optimized image storage feature (zfs, btrfs, lvm thin, and ceph) the first time an image is used it is unpacked into an image volume on the specific storage pool where the instance will be located. Then a writable snapshot of that image volume is taken that will be used as the root disk for the instance. Subsequent instances created using that image will just take another writable snapshot of that existing image volume. In that way it is quicker to launch future instances because the unpack operation only has to occur once.

However, if you’re using a storage pool that doesn’t support optimized image storage (dir, lvm non-thin), then the image is unpacked for every instance, so it needs to be kept around. Additionally, if you create a new storage pool, or the image volume gets removed from the storage pool, then LXD will use the downloaded file to (re-)create image volumes by unpacking it again.
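
The driver distinction described above can be written down as a small helper. It is only a sketch of the rule as stated in this thread, not LXD's own logic; the function name is hypothetical, and treating lvm as thin by default is an assumption of the example.

```shell
# supports_optimized_images DRIVER [THIN]
# DRIVER is the storage pool driver name; THIN ("true"/"false") only matters
# for lvm and defaults to "true" here (an assumption of this sketch).
# Returns 0 if, per the explanation above, the pool keeps an unpacked image
# volume and snapshots it for new instances.
supports_optimized_images() {
    case "$1" in
        zfs|btrfs|ceph) return 0 ;;
        lvm) [ "${2:-true}" = "true" ] ;;
        *) return 1 ;;
    esac
}
```

For a dir pool, `supports_optimized_images dir` fails, which matches the case where the downloaded image file has to be kept because it is unpacked for every instance.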

If space on the root filesystem of the host is a problem, you can instruct LXD to store the compressed image files that are downloaded on a custom volume (which isn’t the same as an image volume) on the storage pool you choose.
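
A minimal sketch of that setup, assuming the storage.images_volume server setting in POOL/VOLUME form; the function name, pool name and volume name are placeholders, and on a cluster the setting may need to be applied per member, so check the documentation for your LXD version first.

```shell
# use_custom_image_volume POOL VOLUME
# Creates a custom volume and points LXD's image file storage at it.
use_custom_image_volume() {
    lxc storage volume create "$1" "$2" || return 1
    # storage.images_volume takes the form POOL/VOLUME.
    lxc config set storage.images_volume "$1/$2"
}
```

For example, `use_custom_image_volume default images` would create a custom volume called images on the default pool and store downloaded image files there.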

See the forum topic “Change image storage” (post #2, by tomp).

If you are deleting image files on the host and not their associated DB entries, then having errors is expected.

Thanks, all is clear now. I guess I should still open a ticket so that lxc image copy triggers an opportunistic download from another cluster member, like lxc init does?

That bit I’m still not clear on, as I was not able to reproduce it.