Migration operation fails

keesbghs · October 9, 2020, 12:33pm

Before reporting this as an issue on github I want to mention it here first.

While trying to reproduce the “no progress for copy/move” problem, I wanted to do a test, but it gave me some other error.

root@ijssel:~# lxc init ubuntu:18.04 c4 --target ijssel
Creating c4
root@ijssel:~# lxc move c4 c5 --target luts
Error: Migration operation failure: Copy instance operation failed: Failed instance creation: Error transferring instance data: open /var/snap/lxd/common/lxd/images/39a93d0b355279d430e8ce21c689aa88515212ee99874276e77f7f31ad7bf810: no such file or directory

It looks like the image of ubuntu:18.04 is not (yet) present on all targets. The image list looks OK.

root@ijssel:~# lxc image list
+----------------+--------------+--------+---------------------------------------------+--------------+-----------+----------+------------------------------+
|     ALIAS      | FINGERPRINT  | PUBLIC |                 DESCRIPTION                 | ARCHITECTURE |   TYPE    |   SIZE   |         UPLOAD DATE          |
+----------------+--------------+--------+---------------------------------------------+--------------+-----------+----------+------------------------------+
|                | 39a93d0b3552 | no     | ubuntu 18.04 LTS amd64 (release) (20200922) | x86_64       | CONTAINER | 187.70MB | Oct 9, 2020 at 10:33am (UTC) |
+----------------+--------------+--------+---------------------------------------------+--------------+-----------+----------+------------------------------+

Interestingly, if I choose a target where another container was created before, the move succeeds without a problem.

root@ijssel:~# lxc init ubuntu:18.04 c6 --target luts
Creating c6
root@ijssel:~#
root@ijssel:~# lxc move c4 c5 --target luts
root@ijssel:~#

stgraber · October 9, 2020, 1:28pm

What storage driver is used here?
And I’m assuming that’s on 4.6?

keesbghs · October 9, 2020, 1:29pm

Storage driver lvm
And, yes, snap lxd 4.6

keesbghs · October 9, 2020, 1:31pm

If I do nsenter --mount=/run/snapd/ns/lxd.mnt ls -lh /var/snap/lxd/common/lxd/images/, I see a different result on each of the 6 nodes in my cluster.

stgraber · October 9, 2020, 2:29pm

Right, that part is normal, not all images are stored on all cluster nodes, but LXD should know about that and not fail if a particular image isn’t available locally.

keesbghs · October 9, 2020, 6:16pm

Do you want me to create an issue on github?

stgraber · October 9, 2020, 6:56pm

Sure, that will save me from having to keep this tab open to remember it.

keesbghs · October 9, 2020, 7:15pm

See https://github.com/lxc/lxd/issues/8015

keesbghs · October 11, 2020, 1:45pm

Thanks for fixing the issue, @stgraber

What is interesting to notice is that eventually the images will be present on the other nodes. Is there some background process that takes care of this?

I am wondering, should all images in /var/snap/lxd/common/lxd/images/ match with the global database? Because I’m seeing an image with a fingerprint that is not in the database.

stgraber · October 11, 2020, 3:33pm

LXD does internal copies as needed.

If an image is directly imported by the user (lxc image copy or lxc image import), LXD will immediately copy it to at least 3 cluster members to ensure high availability.

For images which are merely cached and came from an external source (the most common case), LXD does not replicate them, as it knows they can be fetched remotely if needed. Each server then retrieves the image when it needs it, copying it from another cluster member or, if none has it, from the original server.

The issue you ran into was code which predated clustering and so was wrongly assuming that an image existing in the database meant it was also available on local disk. My fix is to now confirm that it both exists in the database for the current project AND that it’s locally available on disk. If not, LXD will continue as if it’s a new image, receiving the data from the source server.
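
The two-part check described above can be sketched in shell. This is only an illustration of the logic, not LXD's actual code: `image_usable_locally` is a hypothetical helper, a plain text file of fingerprints stands in for the database lookup, and projects are ignored for simplicity.

```shell
#!/bin/sh
# Hypothetical sketch of the check described above; not LXD's real code.
# $1: full image fingerprint
# $2: file with one known fingerprint per line (stand-in for the database)
# $3: directory holding image files named after their fingerprint
image_usable_locally() {
    fingerprint=$1
    db_list=$2
    images_dir=$3

    # Condition 1: the image is recorded in the database.
    grep -qxF -- "$fingerprint" "$db_list" || return 1

    # Condition 2: the image file is actually present on this member's disk.
    [ -f "$images_dir/$fingerprint" ]
}
```

A non-zero return corresponds to the "continue as if it's a new image" branch: the caller would then receive the image data from the source server instead of trying to open a local file that does not exist.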

keesbghs · October 11, 2020, 5:30pm

This is not quite what I saw happening. I created a container from an external source (Ubuntu 16.04). The image was present on one member of the cluster. None of the five other members had this image. Now, two days later, without doing anything(!), I see that the image is on five of the six members in my cluster.

And one cluster member has another image that I can’t explain.

Hence my question: should the images at least match what is in the global database?

keesbghs · October 11, 2020, 5:32pm

And while I was typing that message, I saw that the Ubuntu 16.04 image is now on the sixth member as well.

stgraber · October 11, 2020, 5:34pm

Yeah, all images in /var/snap/lxd/common/lxd/images should exist in the database.
On occasion that’s not the case because of LXD restarting while in the middle of an image update. For that reason we have logic on startup which will automatically prune such leftover images.

We’re also reworking the way we handle image auto-refresh to avoid such issues in general. Our current plan of record is to have one cluster member in charge of each individual image, making sure it gets refreshed in a consistent way while keeping in mind the various projects and servers it needs to be stored on.

keesbghs · October 11, 2020, 5:40pm

Now I am starting to worry.

First, there is an image on one member that’s not in the database. It was created in the middle of the night.
Second, images got synced spontaneously. There was no human interaction at all.

stgraber · October 11, 2020, 6:01pm

There would be cause to worry if there was an image in the database which does not exist anywhere. An image which exists on disk but not in the database is most likely a remnant of a failed download or refresh. Obviously not ideal but also something that will clear itself up automatically on the next LXD restart.
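
To spot such leftovers on a given member, the files in the images directory can be compared against a list of fingerprints known to the database. The following is a minimal sketch under a few assumptions: `images_not_in_database` is a hypothetical helper, the fingerprint list is a plain file with one full fingerprint per line, and image files are assumed to be named either `<fingerprint>` or `<fingerprint>.rootfs`.

```shell
#!/bin/sh
# Hypothetical helper: print fingerprints that have a file in the images
# directory but are missing from a list of known fingerprints.
# $1: images directory (files named <fingerprint> or <fingerprint>.rootfs)
# $2: file with one full fingerprint per line (stand-in for the database)
images_not_in_database() {
    images_dir=$1
    db_list=$2

    for path in "$images_dir"/*; do
        # Skip non-files; this also skips the unexpanded glob of an empty dir.
        [ -f "$path" ] || continue
        name=${path##*/}
        # Assumed layout: split images have a companion .rootfs file.
        fingerprint=${name%.rootfs}
        grep -qxF -- "$fingerprint" "$db_list" || printf '%s\n' "$fingerprint"
    done | sort -u
}
```

On a snap install, the directory would need to be read inside the snap's mount namespace, as with the nsenter command earlier in this thread. The fingerprint list could come from something like `lxc image list --format csv -c F`, assuming your LXD version supports the `F` (full fingerprint) column; verify that against `lxc image list --help` before relying on it. An empty result means every image file on that member is known to the database.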

Cached images (lxc image info FINGERPRINT | grep Cached) normally should not automatically sync as there is no real reason for it and it’s effectively just wasting space. However with the current way image refreshes are handled (each server does it on its own), it’s possible that it can happen as a result of a race condition.

Neither of those behaviors is cause for concern, as the only effect may be an image being more available than it strictly needs to be. It’s something we intend to rework in the coming months to make it more predictable.