Images vanished

I just went on a work LXD cluster and, for reasons I can’t yet figure out, all base images have suddenly vanished. I can’t see any evidence that a user deleted them, but they’re all gone. Has this been observed before? Is there any way I can recover them or find out what happened to them? There’s no evidence that any container has vanished yet, but a lot of work seems to have just disappeared.

What version is that?
Can you show lxd sql global "SELECT * FROM images;"?

lxc/lxd 4.12

The query shows no results. Just to be sure, I ran the same thing on all cluster members and got the same result.

Hmm, that’s pretty odd. Can you look into lxd.log and the archived ones (lxd.log.1 and the .gz ones) on all of your machines and look for anything mentioning images?
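For example, something like this on each machine should cover both the current and rotated logs (the path assumes the snap install, adjust it otherwise):

zgrep -i image /var/snap/lxd/common/lxd/logs/lxd.log*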

We have image cleanup logic that’s triggered on startup, but all it does is remove leftover on-disk artifacts for any image not present in the database. We don’t have any logic that would do the reverse.

What storage backend are you using?

Probably also worth looking into /var/snap/lxd/common/lxd/images on all systems just in case the tarballs weren’t deleted at the same time as the database records.
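For example, on each member:

ls -la /var/snap/lxd/common/lxd/images
du -sh /var/snap/lxd/common/lxd/images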

The images directory on all systems is unfortunately empty.

Primary storage is Ceph; local storage is ZFS. On some of the cluster members, a zfs list does show what appear to be images, but there are no tarballs anywhere. The logs mention pruning of expired images, and on one system there was a mention of running out of disk space during a sync operation, but nothing else beyond repeated messages about updating images.

I’m pretty sure the images were there no more than a few days ago. It’s very, very weird. Thankfully, I didn’t delete the containers from which some of the images were built. However, I admit I don’t really know which images are now gone aside from a few key ones.

This is really odd. Unfortunately, out of the box LXD doesn’t log detailed operations, just errors and warnings, so image deletion requests wouldn’t be recorded…

Those ZFS records of the images are there because ZFS cannot delete something until everything that references it is gone. In this case your containers are what’s preventing those ZFS datasets from being deleted.

Now, that may be useful to you as you could at least recover those particular images.
First you’d want to run zfs list -t snapshot -o name,clones, which will show you, for each image snapshot, which containers were cloned from it. From that list you should be able to figure out which images you’d like to recover.
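For example (the pool, fingerprint and container names below are purely illustrative):

zfs list -t snapshot -o name,clones
# NAME                                     CLONES
# default/images/<fingerprint>@readonly    default/containers/web01,default/containers/web02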

Then what you can do is zfs clone -o mountpoint=/mnt/img .../images/FINGERPRINT@readonly POOL/tmp-img and then you should see your image in /mnt/img. Make a tarball of everything in that path and feed that tarball to lxc image import to have it re-imported into LXD.
When done, do zfs destroy POOL/tmp-img and repeat with the next one.
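A rough sketch of one iteration, assuming a pool called POOL and using placeholder names for the fingerprint, tarball and alias:

# clone the image’s read-only snapshot somewhere readable
zfs clone -o mountpoint=/mnt/img POOL/images/FINGERPRINT@readonly POOL/tmp-img

# pack everything up and re-import it into LXD
tar -czf /root/FINGERPRINT.tar.gz -C /mnt/img .
lxc image import /root/FINGERPRINT.tar.gz --alias recovered-image

# clean up the temporary clone
zfs destroy POOL/tmp-img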

Ceph works in a similar way, so if you’re still missing images, you may see them listed as zombies in rbd ls --pool POOL, in which case you can use a similar trick with rbd clone and rbd map to get at their content, make a tarball and import that.
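Something along these lines might work for the Ceph side (pool, image and snapshot names are placeholders; check rbd ls and rbd snap ls for the exact names on your cluster):

rbd ls --pool POOL                          # look for zombie_image_<fingerprint> entries
rbd snap ls POOL/zombie_image_FINGERPRINT   # find the snapshot to clone from

# clone the snapshot, map it and mount it
# (the parent snapshot must be protected to clone from it; rbd snap protect if needed)
rbd clone POOL/zombie_image_FINGERPRINT@readonly POOL/tmp-img
rbd map POOL/tmp-img                        # prints a device such as /dev/rbd0
mkdir -p /mnt/img
mount /dev/rbd0 /mnt/img

# tarball and re-import, as with ZFS
tar -czf /root/FINGERPRINT.tar.gz -C /mnt/img .
lxc image import /root/FINGERPRINT.tar.gz --alias recovered-image

# clean up
umount /mnt/img
rbd unmap /dev/rbd0
rbd rm POOL/tmp-img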

Thanks, I’ll give both things a try and report back with results 🙂

So, it’s not going well. I’m trying to use the rbd command included with LXD, but with its libraries being scattered all over the place, I’m not getting far. I wrote a tiny wrapper around rbd to pull in the right libraries, but I keep running into issues. Is there a better way to be doing this?

You should be able to just install ceph-common or the equivalent package for your distro and then run the rbd command that comes with it on your system.
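On Debian/Ubuntu for instance:

apt install ceph-common

By default rbd will read /etc/ceph/ceph.conf and the keyring there; point it at your cluster’s config with -c/--keyring if it lives elsewhere.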

Okay, that worked 🙂

This happened to us as well, same LXD version on all of our clusters. Unfortunately, I don’t see any ghost images I can recover from, so I’m going to be going down the re-creation path.

lxd version 4.12 from snap, tracking latest/stable.

That’s quite worrisome… I don’t believe we have any upgrade step in 4.12 which would explain things happening on update to that version.

@kwren @atrius could both of you send me a tarball of /var/snap/lxd/common/lxd/database to stgraber at ubuntu dot com? I’m hoping that the pre-upgrade database backup may have a record of your images which would allow me to reproduce whatever happened to your systems.
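For reference, something like this should capture it (the path assumes the snap install):

tar -czf lxd-database.tar.gz -C /var/snap/lxd/common/lxd database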

Ah, and until we figure this out, I’d strongly recommend you set up some backups of /var/snap/lxd/common/lxd/images so that, should something like this happen again, you’ll have files that you can lxc image import back into LXD.
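Even a simple nightly copy would do the job, for example a cron entry along these lines (the destination path is just an example):

# /etc/cron.d/lxd-image-backup
0 3 * * * root rsync -a /var/snap/lxd/common/lxd/images/ /srv/backups/lxd-images/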

We also had this happen with multiple LXD images after Snap auto-updated to version 4.12 (or at least, that’s the version we’re on now). Here’s what we see in the logs at the time they disappeared:

t=2021-03-16T20:02:03+0000 lvl=info msg="Pruning expired images"
t=2021-03-16T20:02:03+0000 lvl=info msg="Done pruning expired images"
t=2021-03-16T20:02:03+0000 lvl=info msg="Pruning expired instance backups"
t=2021-03-16T20:02:03+0000 lvl=info msg="Updating images"
t=2021-03-16T20:02:03+0000 lvl=info msg="Done updating images"
t=2021-03-16T20:02:03+0000 lvl=info msg="Done pruning expired instance backups"
t=2021-03-16T20:02:03+0000 lvl=info msg="Updating instance types"
t=2021-03-16T20:02:03+0000 lvl=info msg="Expiring log files"
t=2021-03-16T20:02:03+0000 lvl=info msg="Done expiring log files"
t=2021-03-16T20:02:03+0000 lvl=info msg="Done updating instance types"

@stgraber Email with the database sent. I’ll add some backups for the images we save going forward!

The database you sent me appears to be for a standalone system, not a cluster. Is that correct, or am I misreading the database?

In any case, I’m seeing you had an image with an alias in there and that this is now missing.
I’ll try to reproduce the issue with that.

I managed to reproduce the issue. My current guess is that the image is getting expired for some reason… Looking into that code now.

Oh crap, I think I get what’s happening… I’m assuming all of you are running LXD on systems that aren’t in UTC and that just changed DST last weekend (North America)…

I just noticed that the expiry timestamp contains timezone information, so its comparison to the zero timestamp would fail for any image created pre-DST on a system that switched on Sunday…

Hmm, not super sure about the DST theory anymore.
Instead, it looks like the cached: true database filter we have in place isn’t being applied, so non-cached images like yours are getting considered for expiry when they most definitely shouldn’t…