Thanks, I’ll give both things a try and report back with results
So, it didn’t go well. I’m trying to use the rbd command included with LXD, but because its libraries are scattered around inside the snap it isn’t working. I wrote a tiny wrapper around rbd to set up the library paths, but I keep running into issues. Is there a better way to be doing this?
You should be able to just install ceph-common (or the equivalent package for your distro) and then run the rbd command that comes with it on your system.
Okay, that worked
This happened to us as well, with the same LXD version on all of our clusters. Unfortunately I don’t see any ghost images I can recover from, so I’m going to go down the re-create path.
lxd version 4.12 from snap, tracking latest/stable.
That’s quite worrisome… I don’t believe we have any upgrade step in 4.12 which would explain things happening on update to that version.
@kwren @atrius could both of you send me a tarball of /var/snap/lxd/common/lxd/database to stgraber at ubuntu dot com? I’m hoping that the pre-upgrade database backup may have a record of your images, which would allow me to reproduce whatever happened to your systems.
Ah, and until we figure this out, I’d strongly recommend you set up some backups of /var/snap/lxd/common/lxd/images so that, should something like this happen again, you’ll have files you can lxc image import back into LXD.
We also had this happen with multiple LXD images after Snap auto-updated to version 4.12 (or at least, that’s the version we’re on now). Here’s what we see in the logs at the time they disappeared:
t=2021-03-16T20:02:03+0000 lvl=info msg="Pruning expired images"
t=2021-03-16T20:02:03+0000 lvl=info msg="Done pruning expired images"
t=2021-03-16T20:02:03+0000 lvl=info msg="Pruning expired instance backups"
t=2021-03-16T20:02:03+0000 lvl=info msg="Updating images"
t=2021-03-16T20:02:03+0000 lvl=info msg="Done updating images"
t=2021-03-16T20:02:03+0000 lvl=info msg="Done pruning expired instance backups"
t=2021-03-16T20:02:03+0000 lvl=info msg="Updating instance types"
t=2021-03-16T20:02:03+0000 lvl=info msg="Expiring log files"
t=2021-03-16T20:02:03+0000 lvl=info msg="Done expiring log files"
t=2021-03-16T20:02:03+0000 lvl=info msg="Done updating instance types"
@stgraber Email with database logs sent - I’ll add some backups in for images we save going forward!
The database you sent me appears to be for a standalone system, not a cluster, is that correct or am I misreading the database?
In any case, I can see you had an image with an alias in there, and that it’s now missing.
I’ll try to reproduce the issue with that.
I managed to reproduce the issue. My current guess is that the image is getting expired for some reason… Looking into that code now.
Oh crap, I think I get what’s happening… I’m assuming all of you are running LXD on systems that aren’t in UTC and that just changed DST last weekend (North America)…
I just noticed that the expiry timestamp contains timezone information, so its comparison to the zero value would fail for any image created pre-DST on a system that switched on Sunday…
Hmm, not super sure about the DST theory anymore.
Instead it looks like it’s the cached: true database filter we have in place that isn’t applying, so non-cached images like yours are getting considered when they most definitely shouldn’t…
And I’ve got a fix for the issue. The timestamp wasn’t the problem; the auto-generated database code for that table was… When we added support for configurable image expiry, the DB filter was adjusted to filter on a per-project basis.
But the generated DB code wasn’t updated to produce correct queries for a combined project+cached filter, causing the cached part to be ignored and the query to filter based only on projects…
https://github.com/lxc/lxd/pull/8579 is the fix for this issue (and contains the full explanation of what happened).
Please note that anyone affected may be re-affected by this until the bugfix is made available to all snap users! It won’t hit you until 10 days after you last used a particular image; I intend to roll out the fix to all stable users tomorrow.
The fix has been merged upstream and cherry-picked into our snap packaging branch. I’ll now let it build, and once it’s built and CI passes, I’ll push it to stable; ETA is 2-3 hours.
Glad this got sorted in the end
Took longer than expected because of some issues on the package builders requiring many, many retries, but it’s all good now and in stable.
The database you sent me appears to be for a standalone system, not a cluster, is that correct or am I misreading the database?
Ah, yes, my mistake: this one was not clustered, as you noticed. It was our newest and least utilized, but it still had the issue!
Wow, I did not expect something like this to be fixed this fast! It proves again how excellent this project is. Thanks for all your help, as always!