This started happening after the snap auto upgrade to lxd 5.2.
The original message when I would try to snapshot with
lxc snapshot container-name
is something like:
Error: Create instance snapshot (mount source): Failed to run: zfs set mountpoint=legacy canmount=noauto zfs-volume-name/containers/container-name: umount: /var/snap/lxd/common/shmounts/storage-pools/default/containers/container-name: no mount point specified.
cannot unmount '/var/snap/lxd/common/shmounts/storage-pools/default/containers/container-name': umount failed
After trying to remount, I now get this error whenever I try to snapshot or restart:
Error: Instance snapshot record count doesn't match instance snapshot volume record count
Some containers are okay, but some that I haven't yet touched give the same error when attempting a snapshot or restart:
Error: Instance snapshot record count doesn't match instance snapshot volume record count
I looked at one in question, trying to guess at what it was complaining about.
In the sqlite db, I ran these queries for an affected one
SELECT COUNT(*) FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'container-name';
SELECT COUNT(*) FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'container-name';
Both queries return a count of 31, which is the number of snapshots I see, so I'm not sure what it is complaining about.
I was mistaken above: that was a good container. For the bad container in question, the numbers are off: instances_snapshots has a count of 36, while storage_volumes_snapshots has a count of 32.
I looked at the output of these queries:
SELECT vs.* FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'container-name';
SELECT vs.* FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'container-name';
And determined by name which ones don’t match:
SELECT i.id
FROM
(SELECT vs.* FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'container-name') AS i
LEFT JOIN
(SELECT vs.* FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'container-name') AS v ON i.name = v.name
WHERE v.name IS NULL;
This output the IDs 98, 94, 149, 945.
Then I did:
lxd sql global "DELETE FROM instances_snapshots WHERE id IN(98,94,149,945)"
and was then able to take a snapshot of the container.
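To show the logic of this anti-join in one self-contained place, here is a minimal sketch using Python's sqlite3 module against a toy schema (an assumption on my part; the real LXD tables have more columns, and 'container-name' and the snapshot names are placeholders):

```python
import sqlite3

# Toy schema loosely modelled on LXD's global database (the real tables
# have more columns). Demonstrates the LEFT JOIN / IS NULL anti-join used
# above to find instance snapshot records with no matching volume record.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE instances (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE instances_snapshots (
    id INTEGER PRIMARY KEY, instance_id INTEGER, name TEXT);
CREATE TABLE storage_volumes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE storage_volumes_snapshots (
    id INTEGER PRIMARY KEY, storage_volume_id INTEGER, name TEXT);

INSERT INTO instances VALUES (1, 'container-name');
INSERT INTO storage_volumes VALUES (1, 'container-name');

-- Three instance snapshots, but only two have volume records:
INSERT INTO instances_snapshots VALUES (10, 1, 'snap0'), (11, 1, 'snap1'),
                                       (12, 1, 'snap2');
INSERT INTO storage_volumes_snapshots VALUES (20, 1, 'snap0'),
                                             (21, 1, 'snap1');
""")

# Same shape as the query above: instance snapshots whose name has no
# counterpart among the volume snapshots come back with a NULL right side.
orphans = db.execute("""
SELECT i.id FROM
  (SELECT vs.* FROM instances AS v
   JOIN instances_snapshots AS vs ON v.id = vs.instance_id
   WHERE v.name = 'container-name') AS i
LEFT JOIN
  (SELECT vs.* FROM storage_volumes AS v
   JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id
   WHERE v.name = 'container-name') AS v ON i.name = v.name
WHERE v.name IS NULL
""").fetchall()
orphan_ids = [row[0] for row in orphans]
print(orphan_ids)  # [12] -- the snapshot with no volume record
```

The resulting IDs are what would feed a DELETE like the one above, although as noted further down, deleting the records outright can orphan snapshots on disk.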
I have yet another container with the same issue. I suspect I have so many of these because I have another LXD taking snapshots of this one, and maybe it was in upgrade mode while it was snapshotting.
For this particular one, the storage_volumes_snapshots count was 37, while the instances_snapshots count was 32, so I did the reverse of the above.
The queries above were also run against a backup of the database, using:
sudo cp /var/snap/lxd/common/lxd/database/global/db.bin lxd-global-220601
sqlite3 lxd-global-220601
.tables
.mode column
.headers on
SELECT v.id
FROM
(SELECT vs.* FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'container-name') AS v
LEFT JOIN
(SELECT vs.* FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'container-name') AS i ON i.name = v.name
WHERE i.name IS NULL;
This resulted in these IDs for storage_volumes_snapshots:
4701
4714
4737
4761
4779
and then I ran:
lxd sql global "DELETE FROM storage_volumes_snapshots WHERE id IN(4701,4714,4737,4761,4779)"
to delete them and was then able to take a snapshot of the container and start it up.
I'd mark this as a solution, but I'm not sure it was the right thing to do, though it seemed to work.
As of today, I get the same error message when trying to start my instance.
$ lxc start myinstance
Error: Instance snapshot record count doesn't match instance snapshot volume record count
This means I cannot start and work with the instance anymore.
I have started this instance every day for two years without problems.
I guess I also got the (automatic) update to LXD 5.2 via snap yesterday, which is causing this problem. I am on 5.2 now.
lxc info myinstance shows 7 snapshots:
If I do zfs list rpool/lxd/containers/myinstance -t snapshot I only get 6 snapshots!
The one “2020/11/20 11:39 CET” is missing.
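For more than a handful of snapshots, a set difference between the names reported by lxc info and by zfs list spots the gap faster than eyeballing. A minimal sketch with made-up snapshot names (paste in the real ones from both commands):

```python
# Hypothetical snapshot name lists -- in practice, fill these in from the
# output of `lxc info myinstance` and `zfs list -t snapshot` respectively.
db_snapshots = {"snap0", "snap1", "snap2", "snap3", "snap4", "snap5", "snap6"}
zfs_snapshots = {"snap0", "snap1", "snap2", "snap3", "snap4", "snap5"}

# Snapshots LXD's database knows about but ZFS does not have on disk:
missing_on_disk = sorted(db_snapshots - zfs_snapshots)
print(missing_on_disk)  # ['snap6']
```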
Not sure how to recover from this and what to do?
Should I hold off updating to LXD 5.2 on my other machines?
It's possible the fault in the db already existed, but LXD 5.2 is more thorough with its consistency checks (I've been tightening them up). Does it still happen on LXD 5.2 with a fresh instance?
I'll confirm the check is behaving as expected. If so, the fix will be to bring the db records in line with expectations, to avoid unexpected issues in the future.
Yeah, OK, then maybe a --repair flag would be the thing?
If this happens in a production environment, there may be too much at stake to manually repair the db entries if you have never done this before.
At the very least, the error message should give more details on where to look and what action to take to repair this.
BTW, I have no clue how to repair this. I just reverted to 5.1, so if I update to 5.2 again I will have the problem again. I also have no clue why the instance and volume snapshot counts drift apart.
This new consistency check, performed when generating the backup.yaml file at instance start time, is the cause of the error. It's new in LXD 5.2.
But the actual record mismatch likely existed before the upgrade; I'll double-check the record cleanup logic on snapshot failure that you described above.
Doing lxc delete instance/snapshot for the snapshots with missing volume db records should fix it and bring things in line, if it is acceptable to lose those snapshots.
Don’t just delete the problem db records otherwise you’ll leave the actual snapshots orphaned on disk.
Alternatively we will have to craft a custom insert statement to restore the missing volume db record.
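As a hedged sketch of what such an insert could look like, here is the idea on a toy version of the schema (the real storage_volumes_snapshots table has more columns, e.g. description and expiry date, so inspect the actual schema with .schema before attempting anything like this on a real database):

```python
import sqlite3

# Toy schema again; NOT the real LXD table definitions. The point is only
# the shape of the restoring INSERT: copy the parent volume id from
# storage_volumes and recreate the missing snapshot row by name.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE storage_volumes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE storage_volumes_snapshots (
    id INTEGER PRIMARY KEY, storage_volume_id INTEGER, name TEXT);
INSERT INTO storage_volumes VALUES (1, 'container-name');
INSERT INTO storage_volumes_snapshots VALUES (20, 1, 'snap0');
""")

# 'snap1' exists on disk but has no db record; recreate it, taking the
# storage_volume_id from the parent volume row:
db.execute("""
INSERT INTO storage_volumes_snapshots (storage_volume_id, name)
SELECT id, 'snap1' FROM storage_volumes WHERE name = 'container-name'
""")

count = db.execute(
    "SELECT COUNT(*) FROM storage_volumes_snapshots").fetchone()[0]
print(count)  # 2
```

On a real deployment the equivalent statement would go through lxd sql global, and getting every column right matters, which is why a built-in repair path would be preferable.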
Going by creation_date, in our case all recent (>= 2021-09-09) containers start up fine, but the old ones (<= 2021-08-04) all have issues.
Where the old containers are supposed to have 8 backups, one container I just checked has 23, going all the way back to 2021-11; other containers go back to 2021-08, etc.
Deleting snapshots with lxc delete does not work either; it fails with the exact same error:
# lxc delete cont/autosnapshot-20220225-100052
Error: Instance snapshot record count doesn't match instance snapshot volume record count
Hrm, I'll probably have to put an LXD startup db patch in to create db records for the missing snapshot volume entries. Or something in that backup generator, as I really don't want to be dealing with an inconsistent database or backup file (it kind of defeats the purpose otherwise).
It suggests at some point the snapshot operation was not creating storage volume db records in certain scenarios.