lxc snapshot and lxc start error: Instance snapshot record count doesn't match instance snapshot volume record count

This started happening after the snap auto-upgrade to LXD 5.2.

The original message when I would try to snapshot with

lxc snapshot container-name

is something like:

Error: Create instance snapshot (mount source): Failed to run: zfs set mountpoint=legacy canmount=noauto zfs-volume-name/containers/container-name: umount: /var/snap/lxd/common/shmounts/storage-pools/default/containers/container-name: no mount point specified.
cannot unmount '/var/snap/lxd/common/shmounts/storage-pools/default/containers/container-name': umount failed
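
For anyone hitting the same thing, the mount state at that point can be inspected with something like this (just a sketch; the dataset and path are taken from the error above):

# What ZFS thinks about the dataset's mount properties
zfs get mountpoint,canmount,mounted zfs-volume-name/containers/container-name
# Whether the kernel still has a mount entry for the container path
grep container-name /proc/mounts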

After trying to remount, I now get this error whenever I try to snapshot or restart:

Error: Instance snapshot record count doesn't match instance snapshot volume record count

Some containers are okay; others that I haven't yet touched give the same error below when attempting a snapshot or restart:

Error: Instance snapshot record count doesn't match instance snapshot volume record count

I looked at one affected container, trying to work out what it was complaining about.

In the SQLite DB, I ran these queries for an affected container:

SELECT COUNT(*) FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'container-name';


SELECT COUNT(*) FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'container-name';

Both queries return a count of 31, which is the number of snapshots I see, so I'm not sure what it is complaining about.

I was mistaken above: that was a good container. For the bad container in question, the numbers are off.

instances_snapshots has a count of 36, while the storage_volumes_snapshots has a count of 32.
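
Both counts can also be fetched side by side in one statement. This is just a convenience sketch over the same tables as above; 'container-name' is a placeholder, and if you have multiple pools or projects the storage_volumes filter may need narrowing further:

SELECT
  (SELECT COUNT(*) FROM instances AS i INNER JOIN instances_snapshots AS s ON i.id = s.instance_id WHERE i.name = 'container-name') AS instance_snapshot_count,
  (SELECT COUNT(*) FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'container-name') AS volume_snapshot_count;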

I looked at the outputs:

SELECT vs.* FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'container-name';


SELECT vs.* FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'container-name';

And determined by name which ones don't match:

SELECT i.id
FROM (SELECT vs.* FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'container-name') AS i
LEFT JOIN (SELECT vs.* FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'container-name') AS v ON i.name = v.name
WHERE v.name IS NULL;

This output the IDs 98, 94, 149, and 945.
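
Before deleting anything, those rows can be inspected by ID; a sketch, where the extra column names are assumptions (.schema instances_snapshots in sqlite3 shows the real ones):

SELECT id, name, creation_date FROM instances_snapshots WHERE id IN (98, 94, 149, 945);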

Then I did:

lxd sql global "DELETE FROM instances_snapshots WHERE id IN(98,94,149,945)"

and was then able to take a snapshot of the container.

I have yet another container with the same issue. I suspect the reason I have so many of these is that I have another LXD taking snapshots of this one, and maybe it was in upgrade mode while it was snapshotting.

For this particular container the storage_volumes_snapshots count was 37, while the instances_snapshots count was 32, so I did the reverse of the above.

The queries above were also run against a backup of the database, made like this:

 sudo cp /var/snap/lxd/common/lxd/database/global/db.bin lxd-global-220601

sqlite3 lxd-global-220601

.tables
.mode column
.headers on

SELECT v.id
FROM (SELECT vs.* FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS vs ON v.id = vs.storage_volume_id WHERE v.name = 'container-name') AS v
LEFT JOIN (SELECT vs.* FROM instances AS v INNER JOIN instances_snapshots AS vs ON v.id = vs.instance_id WHERE v.name = 'container-name') AS i ON i.name = v.name
WHERE i.name IS NULL;

This resulted in these IDs from storage_volumes_snapshots:

4701
4714
4737
4761
4779

and then I ran:

lxd sql global "DELETE FROM storage_volumes_snapshots WHERE id IN(4701,4714,4737,4761,4779)"

to delete them and was then able to take a snapshot of the container and start it up.
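
As a sanity check afterwards, the remaining volume snapshot records can be compared against what is actually on disk; a rough sketch, where 'container-name' and the zfs-volume-name pool are placeholders:

# Snapshot names LXD still has records for
lxd sql global "SELECT s.name FROM storage_volumes AS v INNER JOIN storage_volumes_snapshots AS s ON v.id = s.storage_volume_id WHERE v.name = 'container-name'"
# Snapshots ZFS actually has on disk
zfs list -H -o name -t snapshot zfs-volume-name/containers/container-name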

I'd mark this as a solution, but I'm not sure it was the right thing to do, though it seemed to work.


Please can you log an issue at Issues · lxc/lxd · GitHub with the details, as it may be that the original unmount fault caused the record mismatch.

Please can you show "lxc storage show pool"?

Thanks

Hi!

Since today I get the same error message when trying to start my instance.

$ lxc start myinstance 
Error: Instance snapshot record count doesn't match instance snapshot volume record count

This means I cannot start and work with the instance anymore.

I have started this instance every day for 2 years without problems.
I guess I got the (automatic) update to LXD 5.2 via snap yesterday too, which is causing this problem.
I am on 5.2 now.

lxc info myinstance shows 7 snapshots:
[screenshot: lxc info output listing 7 snapshots]

If I do zfs list rpool/lxd/containers/myinstance -t snapshot I only get 6 snapshots!
The one from "2020/11/20 11:39 CET" is missing.
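
To pin down which name differs, the DB's snapshot list can be compared with ZFS directly; a sketch (LXD's ZFS driver typically names snapshots dataset@snapshot-<name>):

# Snapshot names according to the LXD database
lxd sql global "SELECT s.name FROM instances AS i INNER JOIN instances_snapshots AS s ON i.id = s.instance_id WHERE i.name = 'myinstance' ORDER BY s.name"
# Snapshot names according to ZFS
zfs list -H -o name -t snapshot rpool/lxd/containers/myinstance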

I'm not sure how to recover from this and what to do.
Should I hold off on updating to LXD 5.2 on my other machines?

Cheers.

OK. I reverted to 5.1 with snap like this:

snap revert lxd

and now it is working again.

So it must be something with 5.2.
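
To keep snapd from auto-refreshing back to 5.2 in the meantime, one option is a system-wide refresh hold; a sketch that assumes GNU date and affects all snaps on the machine:

# Hold all snap refreshes for 30 days (RFC3339 timestamp required)
sudo snap set system refresh.hold="$(date --date='30 days' +%Y-%m-%dT%H:%M:%S%:z)"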

It's possible the fault in the DB still exists but LXD 5.2 is more thorough with its consistency checks (I've been tightening them up). Does it still happen on LXD 5.2 with a fresh instance?

I am able to start another instance, which is relatively young but barely used.
For such cases, maybe an --ignore-errors flag could be helpful?

I'll confirm the check is behaving as expected. But if so, then the fix will be to bring the DB records in line with expectations, to avoid unexpected issues in the future.

Yeah, OK, then maybe a --repair flag would be the thing?
If this happens in a production environment, there is maybe too much sweat involved in manually repairing the DB entries if you have never done it before.
At the very least, the error message should give more details on where to look and what action to take to repair this.

BTW, I have no clue how to repair this. I just reverted to 5.1, so if I update to 5.2 again I will have the problem again. I also have no clue why the instance and volume counts drift apart.

How old are the problem containers? It may be that a fault crept in a while back. This is all conjecture currently, as I'm not at my PC.

This new consistency check, which runs when generating the start-time backup.yaml file, is the cause of the error.

It's new in LXD 5.2.

But the actual record mismatch likely predates 5.2; I'll double-check the record cleanup logic on snapshot failure that you described above.

Doing lxc delete instance/snapshot for the snapshots with missing volume DB records should fix it and bring things in line, if losing those snapshots is acceptable.

Don't just delete the problem DB records, otherwise you'll leave the actual snapshots orphaned on disk.

Alternatively, we will have to craft a custom INSERT statement to restore the missing volume DB record.
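
Such an insert would look roughly like the following. It is only a sketch: the column list is an assumption (verify it with .schema storage_volumes_snapshots against a DB copy), every value is a placeholder, storage_volume_id must point at the parent volume's row, and the explicit id must not collide with existing volume or snapshot IDs:

# All values here are placeholders; check the real schema and parent volume id first,
# e.g. sqlite3 <db-backup> ".schema storage_volumes_snapshots"
lxd sql global "INSERT INTO storage_volumes_snapshots (id, storage_volume_id, name, description, expiry_date) VALUES (9999, 123, 'missing-snapshot-name', '', NULL)"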

(coming from Can't start containers - Error: Instance snapshot record count doesn't match instance snapshot volume record count)

Going by creation_date, in our case all recent (>= 2021-09-09) containers start up fine, but the old ones (<= 2021-08-04) all have issues.

Where the old containers are supposed to have 8 backups, on one container I just checked we have 23, going all the way back to 2021-11. Other containers go back to 2021-08, etc.

Deleting the snapshots with lxc delete does not work either; it fails with the exact same error:

# lxc delete cont/autosnapshot-20220225-100052                            
Error: Instance snapshot record count doesn't match instance snapshot volume record count

The path in the actual original error looks strange for your instance:

/var/snap/lxd/common/shmounts/storage-pools/default/containers/container-name

The shmounts part is strange and looks out of place.

Can you show 'lxc storage show default' please?

Hrm, I'll probably have to put an LXD startup DB patch in to create DB records for the missing snapshot volume entries. Or something in that backup generator, as I really don't want to be dealing with an inconsistent database or backup file (it kind of defeats the purpose otherwise).

It suggests that at some point the snapshot operation was not creating storage volume DB records in certain scenarios.

I'm not following what you mean here?