Database error: "sql: transaction has already been committed or rolled back"

In terms of LXD DB changes? That would have been our staging server (now removed) attempting an lxd recover when I first posted in the thread.

I’ve now spun that out into a separate LXD instance, and the rest of the LXD cluster appears to be running perfectly fine now - I can create new containers etc.


OK, that's good to hear. If you want to try moving the leader around, running sudo systemctl reload snap.lxd.daemon on the current leader will cause it to step down and a new leader to be promoted, so you can see which placement gives the best performance for the most members.
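A minimal sketch of that, assuming a snap-based member and an LXD version whose lxc cluster list output includes the database-leader role:

# On the current leader: ask LXD to step down so dqlite promotes another member
sudo systemctl reload snap.lxd.daemon

# From any member: check which member now holds the database-leader role
lxc cluster list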

Yeah already tried that last night, thanks


OK one solved, thanks @theblazehen

@kpfa @johanehnberg as soon as you can get some logs let me know.

Thanks

Sorry it may take a while.

That said, one more thing came to mind about the correlation: the affected hosts were all installed between May 2020 and March 2021, while the unaffected host was installed in October 2021. (The kernel versions correlate inversely, by coincidence.) That means it may be worth testing with whatever LXD version (and potentially other packages) was current back when 20.04 was released, or in March 2021.

I also observed that 5.3 had caused minor data corruption. When migrating one of the hosts back to 5.2, one of its containers would no longer start up due to Lxc snapshot and lxc start Error: Instance snapshot record count doesn't match instance snapshot volume record count. The container had obviously started fine on 5.2 before, so the inconsistency was introduced during its brief encounter with 5.3.

The snapshot issue was likely introduced on that container a long time ago, and perhaps the container has not been restarted since LXD 5.2 (which added consistency checks).

Unless you know it was restarted after LXD 5.2 was applied? There’s a limited patch in LXD 5.3 to try and repair the missing records where possible, but it’s not possible in all cases. The patch never deletes records, though.

See Lxc snapshot and lxc start Error: Instance snapshot record count doesn't match instance snapshot volume record count - #53 by tomp for more discussion.

So as not to confuse this thread, please can you post any further details about that on the thread you linked to, thanks.

Good point. That particular host was restarted on 22 May, i.e. some days before LXD 5.2 was introduced. As such, the issue could not have been triggered even if it was present.


We have a working theory on what is causing this, along with a pull-request to restore the LXD 5.2 performance that we are currently testing:

Our performance metrics show the time to create 128 containers has returned to pre-LXD 5.3 levels (the regression was caused by an increase in the time it takes to do an instance list, which is required for instance device validation):

I think I am affected by a very similar issue to the one described in this thread. However, LXD 5.4 (which contains the above-mentioned pull request) doesn’t solve this for me.

My LXD server runs seven containers, of which I use about three almost daily. I can tell that these were running until I granted the server a well-deserved reboot about two days ago. I had noticed that before the reboot, lxc ls was taking a very long time and already showed the ERROR state on all containers. However, I thought the reboot that I had scheduled anyway would fix the situation. If only I had known…

If I try to start a container, the error message reads Error: Failed to get snapshots: Failed to fetch from "instance_snapshot_config" table: sql: Rows are closed. So for me it is a table related to snapshots that seems to time out.
I thought this could be related to the fact that I create daily snapshots for backup purposes, which I used to transfer to another LXD instance overnight. I have not yet implemented a deletion process for older snapshots, so I am sure that a few hundred per container will have accumulated by now.
The backup process has not been working for a while but still sends nightly emails, so I can tell that since June 30th the process errors out with the message Error: Failed to get snapshots: Failed to fetch from "config" table: sql: Rows are closed. Sometimes it will show Error: Failed to get snapshots: Failed to fetch from "instances_profiles" table: sql: transaction has already been committed or rolled back instead.

This is around the time that this thread was started and LXD 5.4 was released, so the issue has likely also been introduced with this version.

If you need any more information, please let me know! I would love to have my containers back up and running as fast as possible :).

This issue was introduced with LXD 5.3 due to the database generator refactor using an inefficient query plan when listing instances and snapshots. The instance list issue was solved in LXD 5.4, although the instance snapshot list inefficiency still remains and I will be working on a fix next.

Is there anything I could do about this? Manually delete older snapshots? Downgrade to LXD 5.2 using snap? Could this break anything?

I am mainly interested in getting the containers back up and running, which currently does not work because the startup apparently queries the snapshot tables.

Reducing the number of snapshots using lxc delete <instance>/<snapshot> could help.
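A rough sketch of pruning along those lines; the instance name c1, the keep-the-newest-50 policy and the assumption that the API lists snapshots in creation order are all just illustrative, and it needs jq and GNU head:

# List the snapshot URLs of one instance via the API, keep the newest 50,
# and delete the rest (adjust the instance name and count as needed)
for url in $(lxc query /1.0/instances/c1/snapshots | jq -r '.[]' | head -n -50); do
    lxc delete "c1/${url##*/}"
done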

lxc delete does not work (same error message). I was able to list the available snapshots using zfs list -t snapshot. Can I delete them with the zfs utilities as well, or will this cause inconsistency between the LXD database and the actual filesystem state? Even if I can delete them this way, they would likely still exist in the database, and therefore the respective query would still take a long time to complete, causing the above-mentioned error, right?
Could this be fixed by something like lxd recover after deleting the snapshots manually?

Yes, deleting them on the storage device won’t have any impact.

I’ve been trying to reproduce the issue at hand, and I’m pretty sure I’ve found where the problem is, although I’ve not been able to reproduce the Failed to get snapshots: Failed to fetch from "config" table: sql: Rows are closed error. But perhaps my system is faster/less loaded than yours and is still able to return the result sets before the timeout.

Anyway, irrespective of that, I can certainly see a slowdown of the lxc ls command for instances with hundreds of snapshots.

My working PR is here:

In my tests it reduced the lxc list time from 8.5s to 0.9s.

You could remove the snapshot records from the database manually using:

lxd sql global 'select instances.name as instance_name, instances_snapshots.id as snapshot_id, instances_snapshots.name as snapshot_name from instances_snapshots join instances on instances.id = instances_snapshots.instance_id order by instance_name, snapshot_id'
+---------------+-------------+---------------+
| instance_name | snapshot_id | snapshot_name |
+---------------+-------------+---------------+
| c1            | 415         | snap0         |
| c1            | 416         | snap1         |
| c1            | 417         | snap2         |
| c1            | 418         | snap3         |
| c1            | 419         | snap4         |
| c1            | 420         | snap5         |
| c1            | 421         | snap6         |
| c1            | 422         | snap7         |
| c1            | 423         | snap8         |
| c1            | 424         | snap9         |
| c1            | 425         | snap10        |
+---------------+-------------+---------------+

and for the associated storage volume snapshot records:

lxd sql global 'select storage_volumes.name as instance_name, storage_volumes_snapshots.id as instance_volume_snapshot_id, storage_volumes_snapshots.name as snapshot_name from storage_volumes_snapshots join storage_volumes on storage_volumes.id = storage_volumes_snapshots.storage_volume_id where storage_volumes.type = 0 order by instance_name, instance_volume_snapshot_id'
+---------------+-----------------------------+---------------+
| instance_name | instance_volume_snapshot_id | snapshot_name |
+---------------+-----------------------------+---------------+
| c1            | 9472                        | snap0         |
| c1            | 9473                        | snap1         |
| c1            | 9474                        | snap2         |
| c1            | 9475                        | snap3         |
| c1            | 9476                        | snap4         |
| c1            | 9477                        | snap5         |
| c1            | 9478                        | snap6         |
| c1            | 9479                        | snap7         |
| c1            | 9480                        | snap8         |
| c1            | 9481                        | snap9         |
| c1            | 9482                        | snap10        |
+---------------+-----------------------------+---------------+

Identify the rows you can remove and then run:

lxd sql global 'delete from instances_snapshots where id = <snapshot_id>'
lxd sql global 'delete from storage_volumes_snapshots where id = <instance_volume_snapshot_id>'

Thank you so much for your support so far!
I am getting the error Error: Failed to execute query: no such table: instance_snapshots for your first suggested query.
The other one works fine, and select count(*) from storage_volumes_snapshots shows that I have accumulated about 1500 snapshots over time… :see_no_evil:

Is it okay to just delete entries from the second table or should the first query work, too? Does this indicate an error with my lxd installation?

Additionally: I suppose I should combine a database deletion with a manual deletion of the actual zfs snapshot to keep everything in sync, am I right?

Sorry, it’s instances_snapshots.

Yes, deleting the associated volume on disk would be needed too.

I’ve updated the queries above to aid in deleting snapshots of specific instances.

Thanks again @tomp. With some SQL tweaks (for me it worked well to filter by instances_snapshots.creation_date and delete all old snapshots up to a certain date) and some further command-line magic (looping over the IDs and ultimately over the snapshot names for zfs destroy), I could bring my server back up and running.
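For anyone in the same spot, a rough sketch of that kind of clean-up; the cut-off date, the pool name (default) and the ZFS dataset layout are assumptions to adjust to your setup, and the example IDs reuse the snap2 rows from the tables above:

# 1. Find the database IDs and names of the old instance snapshot records
lxd sql global "select instances.name, instances_snapshots.id, instances_snapshots.name from instances_snapshots join instances on instances.id = instances_snapshots.instance_id where instances_snapshots.creation_date < '2022-06-01'"

# 2. For each row: delete the instance snapshot record, its storage volume
#    snapshot counterpart (see the queries above), and the snapshot on disk
lxd sql global 'delete from instances_snapshots where id = 417'
lxd sql global 'delete from storage_volumes_snapshots where id = 9474'
zfs destroy default/containers/c1@snapshot-snap2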

I notice that lxc ls is still a bit slow (around 11.2 seconds), but that is expected as I kept around 50 snapshots per container. Currently, no errors are shown and all containers boot up without issues. I will tweak my backup process and implement some pruning of older snapshots to avoid similar situations in the future.
