Database error: "sql: transaction has already been committed or rolled back"

I noticed that our containers were not purging their snapshots after their expiration.

I thought maybe upgrading the snap to 5.3 “5.3-924be6a” might help, but now all of our containers are in ERROR state.

lxc monitor --type=logging --pretty shows:

WARNING[2022-06-27T22:27:30-07:00] Transaction timed out. Retrying once          err="Failed to begin transaction: context deadline exceeded" member=1
DEBUG  [2022-06-27T22:27:30-07:00] Database error                                err="sql: transaction has already been committed or rolled back"
WARNING[2022-06-27T22:27:30-07:00] Failed to rollback transaction after error (Failed to fetch from "config" table: sql: Rows are closed): sql: transaction has already been committed or rolled back 
DEBUG  [2022-06-27T22:27:30-07:00] Database error                                err="Failed to fetch from \"config\" table: sql: Rows are closed"
DEBUG  [2022-06-27T22:27:30-07:00] Dqlite: attempt 1: server 1: connected       
DEBUG  [2022-06-27T22:27:30-07:00] GetInstanceUsage started                      instance=container project=default
DEBUG  [2022-06-27T22:27:30-07:00] GetInstanceUsage finished                     instance=container project=default
WARNING[2022-06-27T22:27:40-07:00] Transaction timed out. Retrying once          err="Failed to begin transaction: context deadline exceeded" member=1

Anyway to recover /rebuild database?

I’m fearful to stop other containers since I just tried with a testing one and now it’s unable to start again:

user@srv5:~$ lxc stop u14
user@srv5:~$ lxc start u14
Error: Failed to run: zfs set mountpoint=legacy canmount=noauto data/lxc/containers/u14: umount: /var/snap/lxd/common/shmounts/storage-pools/data/containers/u14: no mount point specified.
cannot unmount ‘/var/snap/lxd/common/shmounts/storage-pools/data/containers/u14’: umount failed
Try lxc info --show-log u14 for more info
1 user@srv5:~$ lxc info --show-log u14
Name: u14
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/06/14 17:25 PDT
Last Used: 2022/06/14 17:25 PDT

Log:

So looks like you’re dealing with two separate issues. DB timeouts and the ZFS mount issue from ZFS shmounts issues · Issue #61 · lxc/lxd-pkg-snap · GitHub

The latter can be worked around through nsenter, though if you can deal with the temporary downtime, a host reboot will get you to a much more reliable state.

The former is very odd. Is your system experiencing a lot of load at the moment?
Did you try restarting the LXD daemon with systemctl reload snap.lxd.daemon?
Also take a quick look at df -h and check that you’re not running out of disk space.

I’ve ran into ZFS shmounts issues · Issue #61 · lxc/lxd-pkg-snap · GitHub many times before :slightly_frowning_face: and reboots usually do bring the system into a more usable state.

Here’s a glances screenshot to show load, the memory is a bit tight, sufficient disk space. This load is fairly consistent on this machine. I did try, systemctl reload snap.lxd.daemon but same situation.

I’m okay with loosing the snapshots if there’s a simple way to bring the containers out of ERROR

I rebooted the server, removed the lxd snap, reinstalled lxd snap. Attempted lxd recover imported containers and still same error and all containers are unable to start.

DEBUG  [2022-06-30T00:08:58-07:00] Database error                                err="Failed to fetch from \"config\" table: sql: Rows are closed"
DEBUG  [2022-06-30T00:08:58-07:00] Database error                                err="sql: transaction has already been committed or rolled back"
DEBUG  [2022-06-30T00:08:58-07:00] Dqlite: attempt 1: server 1: connected       
WARNING[2022-06-30T00:08:59-07:00] Transaction timed out. Retrying once          err="Failed to begin transaction: context deadline exceeded" member=1
WARNING[2022-06-30T00:08:59-07:00] Transaction timed out. Retrying once          err="Failed to begin transaction: context deadline exceeded" member=1
WARNING[2022-06-30T00:08:59-07:00] Transaction timed out. Retrying once          err="Failed to begin transaction: context deadline exceeded" member=1
DEBUG  [2022-06-30T00:09:08-07:00] Database error                                err="Failed to begin transaction: context deadline exceeded"
ERROR  [2022-06-30T00:09:08-07:00] Error loading storage pool                    err="Failed getting instance storage pool name: Failed to begin transaction: context deadline exceeded" instance=container_name instanceType=container project=default

When I try to edit a config, I receive an error about instances_profiles

lxc config edit container_name
Config parsing error: Failed to write backup file: Failed to get snapshots: Failed to fetch from "instances_profiles" table: sql: transaction has already been committed or rolled back

When I try to start a container

Error: Failed to begin transaction: context deadline exceeded

In the end, since I had to get the containers back online I deleted all zfs snapshots, uninstalled lxd, edited the backup.yaml config to remove snapshots and reinstalled lxd, did a lxd recover and I’m back online.

Which version of LXD did these occur on? I ran into some of the same DB and timeout messages, specifically related to LXD 5.3. If your issues were on some other version, it is not related.

Details here: Timeouts after upgrading to LXD 5.3

If your version is 5.3, the workaround may be just a small downgrade:
sudo snap revert lxd

Thanks @johanehnberg I believe they were related as I didn’t see the below error on 5.2 only 5.3:

Error: Failed to fetch from "config" table: sql: transaction has already been committed or rolled back

We also saw the same message with reference to the profiles and instances_profiles table.

After the wipe of the zfs snapshots and editing backup.yaml to recover, the system seems stable at the moment on 5.3-924be6a.

However, that was an all-nighter and 4 hours of downtime.

We do have another box that is also on lxd 5.3 and it’s running, however an lxd profile edit macvlan shows troubling signs:

Config parsing error: The following instances failed to update (profile change still saved):
 - Project: default, Instance: streams: Failed to write backup file: Failed to run: zfs set mountpoint=legacy canmount=noauto data/containers/streams: umount: /var/snap/lxd/common/shmounts/storage-pools/data/containers/streams: no mount point specified.
cannot unmount '/var/snap/lxd/common/shmounts/storage-pools/data/containers/streams': umount failed

If it enters this error state I will try your snap revert suggestion.

@masnax @tomp any idea what may be going on here?

I wonder if this is related to the Profiles refactoring to not necessarily query for Config ?

The unmount error will be due to the long running problem of zfs and snap not working together after a refresh (sometimes).

The transaction time out errors normally indicate a slow I/o subsystem. Its not something we’ve seen in our automated tests though.

Only difference I can think of is that in lxd 5.3 the changes @masnax has been working on mean that lxd will perform more queries inside fewer transactions (for efficiency) so if the system is performing poorly for some reason there maybe a per transaction timeout occurring more frequently. I would need to check exactly what db timeouts we have in place though.

Thank you for this insight, I wonder though what kind of performance is needed since we were unable to start even one container with 5.3 until we removed all zfs snapshots, modified backup.yaml and reinstalled?

The ZFS mount issue requires a restart rather than removing snapshots and LXD.

See https://github.com/lxc/lxd-pkg-snap/issues/61

We’ve seen this error before in relation to transaction contention related to slow I/O, see

Yes, I’m well aware of that one, happens every few months on various boxes. Will sing to the heavens and buy a round for the developers the day that one is figured out.

Any insight in to the Database errors?

Error: Failed to fetch from "config" table: sql: transaction has already been committed or rolled back

Also saw the same message with reference to the profiles and instances_profiles table.

There is a 10s timeout on each transaction:

The error you are getting looks to mean that its been 10s waiting to start the transaction, which suggests contention on the database or I/O.

Focusing on the DB part of the thread and assuming we are getting the errors due to the same reason:

Since a clean 5.3 does not exhibit these issues, I would look into what is supposed to migrate or change when updating from 5.2.

Regarding I/O contention, I have confirmed the issue occurs on a system sitting at 99,9% idle, no IOwait, with about 300000 IOPS at LXD’s disposal. I would aim for LXD to work smoothly even on a 100 IOPS HDD-backed host when under load, though. But in this case, I think it is something else.

One of the errors I saw was Rows are closed. Could that indicate the new refactored code can cause a race condition where some transactions fail - potentially by being too fast?

1 Like

Can you enable debug logging whilst on lxd 5.2 and then initiate a refresh to lxd 5.3 and once the errors start happening then provide the full log?

1 Like

The rows closed is likely because of the timeout kicking in.

sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon

Then the contents of /var/snap/lxd/common/lxd/logs/lxd.log please