I noticed that our containers' expired snapshots were not being purged.
I thought upgrading the snap to 5.3 (5.3-924be6a) might help, but now all of our containers are in the ERROR state.
lxc monitor --type=logging --pretty shows:
WARNING[2022-06-27T22:27:30-07:00] Transaction timed out. Retrying once err="Failed to begin transaction: context deadline exceeded" member=1
DEBUG [2022-06-27T22:27:30-07:00] Database error err="sql: transaction has already been committed or rolled back"
WARNING[2022-06-27T22:27:30-07:00] Failed to rollback transaction after error (Failed to fetch from "config" table: sql: Rows are closed): sql: transaction has already been committed or rolled back
DEBUG [2022-06-27T22:27:30-07:00] Database error err="Failed to fetch from \"config\" table: sql: Rows are closed"
DEBUG [2022-06-27T22:27:30-07:00] Dqlite: attempt 1: server 1: connected
DEBUG [2022-06-27T22:27:30-07:00] GetInstanceUsage started instance=container project=default
DEBUG [2022-06-27T22:27:30-07:00] GetInstanceUsage finished instance=container project=default
WARNING[2022-06-27T22:27:40-07:00] Transaction timed out. Retrying once err="Failed to begin transaction: context deadline exceeded" member=1
Is there any way to recover or rebuild the database?
I'm afraid to stop the other containers, since I just tried with a test one and now it's unable to start again:
user@srv5:~$ lxc stop u14
user@srv5:~$ lxc start u14
Error: Failed to run: zfs set mountpoint=legacy canmount=noauto data/lxc/containers/u14: umount: /var/snap/lxd/common/shmounts/storage-pools/data/containers/u14: no mount point specified.
cannot unmount '/var/snap/lxd/common/shmounts/storage-pools/data/containers/u14': umount failed
Try lxc info --show-log u14 for more info
user@srv5:~$ lxc info --show-log u14
Name: u14
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/06/14 17:25 PDT
Last Used: 2022/06/14 17:25 PDT
The latter can be worked around through nsenter, though if you can deal with the temporary downtime, a host reboot will get you to a much more reliable state.
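For reference, a minimal sketch of that workaround, assuming a standard snap install (the /run/snapd/ns/lxd.mnt namespace path is the usual location for a snap-installed LXD, and the mount path is taken from the error above):

# Enter the LXD snap's mount namespace and lazily unmount the stale shmounts entry
sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- umount -l /var/snap/lxd/common/shmounts/storage-pools/data/containers/u14

# Then retry the start
lxc start u14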
The former is very odd. Is your system experiencing a lot of load at the moment?
Did you try restarting the LXD daemon with systemctl reload snap.lxd.daemon?
Also take a quick look at df -h and check that you’re not running out of disk space.
Here's a glances screenshot showing the load; memory is a bit tight, but there is sufficient disk space. This load is fairly consistent on this machine. I did try systemctl reload snap.lxd.daemon, but the situation is the same.
I'm okay with losing the snapshots if there's a simple way to bring the containers out of ERROR.
I rebooted the server, removed the LXD snap, and reinstalled it. I attempted lxd recover, which imported the containers, but I still get the same error and none of the containers will start.
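For reference, the sequence was roughly the following (a sketch; lxd recover is interactive and prompts for the pool name, driver, and source):

sudo snap remove lxd    # the ZFS pool's datasets survive the snap removal
sudo snap install lxd
sudo lxd recover        # answer the prompts to scan the existing pool

Afterwards, lxc monitor still showed the same errors: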
DEBUG [2022-06-30T00:08:58-07:00] Database error err="Failed to fetch from \"config\" table: sql: Rows are closed"
DEBUG [2022-06-30T00:08:58-07:00] Database error err="sql: transaction has already been committed or rolled back"
DEBUG [2022-06-30T00:08:58-07:00] Dqlite: attempt 1: server 1: connected
WARNING[2022-06-30T00:08:59-07:00] Transaction timed out. Retrying once err="Failed to begin transaction: context deadline exceeded" member=1
WARNING[2022-06-30T00:08:59-07:00] Transaction timed out. Retrying once err="Failed to begin transaction: context deadline exceeded" member=1
WARNING[2022-06-30T00:08:59-07:00] Transaction timed out. Retrying once err="Failed to begin transaction: context deadline exceeded" member=1
DEBUG [2022-06-30T00:09:08-07:00] Database error err="Failed to begin transaction: context deadline exceeded"
ERROR [2022-06-30T00:09:08-07:00] Error loading storage pool err="Failed getting instance storage pool name: Failed to begin transaction: context deadline exceeded" instance=container_name instanceType=container project=default
When I try to edit a config, I receive an error about the instances_profiles table:
lxc config edit container_name
Config parsing error: Failed to write backup file: Failed to get snapshots: Failed to fetch from "instances_profiles" table: sql: transaction has already been committed or rolled back
When I try to start a container:
Error: Failed to begin transaction: context deadline exceeded
In the end, since I had to get the containers back online, I deleted all ZFS snapshots, uninstalled LXD, edited each container's backup.yaml to remove the snapshots, reinstalled LXD, and ran lxd recover. I'm back online.
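For anyone else in the same spot, a rough sketch of that sequence (the dataset names are illustrative for my data/lxc layout, so adjust to your pool; destroying snapshots is irreversible, so double-check the list first):

# 1. Destroy every ZFS snapshot under the containers dataset (irreversible!)
sudo zfs list -H -t snapshot -o name -r data/lxc/containers | xargs -r -n1 sudo zfs destroy

# 2. Remove the LXD snap; the pool's datasets remain on disk
sudo snap remove lxd

# 3. Mount each container dataset (LXD sets mountpoint=legacy) and
#    edit its backup.yaml so the snapshots list is empty
sudo mount -t zfs data/lxc/containers/u14 /mnt
sudo vi /mnt/backup.yaml    # set "snapshots: []" (or delete the entries)
sudo umount /mnt

# 4. Reinstall and recover
sudo snap install lxd
sudo lxd recover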
Which version of LXD did these errors occur on? I ran into some of the same DB and timeout messages, specifically on LXD 5.3. If your issues were on some other version, they're not related.
Thanks @johanehnberg. I believe they are related, as I didn't see the error below on 5.2, only on 5.3:
Error: Failed to fetch from "config" table: sql: transaction has already been committed or rolled back
We also saw the same message with reference to the profiles and instances_profiles tables.
After wiping the ZFS snapshots and editing backup.yaml to recover, the system seems stable for the moment on 5.3-924be6a.
However, that was an all-nighter and four hours of downtime.
We do have another box that is also on LXD 5.3 and it's running; however, lxc profile edit macvlan shows troubling signs:
Config parsing error: The following instances failed to update (profile change still saved):
- Project: default, Instance: streams: Failed to write backup file: Failed to run: zfs set mountpoint=legacy canmount=noauto data/containers/streams: umount: /var/snap/lxd/common/shmounts/storage-pools/data/containers/streams: no mount point specified.
cannot unmount '/var/snap/lxd/common/shmounts/storage-pools/data/containers/streams': umount failed
If it enters this error state, I will try your snap revert suggestion.
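In case it helps others, the revert itself would be just (snapd keeps the previous revision on disk, so this drops back to whatever was installed before 5.3, i.e. 5.2 in our case):

snap list --all lxd    # shows the installed revisions, including the previous one
sudo snap revert lxd   # switch back to the previously installed revision

As I understand it, reverting also holds automatic refreshes for that snap until you refresh it manually, so it won't immediately jump back to 5.3.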
The only difference I can think of is that in LXD 5.3, the changes @masnax has been working on mean that LXD performs more queries inside fewer transactions (for efficiency), so if the system is performing poorly for some reason, a per-transaction timeout may occur more frequently. I would need to check exactly what DB timeouts we have in place, though.
Thank you for this insight. I wonder, though, what kind of performance is needed, since we were unable to start even one container on 5.3 until we removed all ZFS snapshots, modified backup.yaml, and reinstalled.
Yes, I'm well aware of that one; it happens every few months on various boxes. I will sing to the heavens and buy a round for the developers the day it is figured out.
Any insight into the database errors?
Error: Failed to fetch from "config" table: sql: transaction has already been committed or rolled back
We also saw the same message with reference to the profiles and instances_profiles tables.
Focusing on the DB part of the thread, and assuming we are getting the errors for the same reason:
Since a clean 5.3 install does not exhibit these issues, I would look into what is supposed to migrate or change when upgrading from 5.2.
Regarding I/O contention: I have confirmed the issue occurs on a system sitting at 99.9% idle, with no iowait and about 300,000 IOPS at LXD's disposal. That said, I would expect LXD to work smoothly even on a 100-IOPS HDD-backed host under load. In this case, though, I think it is something else.
One of the errors I saw was Rows are closed. Could that indicate that the new refactored code can cause a race condition where some transactions fail, potentially by being too fast?