Database error: "sql: transaction has already been committed or rolled back"

I am, yes

Thanks for this.

It looks like app-hel-phys-4 is the leader. Can you get the output of lxc cluster ls from that member to confirm?

Also, can you double-check that all members are online and running the same LXD version using snap info lxd? I saw an instance of this:

time="2022-06-30T14:01:37Z" level=warning msg="Could not notify all nodes of database upgrade" err="failed to notify peer app-hel-phys-2:8443: failed to notify node about completed upgrade: Patch \"https://app-hel-phys-2:8443/internal/database\": Unable to connect to: app-hel-phys-2:8443 ([dial tcp 10.145.8.216:8443: connect: connection refused])"

Looks like there could be some DNS issues too:

time="2022-07-01T13:30:51Z" level=warning msg="Failed adding member event listener client" err="lookup es-ovh-phys-3 on 213.186.33.99:53: no such host" local="10.145.96.163:8443" remote="es-ovh-phys-3:8443"
Output
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
|              NAME              |                     URL                     |      ROLES       | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-1                 | https://app-hel-phys-1:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-2                 | https://app-hel-phys-2:8443                 | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-3                 | https://app-hel-phys-3:8443                 | database-leader  | x86_64       | default        |             | ONLINE | Fully operational |
|                                |                                             | database         |              |                |             |        |                   |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-4                 | https://app-hel-phys-4:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-5                 | https://app-hel-phys-5:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-6                 | https://app-hel-phys-6:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-7                 | https://app-hel-phys-7:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-8                 | https://app-hel-phys-8:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-1                 | https://app-ovh-phys-1:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-2                 | https://app-ovh-phys-2:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-3                 | https://app-ovh-phys-3:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-4                 | https://app-ovh-phys-4:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-5                 | https://app-ovh-phys-5:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-6                 | https://app-ovh-phys-6:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-7                 | https://app-ovh-phys-7:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-8                 | https://app-ovh-phys-8:8443                 |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| es-hel-phys-1                  | https://es-hel-phys-1:8443                  |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| es-hel-phys-2                  | https://es-hel-phys-2:8443                  | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| es-hel-phys-3                  | https://es-hel-phys-3:8443                  | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| es-ovh-phys-2                  | https://es-ovh-phys-2:8443                  |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| es-ovh-phys-3                  | https://es-ovh-phys-3:8443                  |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-1-phys       | https://hetzner-inference-1-phys:8443       |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-2-phys       | https://hetzner-inference-2-phys:8443       |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-3-phys       | https://hetzner-inference-3-phys:8443       |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-4-phys       | https://hetzner-inference-4-phys:8443       | database         | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-5-phys       | https://hetzner-inference-5-phys:8443       |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-6-phys       | https://hetzner-inference-6-phys:8443       | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-7-phys       | https://hetzner-inference-7-phys:8443       |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-8-phys       | https://hetzner-inference-8-phys:8443       |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-9-phys       | https://hetzner-inference-9-phys:8443       | database         | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-staging-phys | https://hetzner-inference-staging-phys:8443 | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| monitoring                     | https://monitoring:8443                     |                  | x86_64       | default        |             | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+

Yep, they are.

That’s odd. I just tested from another host, and it works fine:

> nc -v app-hel-phys-2 8443
Connection to app-hel-phys-2 8443 port [tcp/*] succeeded!

OK, so in your case your cluster appears to be up and running, but I am surprised it has ever worked well, because it's spread over 3 different geographical locations (from what I'm interpreting, Helsinki, Germany and Spain).

During the upgrade the leader is likely to have changed to a different member, which may mean it has moved to a different country, giving very different query performance from the perspective of other members that used to be close to the leader.

Do you happen to know where the leader was before the upgrade?

What is the latency between each of the sites?

Here’s a discussion around cluster performance over WAN setups:

In LXD clusters all queries (reads and writes) go to the leader from all other members, and then writes have to be replicated to the voter and standby members. So if the members are spread over a wide geographical area, this can really slow things down.
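
Since every query from a non-leader member pays at least one round trip to the leader, inter-site latency adds up quickly. A rough way to compare round-trip times to one member per site (a sketch; the three host names are just examples picked from your table):

# Average round-trip time to one representative member in each site.
for h in app-hel-phys-3 app-ovh-phys-1 hetzner-inference-4-phys; do
    echo -n "$h: "
    ping -c 5 -q "$h" | tail -1
done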

Yeah, we’ve had a discussion about that before. In general it’s been reliable but slow since Heartbeat timeouts after upgrade to 4.18 - #38 by tomp

I’m afraid not

Between 20 and 35 ms

Which is the busiest site in terms of changes/activity?

In terms of LXD DB changes? That would have been our staging server (now removed) trying to do a lxd recover when I posted in the thread initially.

I’ve now spun that out into a separate LXD instance, and the rest of the LXD cluster appears to be running perfectly fine now - I can create new containers etc.

OK, that's good to hear. If you want to try moving the leader around, running sudo systemctl reload snap.lxd.daemon on the current leader will cause it to step down and a new leader will be promoted, so you can see which placement gives the best performance for the most members.
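
After the reload, you can confirm which member took over by checking the roles column again:

# Show the member currently holding the database-leader role.
lxc cluster ls | grep database-leader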

Yeah, I already tried that last night, thanks.

OK, one solved, thanks @theblazehen.

@kpfa @johanehnberg, as soon as you can get some logs, let me know.

Thanks

Sorry, it may take a while.

That said, one more thing came to my mind about the correlation: the affected hosts were all installed between May 2020 and March 2021, while the unaffected host was installed in October 2021. (The kernel versions correlate inversely, by coincidence.) That means it may be worth testing an upgrade from whatever LXD version (and potentially other packages) was current back when 20.04 was released, or in March 2021.

I also observed that 5.3 had caused minor data corruption. When migrating one of the hosts back to 5.2, one of its containers would no longer start up, as described in Lxc snapshot and lxc start Error: Instance snapshot record count doesn't match instance snapshot volume record count. The container had obviously started fine on 5.2 before, so the inconsistency was introduced during its brief encounter with 5.3.

The snapshot issue was likely introduced on that container a long time ago, and perhaps the container was not restarted since LXD 5.2 (which added consistency checks).

Unless you know it was restarted after LXD 5.2 was applied? There's a limited patch in LXD 5.3 to try and repair the missing records where possible, but it's not possible in all cases. The patch never deletes records, though.

See Lxc snapshot and lxc start Error: Instance snapshot record count doesn't match instance snapshot volume record count - #53 by tomp for more discussion.

So as not to confuse this thread, please can you post any further details about that on the thread you linked to, thanks.

Good point. That particular host was restarted on 22 May, i.e. some days before LXD 5.2 was introduced. As such, the issue could not have been triggered even if it was present.

We have a working theory on what is causing this, along with a pull request to restore the LXD 5.2 performance, which we are currently testing:

Our performance metrics show the time to create 128 containers has returned to pre-LXD 5.3 levels (the regression was caused by an increase in the time it takes to do an instance list, which is required for instance device validation):

I think I am affected by a very similar issue to the one described in this thread. However, LXD 5.4 (which contains the above-mentioned pull request) doesn't solve this for me.

My LXD server runs seven containers, of which I use about three almost daily. These were running until I granted the server a well-deserved reboot about two days ago. I noticed that before the reboot, lxc ls was taking a very long time and already showed the ERROR state on all containers. However, I thought the reboot that I had scheduled anyway would fix the situation. If only I had known…

If I try to start a container, the error message reads Error: Failed to get snapshots: Failed to fetch from "instance_snapshot_config" table: sql: Rows are closed. So for me it is a table related to snapshots that seems to time out.
I thought this could be related to the fact that I create daily snapshots for backup purposes, which I used to transfer to another LXD instance overnight. I have not yet implemented a deletion process for older snapshots, so I am sure that a few hundred per container will have accumulated by now.
The backup process has not been working for a while but still sends nightly emails, so I can tell that since June 30th the process errors out with the message Error: Failed to get snapshots: Failed to fetch from "config" table: sql: Rows are closed. Sometimes it will show Error: Failed to get snapshots: Failed to fetch from "instances_profiles" table: sql: transaction has already been committed or rolled back instead.

This is around the time that this thread was started and LXD 5.4 was released, so the issue was likely also introduced with that version.

If you need any more information, please let me know! I would love to have my containers back up and running as fast as possible :).

This issue was introduced with LXD 5.3 due to the database generator refactor using an inefficient query plan when listing instances and snapshots. The instance list issue was solved in LXD 5.4, although the instance snapshot list inefficiency still remains, and I will be working on a fix next.

Is there anything I could do about this? Manually delete older snapshots? Downgrade to LXD 5.2 using snap? Could this break anything?

I am mainly interested in getting the containers back up and running, which currently does not work because the startup apparently queries the snapshot database.

Reducing the number of snapshots using lxc delete <instance>/<snapshot> could help.
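
For example, a minimal cleanup sketch for a single instance, assuming jq and GNU coreutils are available and that the API lists snapshots oldest-first; mycontainer and keep=5 are placeholders to adjust:

# Keep only the newest $keep snapshots of $instance and delete the rest.
instance=mycontainer
keep=5
lxc query "/1.0/instances/${instance}/snapshots" \
  | jq -r '.[]' \
  | sed 's|.*/||' \
  | head -n -"$keep" \
  | while read -r snap; do
      lxc delete "${instance}/${snap}"
    done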