I noticed that our containers' expired snapshots were not being purged.
I thought upgrading the snap to 5.3 (5.3-924be6a) might help, but now all of our containers are in the ERROR state.
lxc monitor --type=logging --pretty shows:
WARNING[2022-06-27T22:27:30-07:00] Transaction timed out. Retrying once err="Failed to begin transaction: context deadline exceeded" member=1
DEBUG [2022-06-27T22:27:30-07:00] Database error err="sql: transaction has already been committed or rolled back"
WARNING[2022-06-27T22:27:30-07:00] Failed to rollback transaction after error (Failed to fetch from "config" table: sql: Rows are closed): sql: transaction has already been committed or rolled back
DEBUG [2022-06-27T22:27:30-07:00] Database error err="Failed to fetch from \"config\" table: sql: Rows are closed"
DEBUG [2022-06-27T22:27:30-07:00] Dqlite: attempt 1: server 1: connected
DEBUG [2022-06-27T22:27:30-07:00] GetInstanceUsage started instance=container project=default
DEBUG [2022-06-27T22:27:30-07:00] GetInstanceUsage finished instance=container project=default
WARNING[2022-06-27T22:27:40-07:00] Transaction timed out. Retrying once err="Failed to begin transaction: context deadline exceeded" member=1
Is there any way to recover or rebuild the database?
I'm afraid to stop the other containers, since I just tried with a test one and now it's unable to start again:
user@srv5:~$ lxc stop u14
user@srv5:~$ lxc start u14
Error: Failed to run: zfs set mountpoint=legacy canmount=noauto data/lxc/containers/u14: umount: /var/snap/lxd/common/shmounts/storage-pools/data/containers/u14: no mount point specified.
cannot unmount '/var/snap/lxd/common/shmounts/storage-pools/data/containers/u14': umount failed
Try lxc info --show-log u14 for more info
user@srv5:~$ lxc info --show-log u14
Name: u14
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2022/06/14 17:25 PDT
Last Used: 2022/06/14 17:25 PDT
The latter can be worked around through nsenter, though if you can deal with the temporary downtime, a host reboot will get you to a much more reliable state.
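For reference, a minimal sketch of that workaround, assuming a standard snap install (the /run/snapd/ns/lxd.mnt namespace path is the usual location for a snap-installed LXD, and the mount path is taken from the error above):

# Enter the LXD snap's mount namespace and lazily unmount the stale shmounts entry
sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- umount -l /var/snap/lxd/common/shmounts/storage-pools/data/containers/u14

# Then retry the start
lxc start u14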
The former is very odd. Is your system experiencing a lot of load at the moment?
Did you try restarting the LXD daemon with systemctl reload snap.lxd.daemon?
Also take a quick look at df -h and check that you’re not running out of disk space.
Here's a glances screenshot showing the load; memory is a bit tight, but there is sufficient disk space. This load is fairly consistent on this machine. I did try systemctl reload snap.lxd.daemon, but the situation is the same.
I'm okay with losing the snapshots if there's a simple way to bring the containers out of ERROR.
I rebooted the server, removed the LXD snap, and reinstalled it. I attempted lxd recover, which imported the containers, but I still get the same error and none of the containers will start.
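For reference, the sequence was roughly the following (a sketch; lxd recover is interactive and prompts for the pool name, driver, and source):

sudo snap remove lxd    # the ZFS pool's datasets survive the snap removal
sudo snap install lxd
sudo lxd recover        # answer the prompts to scan the existing pool

Afterwards, lxc monitor still showed the same errors: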
DEBUG [2022-06-30T00:08:58-07:00] Database error err="Failed to fetch from \"config\" table: sql: Rows are closed"
DEBUG [2022-06-30T00:08:58-07:00] Database error err="sql: transaction has already been committed or rolled back"
DEBUG [2022-06-30T00:08:58-07:00] Dqlite: attempt 1: server 1: connected
WARNING[2022-06-30T00:08:59-07:00] Transaction timed out. Retrying once err="Failed to begin transaction: context deadline exceeded" member=1
WARNING[2022-06-30T00:08:59-07:00] Transaction timed out. Retrying once err="Failed to begin transaction: context deadline exceeded" member=1
WARNING[2022-06-30T00:08:59-07:00] Transaction timed out. Retrying once err="Failed to begin transaction: context deadline exceeded" member=1
DEBUG [2022-06-30T00:09:08-07:00] Database error err="Failed to begin transaction: context deadline exceeded"
ERROR [2022-06-30T00:09:08-07:00] Error loading storage pool err="Failed getting instance storage pool name: Failed to begin transaction: context deadline exceeded" instance=container_name instanceType=container project=default
When I try to edit a config, I receive an error about the instances_profiles table:
lxc config edit container_name
Config parsing error: Failed to write backup file: Failed to get snapshots: Failed to fetch from "instances_profiles" table: sql: transaction has already been committed or rolled back
When I try to start a container:
Error: Failed to begin transaction: context deadline exceeded
In the end, since I had to get the containers back online, I deleted all ZFS snapshots, uninstalled LXD, edited each container's backup.yaml to remove the snapshots, reinstalled LXD, and ran lxd recover. I'm back online.
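For anyone else in the same spot, a rough sketch of that sequence (the dataset names are illustrative for my data/lxc layout, so adjust to your pool; destroying snapshots is irreversible, so double-check the list first):

# 1. Destroy every ZFS snapshot under the containers dataset (irreversible!)
sudo zfs list -H -t snapshot -o name -r data/lxc/containers | xargs -r -n1 sudo zfs destroy

# 2. Remove the LXD snap; the pool's datasets remain on disk
sudo snap remove lxd

# 3. Mount each container dataset (LXD sets mountpoint=legacy) and
#    edit its backup.yaml so the snapshots list is empty
sudo mount -t zfs data/lxc/containers/u14 /mnt
sudo vi /mnt/backup.yaml    # set "snapshots: []" (or delete the entries)
sudo umount /mnt

# 4. Reinstall and recover
sudo snap install lxd
sudo lxd recover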
Which version of LXD did these errors occur on? I ran into some of the same DB and timeout messages, specifically on LXD 5.3. If your issues were on some other version, they're not related.
Thanks @johanehnberg. I believe they are related, as I didn't see the error below on 5.2, only on 5.3:
Error: Failed to fetch from "config" table: sql: transaction has already been committed or rolled back
We also saw the same message with reference to the profiles and instances_profiles tables.
After wiping the ZFS snapshots and editing backup.yaml to recover, the system seems stable for the moment on 5.3-924be6a.
However, that was an all-nighter and four hours of downtime.
We do have another box that is also on LXD 5.3 and it's running; however, lxc profile edit macvlan shows troubling signs:
Config parsing error: The following instances failed to update (profile change still saved):
- Project: default, Instance: streams: Failed to write backup file: Failed to run: zfs set mountpoint=legacy canmount=noauto data/containers/streams: umount: /var/snap/lxd/common/shmounts/storage-pools/data/containers/streams: no mount point specified.
cannot unmount '/var/snap/lxd/common/shmounts/storage-pools/data/containers/streams': umount failed
If it enters this error state, I will try your snap revert suggestion.
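In case it helps others, the revert itself would be just (snapd keeps the previous revision on disk, so this drops back to whatever was installed before 5.3, i.e. 5.2 in our case):

snap list --all lxd    # shows the installed revisions, including the previous one
sudo snap revert lxd   # switch back to the previously installed revision

As I understand it, reverting also holds automatic refreshes for that snap until you refresh it manually, so it won't immediately jump back to 5.3.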
The only difference I can think of is that in LXD 5.3, the changes @masnax has been working on mean that LXD performs more queries inside fewer transactions (for efficiency), so if the system is performing poorly for some reason, a per-transaction timeout may occur more frequently. I would need to check exactly what DB timeouts we have in place, though.
Thank you for this insight. I wonder, though, what kind of performance is needed, since we were unable to start even one container on 5.3 until we removed all ZFS snapshots, modified backup.yaml, and reinstalled.
Yes, I'm well aware of that one; it happens every few months on various boxes. I will sing to the heavens and buy a round for the developers the day it is figured out.
Any insight into the database errors?
Error: Failed to fetch from "config" table: sql: transaction has already been committed or rolled back
We also saw the same message with reference to the profiles and instances_profiles tables.
Focusing on the DB part of the thread, and assuming we are getting the errors for the same reason:
Since a clean 5.3 install does not exhibit these issues, I would look into what is supposed to migrate or change when upgrading from 5.2.
Regarding I/O contention: I have confirmed the issue occurs on a system sitting at 99.9% idle, with no iowait and about 300,000 IOPS at LXD's disposal. That said, I would expect LXD to work smoothly even on a 100-IOPS HDD-backed host under load. In this case, though, I think it is something else.
One of the errors I saw was Rows are closed. Could that indicate that the new refactored code can cause a race condition where some transactions fail, potentially by being too fast?