kpfa
(kpfa)
June 30, 2022, 6:39pm
15
Yes, I’m well aware of that one, happens every few months on various boxes. Will sing to the heavens and buy a round for the developers the day that one is figured out.
Any insight into the database errors?
Error: Failed to fetch from "config" table: sql: transaction has already been committed or rolled back
Also saw the same message with reference to the profiles and instances_profiles tables.
tomp
(Thomas Parrott)
June 30, 2022, 6:45pm
16
There is a 10s timeout on each transaction:
ctx, cancel := context.WithTimeout(ctx, time.Second*10)
defer cancel()

tx, err := db.BeginTx(ctx, nil)
if err != nil {
	// If there is a leftover transaction let's try to rollback,
	// we'll then retry again.
	if strings.Contains(err.Error(), "cannot start a transaction within a transaction") {
		_, _ = db.Exec("ROLLBACK")
	}

	return fmt.Errorf("Failed to begin transaction: %w", err)
}
The error you are getting looks to mean that it spent 10s waiting to start the transaction, which suggests contention on the database or I/O.
Focusing on the DB part of the thread and assuming we are getting the errors due to the same reason:
Since a clean 5.3 does not exhibit these issues, I would look into what is supposed to migrate or change when updating from 5.2.
Regarding I/O contention, I have confirmed the issue occurs on a system sitting at 99.9% idle, with no iowait and about 300,000 IOPS at LXD’s disposal. I would aim for LXD to work smoothly even on a 100 IOPS HDD-backed host under load, though. But in this case, I think it is something else.
One of the errors I saw was "Rows are closed". Could that indicate that the newly refactored code can cause a race condition where some transactions fail, potentially by being too fast?
tomp
(Thomas Parrott)
June 30, 2022, 7:21pm
18
Can you enable debug logging while on LXD 5.2, then initiate a refresh to LXD 5.3 and, once the errors start happening, provide the full log?
tomp
(Thomas Parrott)
June 30, 2022, 7:22pm
19
The "Rows are closed" error is likely caused by the timeout kicking in.
tomp
(Thomas Parrott)
June 30, 2022, 7:35pm
20
sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon
Then please share the contents of /var/snap/lxd/common/lxd/logs/lxd.log.
tomp
(Thomas Parrott)
June 30, 2022, 10:39pm
21
Also, please can you confirm the host OS and kernel version?
tomp
(Thomas Parrott)
June 30, 2022, 10:43pm
22
Are you still seeing the first container start and then others not on a fresh install? And are you still seeing a correlation between kernel versions and the issue?
Are you able to run a test without LUKS in the mix to rule that out as a contributor?
On an LXD install that is quite different from our main setup, I am not seeing this issue. It is used for CI and has the following key differences:
/var/snap/lxd on tmpfs (i.e. “ramdisk”)
LXD re-inits every boot (but uptimes can be quite long still)
No LUKS
There were most likely no containers when LXD switched from 5.2 to 5.3
At most there is normally only one container
Nested, i.e. this LXD runs in an LXD container
I will have to set up separate test servers for these procedures, but can’t do that today. I will track this thread to see if it is still needed when I get the time. Then I can also test with/without LUKS and other aspects to bisect.
In the meantime, I pinned our hosts to 5.2 so the matter is not pressing on our part.
@tomp Carrying on conversation from other thread:
Host OS version is all 20.04, kernel versions:
> for server in $servers; do echo $server; ssh $server uname -a; echo; done
es-hel-phys-2
Linux es-hel-phys-2 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
app-hel-phys-3
Linux app-hel-phys-3 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
app-hel-phys-8
Linux app-hel-phys-8 5.4.0-88-generic #99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
es-hel-phys-3
Linux es-hel-phys-3 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
app-hel-phys-2
Linux app-hel-phys-2 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
app-hel-phys-5
Linux app-hel-phys-5 5.4.0-88-generic #99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
es-hel-phys-1
Linux es-hel-phys-1 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
app-hel-phys-4
Linux app-hel-phys-4 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
app-hel-phys-7
Linux app-hel-phys-7 5.4.0-88-generic #99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
hetzner-inference-5-phys
Linux hetzner-inference-5-phys 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
hetzner-inference-8-phys
Linux hetzner-inference-8-phys 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
hetzner-inference-9-phys
Linux hetzner-inference-9-phys 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
hetzner-inference-staging-phys
Linux hetzner-inference-staging-phys 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
hetzner-inference-2-phys
Linux hetzner-inference-2-phys 5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
hetzner-inference-3-phys
Linux hetzner-inference-3-phys 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
hetzner-inference-4-phys
Linux hetzner-inference-4-phys 5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
hetzner-inference-7-phys
Linux hetzner-inference-7-phys 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
hetzner-inference-6-phys
Linux hetzner-inference-6-phys 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
monitoring
Linux monitoring 5.15.0-27-generic #28-Ubuntu SMP Thu Apr 14 04:55:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
app-hel-phys-6
Linux app-hel-phys-6 5.4.0-88-generic #99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
hetzner-inference-1-phys
Linux hetzner-inference-1-phys 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
app-ovh-phys-1
Linux app-ovh-phys-1 5.4.0-117-generic #132-Ubuntu SMP Thu Jun 2 00:39:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
app-ovh-phys-2
Linux app-ovh-phys-2 5.4.0-117-generic #132-Ubuntu SMP Thu Jun 2 00:39:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
app-ovh-phys-3
Linux app-ovh-phys-3 5.4.0-117-generic #132-Ubuntu SMP Thu Jun 2 00:39:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
app-ovh-phys-4
Linux app-ovh-phys-4 5.4.0-117-generic #132-Ubuntu SMP Thu Jun 2 00:39:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
app-ovh-phys-5
Linux app-ovh-phys-5 5.4.0-117-generic #132-Ubuntu SMP Thu Jun 2 00:39:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
app-ovh-phys-6
Linux app-ovh-phys-6 5.4.0-117-generic #132-Ubuntu SMP Thu Jun 2 00:39:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
app-staging-phys-1
Linux app-staging-phys-1 5.13.0-41-generic #46~20.04.1-Ubuntu SMP Wed Apr 20 13:16:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
app-ovh-phys-7
Linux app-ovh-phys-7 5.4.0-121-generic #137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
app-ovh-phys-8
Linux app-ovh-phys-8 5.4.0-121-generic #137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
es-ovh-phys-1
Linux es-ovh-phys-1 5.4.0-121-generic #137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
es-ovh-phys-2
Linux es-ovh-phys-2 5.4.0-121-generic #137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
es-ovh-phys-3
Linux es-ovh-phys-3 5.4.0-121-generic #137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
app-hel-phys-1
Linux app-hel-phys-1 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Do you want log files from all the hosts, or just the database hosts?
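To make any correlation between kernel versions and affected hosts easier to spot, the `uname -a` output above could be grouped by kernel release. The following is a minimal sketch; `kernelsByRelease` is a hypothetical helper, and it assumes the input has the same shape as the listing above (a bare host name line followed by a line starting with `Linux`).

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// kernelsByRelease groups host names by kernel release, given lines of
// `uname -a` output of the form "Linux <host> <release> ...".
func kernelsByRelease(unameOutput string) map[string][]string {
	groups := make(map[string][]string)
	sc := bufio.NewScanner(strings.NewReader(unameOutput))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Skip the bare host name lines echoed before each uname call.
		if len(fields) < 3 || fields[0] != "Linux" {
			continue
		}
		groups[fields[2]] = append(groups[fields[2]], fields[1])
	}
	return groups
}

func main() {
	// Three entries copied from the listing above.
	sample := `es-hel-phys-2
Linux es-hel-phys-2 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
app-hel-phys-8
Linux app-hel-phys-8 5.4.0-88-generic #99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
es-hel-phys-3
Linux es-hel-phys-3 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux`

	groups := kernelsByRelease(sample)

	// Sort releases so the summary is stable between runs.
	releases := make([]string, 0, len(groups))
	for release := range groups {
		releases = append(releases, release)
	}
	sort.Strings(releases)

	for _, release := range releases {
		hosts := groups[release]
		fmt.Printf("%s (%d): %s\n", release, len(hosts), strings.Join(hosts, ", "))
	}
}
```

Marking which hosts in each group showed the database errors would then show quickly whether the kernel is a factor.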
tomp
(Thomas Parrott)
July 1, 2022, 9:27am
26
Thanks, we could really do with the debug logs from the affected systems.
tomp
(Thomas Parrott)
July 1, 2022, 12:57pm
27
@masnax and I have tried reproducing this error, to no avail.
I just tried a freshly installed Ubuntu 20.04 system (kernel 5.4.0-121-generic) with a ZFS pool on top of an LVM logical volume (for added layering), then installed LXD from the 5.2/stable channel, configured LXD to use the logical volume as the block device of the ZFS pool, and launched 5 ubuntu/focal containers.
I tried refreshing several times between LXD 5.2 and LXD 5.3 to try and reproduce it, but no luck.
With them still running, I then refreshed to latest/stable, which installed 5.3-924be6a, and ran lxc ls. The container list with IPs and status returned quickly as normal.
tomp
(Thomas Parrott)
July 1, 2022, 1:26pm
28
Can I also ask whether anyone here who is affected by this is using disk devices on their containers to pass paths from the host into the container?
In our case the errors were observed on hosts with containers using disk devices and on hosts without any such container.
tomp
(Thomas Parrott)
July 1, 2022, 1:32pm
30
It’s a strange one. There’s clearly something you all have in common, but I can’t figure out what it is yet.
I’ve just done a LXD 5.2 to LXD 5.3 cluster upgrade too and that went fine.
tomp
(Thomas Parrott)
July 1, 2022, 2:27pm
32
Thanks for this.
It looks like app-hel-phys-4 is the leader; can you get the output of lxc cluster ls from that member to confirm?
Also, can you double check that all members are online and running the same LXD version using snap info lxd, as I saw an instance of this:
time="2022-06-30T14:01:37Z" level=warning msg="Could not notify all nodes of database upgrade" err="failed to notify peer app-hel-phys-2:8443: failed to notify node about completed upgrade: Patch \"https://app-hel-phys-2:8443/internal/database\": Unable to connect to: app-hel-phys-2:8443 ([dial tcp 10.145.8.216:8443: connect: connection refused])"
tomp
(Thomas Parrott)
July 1, 2022, 2:31pm
33
Looks like there could be some DNS issues too:
time="2022-07-01T13:30:51Z" level=warning msg="Failed adding member event listener client" err="lookup es-ovh-phys-3 on 213.186.33.99:53: no such host" local="10.145.96.163:8443" remote="es-ovh-phys-3:8443"
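When going through a long lxd.log, it can help to pull out only the warning and error entries like the two quoted above. Here is a minimal sketch that assumes every log line follows the same `key=value` / `key="quoted value"` layout as those two lines; `parseLogLine` and `warningsAndErrors` are hypothetical helpers written for this example and are not part of LXD.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseLogLine splits a `key=value key="quoted value"` line into a map.
func parseLogLine(line string) map[string]string {
	fields := make(map[string]string)
	for i := 0; i < len(line); {
		if line[i] == ' ' { // skip separators between pairs
			i++
			continue
		}
		eq := strings.IndexByte(line[i:], '=')
		if eq < 0 {
			break
		}
		key := line[i : i+eq]
		i += eq + 1
		if i < len(line) && line[i] == '"' {
			// Quoted value: find the closing quote, honouring backslash escapes.
			end := i + 1
			for end < len(line) && line[end] != '"' {
				if line[end] == '\\' {
					end++
				}
				end++
			}
			if end >= len(line) { // unterminated quote: keep the rest as-is
				fields[key] = line[i+1:]
				break
			}
			raw := line[i : end+1]
			val, err := strconv.Unquote(raw)
			if err != nil {
				val = raw[1 : len(raw)-1]
			}
			fields[key] = val
			i = end + 1
			continue
		}
		// Bare value: runs until the next space or the end of the line.
		end := strings.IndexByte(line[i:], ' ')
		if end < 0 {
			end = len(line) - i
		}
		fields[key] = line[i : i+end]
		i += end
	}
	return fields
}

// warningsAndErrors returns the msg field of every warning or error entry.
func warningsAndErrors(log string) []string {
	var msgs []string
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		f := parseLogLine(sc.Text())
		if f["level"] == "warning" || f["level"] == "error" {
			msgs = append(msgs, f["msg"])
		}
	}
	return msgs
}

func main() {
	// The two log lines quoted in this thread.
	log := `time="2022-06-30T14:01:37Z" level=warning msg="Could not notify all nodes of database upgrade" err="failed to notify peer app-hel-phys-2:8443: failed to notify node about completed upgrade: Patch \"https://app-hel-phys-2:8443/internal/database\": Unable to connect to: app-hel-phys-2:8443 ([dial tcp 10.145.8.216:8443: connect: connection refused])"
time="2022-07-01T13:30:51Z" level=warning msg="Failed adding member event listener client" err="lookup es-ovh-phys-3 on 213.186.33.99:53: no such host" local="10.145.96.163:8443" remote="es-ovh-phys-3:8443"`

	for _, msg := range warningsAndErrors(log) {
		fmt.Println(msg)
	}
}
```

In practice the input would be read from the log file rather than a string literal; the string is used here only to keep the sketch self-contained.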
Output
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| NAME | URL | ROLES | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE | MESSAGE |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-1 | https://app-hel-phys-1:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-2 | https://app-hel-phys-2:8443 | database-standby | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-3 | https://app-hel-phys-3:8443 | database-leader | x86_64 | default | | ONLINE | Fully operational |
| | | database | | | | | |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-4 | https://app-hel-phys-4:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-5 | https://app-hel-phys-5:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-6 | https://app-hel-phys-6:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-7 | https://app-hel-phys-7:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-hel-phys-8 | https://app-hel-phys-8:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-1 | https://app-ovh-phys-1:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-2 | https://app-ovh-phys-2:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-3 | https://app-ovh-phys-3:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-4 | https://app-ovh-phys-4:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-5 | https://app-ovh-phys-5:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-6 | https://app-ovh-phys-6:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-7 | https://app-ovh-phys-7:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| app-ovh-phys-8 | https://app-ovh-phys-8:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| es-hel-phys-1 | https://es-hel-phys-1:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| es-hel-phys-2 | https://es-hel-phys-2:8443 | database-standby | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| es-hel-phys-3 | https://es-hel-phys-3:8443 | database-standby | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| es-ovh-phys-2 | https://es-ovh-phys-2:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| es-ovh-phys-3 | https://es-ovh-phys-3:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-1-phys | https://hetzner-inference-1-phys:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-2-phys | https://hetzner-inference-2-phys:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-3-phys | https://hetzner-inference-3-phys:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-4-phys | https://hetzner-inference-4-phys:8443 | database | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-5-phys | https://hetzner-inference-5-phys:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-6-phys | https://hetzner-inference-6-phys:8443 | database-standby | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-7-phys | https://hetzner-inference-7-phys:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-8-phys | https://hetzner-inference-8-phys:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-9-phys | https://hetzner-inference-9-phys:8443 | database | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| hetzner-inference-staging-phys | https://hetzner-inference-staging-phys:8443 | database-standby | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| monitoring | https://monitoring:8443 | | x86_64 | default | | ONLINE | Fully operational |
+--------------------------------+---------------------------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
Yep they are
tomp:
as I saw an instance of this:
time="2022-06-30T14:01:37Z" level=warning msg="Could not notify all nodes of database upgrade" err="failed to notify peer app-hel-phys-2:8443: failed to notify node about completed upgrade: Patch \"https://app-hel-phys-2:8443/internal/database\": Unable to connect to: app-hel-phys-2:8443 ([dial tcp 10.145.8.216:8443: connect: connection refused])"
That’s odd. I just tested from another host, and it works fine:
> nc -v app-hel-phys-2 8443
Connection to app-hel-phys-2 8443 port [tcp/*] succeeded!