Timeouts after upgrading to LXD 5.3

johanehnberg · June 30, 2022, 7:49am

After systems upgraded to LXD 5.3, we are starting to see LXD failing with timeouts.

Errors

$ lxc list a1:
+---------------+-------+------+------+-----------+-----------+
|     NAME      | STATE | IPV4 | IPV6 |   TYPE    | SNAPSHOTS |
+---------------+-------+------+------+-----------+-----------+
| c1            | ERROR |      |      | CONTAINER | 0         |
+---------------+-------+------+------+-----------+-----------+
| c2            | ERROR |      |      | CONTAINER | 0         |
+---------------+-------+------+------+-----------+-----------+
| c3            | ERROR |      |      | CONTAINER | 0         |
+---------------+-------+------+------+-----------+-----------+

Log entries of interest

Jun 30 01:39:13 a1 kernel: [4116620.085260] audit: type=1400 audit(1656553153.682:25959): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-control_<var-snap-lxd-common-lxd>" profile="/snap/snapd/16010/usr/lib/snapd/snap-confine" name="/tmp/#173608" pid=4088214 comm="snap-confine" requested_mask="ra" denied_mask="ra" fsuid=1001000 ouid=1000000
Jun 30 01:39:13 a1 kernel: [4116620.094074] audit: type=1400 audit(1656553153.690:25960): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-control_<var-snap-lxd-common-lxd>" profile="snap.lxd.lxc" name="/apparmor/.null" pid=4088214 comm="snap-exec" requested_mask="wr" denied_mask="wr" fsuid=1001000 ouid=0
Jun 30 05:24:42 a1 lxd.daemon[1874915]: time="2022-06-30T05:24:42Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
Jun 30 05:24:42 a1 lxd.daemon[1874915]: time="2022-06-30T05:24:42Z" level=warning msg="Failed to rollback transaction after error (Failed to fetch from \"config\" table: sql: Rows are closed): sql: transaction has already been committed or rolled back"
Jun 30 05:25:02 a1 lxd.daemon[1874915]: time="2022-06-30T05:25:02Z" level=error msg="Error loading storage pool" err="Failed getting instance storage pool name: Failed to begin transaction: context deadline exceeded" instance=odoo instanceType=container project=default

Tests and observations so far

Same behaviour over the network with a remote lxc and locally
Containers and their services run normally
Restarting AppArmor service sudo service snapd.apparmor restart has no effect
Sometimes the commands actually run properly but even list commands take longer than before to return
Somehow the frequency of the errors seem to correlate with kernel age or uptime. A host with kernel 5.4.0-91-generic does not seem to be seriously impacted (just a bit of slowness) while a freshly rebooted host on 5.4.0-113-generic fails on every command.
Rebooting does not help
systemctl status shows all in the green
Happens regardless of current system load; observed on 99.9% idle with no IOwait

Key environment details

LXD using ZFS on LUKS on SSD
Ubuntu 20.04

Any clues on where we could debug further or fix if it is the environment at fault?

johanehnberg · June 30, 2022, 8:12am

A further observation is that after a reboot, it seems only the alphabetically first container seems to start (we are using startup delay to stagger).

Some more example results

$ lxc start a1:c2
Error: Failed to get snapshots: Failed to fetch from "config" table: sql: Rows are closed
Try `lxc info --show-log a1:c2` for more info
$ lxc info --show-log a1:c2
Error: Failed to fetch from "config" table: sql: transaction has already been committed or rolled back
$ lxc info --show-log a1:c1
Error: Failed to fetch from "instances_profiles" table: sql: transaction has already been committed or rolled back

johanehnberg · June 30, 2022, 8:23am

I confirmed that the issue is indeed with LXD version 5.3. Going back to 5.2 makes everything work again.

Workaround
sudo snap revert lxd

theblazehen · June 30, 2022, 9:15pm

Confirming issue here as well. Running in a lxd cluster

@tomp is it possible to do a revert in a cluster environment?

tomp · June 30, 2022, 10:11pm

Please can you get us the debug logs as per

We are discussing the issue in that thread BTW.

tomp · June 30, 2022, 10:39pm

Also please can you confirm host os and kernel version .

johanehnberg · July 1, 2022, 6:35am

I went through hosts and tested occurrence. Here is an overview of kernel versions and occurrence:

5.4.0-113-generic sometimes works, but not after a reboot
5.4.0-110-generic some work
5.4.0-100-generic some work
5.4.0-91-generic works every time (only one host though)

And again, it may be about uptime or process legacy just as well since in our case those go hand-in-hand with kernel versions.

johanehnberg · July 1, 2022, 6:41am

I think this gives us a clue:

$ time lxc list
<snip>
real	0m1.343s
user	0m0.600s
sys	0m0.124s
$ lxc --version
5.2

$ time lxc list
<snip>
real	0m49.369s
user	0m0.366s
sys	0m0.799s
$ lxc --version
5.3

top shows lxd process hovering around 175-215% for the whole 49 seconds when doing lxc list on 5.3. The process load is on user with a slight bump on system time.

tomp · July 1, 2022, 1:31pm

Sadly not in this case due to DB schema or API changes (I just tried it).

If you could get us the debug logs that would be great. Thanks