After systems upgraded to LXD 5.3, we are starting to see LXD failing with timeouts.
Errors
$ lxc list a1:
+---------------+-------+------+------+-----------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+---------------+-------+------+------+-----------+-----------+
| c1 | ERROR | | | CONTAINER | 0 |
+---------------+-------+------+------+-----------+-----------+
| c2 | ERROR | | | CONTAINER | 0 |
+---------------+-------+------+------+-----------+-----------+
| c3 | ERROR | | | CONTAINER | 0 |
+---------------+-------+------+------+-----------+-----------+
Log entries of interest
Jun 30 01:39:13 a1 kernel: [4116620.085260] audit: type=1400 audit(1656553153.682:25959): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-control_<var-snap-lxd-common-lxd>" profile="/snap/snapd/16010/usr/lib/snapd/snap-confine" name="/tmp/#173608" pid=4088214 comm="snap-confine" requested_mask="ra" denied_mask="ra" fsuid=1001000 ouid=1000000
Jun 30 01:39:13 a1 kernel: [4116620.094074] audit: type=1400 audit(1656553153.690:25960): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-control_<var-snap-lxd-common-lxd>" profile="snap.lxd.lxc" name="/apparmor/.null" pid=4088214 comm="snap-exec" requested_mask="wr" denied_mask="wr" fsuid=1001000 ouid=0
Jun 30 05:24:42 a1 lxd.daemon[1874915]: time="2022-06-30T05:24:42Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
Jun 30 05:24:42 a1 lxd.daemon[1874915]: time="2022-06-30T05:24:42Z" level=warning msg="Failed to rollback transaction after error (Failed to fetch from \"config\" table: sql: Rows are closed): sql: transaction has already been committed or rolled back"
Jun 30 05:25:02 a1 lxd.daemon[1874915]: time="2022-06-30T05:25:02Z" level=error msg="Error loading storage pool" err="Failed getting instance storage pool name: Failed to begin transaction: context deadline exceeded" instance=odoo instanceType=container project=default
Tests and observations so far
- Same behaviour over the network with a remote lxc and locally
- Containers and their services run normally
- Restarting AppArmor service
sudo service snapd.apparmor restart
has no effect - Sometimes the commands actually run properly but even list commands take longer than before to return
- Somehow the frequency of the errors seem to correlate with kernel age or uptime. A host with kernel 5.4.0-91-generic does not seem to be seriously impacted (just a bit of slowness) while a freshly rebooted host on 5.4.0-113-generic fails on every command.
- Rebooting does not help
-
systemctl status
shows all in the green - Happens regardless of current system load; observed on 99.9% idle with no IOwait
Key environment details
- LXD using ZFS on LUKS on SSD
- Ubuntu 20.04
Any clues on where we could debug further or fix if it is the environment at fault?