Cluster nodes getting into a state where instances are created but are in an error state

Hi, I’ve been running into very consistent errors with an LXD (5.0.2) cluster. The core of the issue is that 1-2 of the 3 nodes in my cluster eventually run into what seems to be a database issue, where lxc list on the other nodes shows that the containers are created but are in an ‘ERROR’ state. lxc list also does not work on the node that has failed, and the only way to fix it is with systemctl restart snap.lxd.daemon. Also, lxc cluster list shows all nodes as ‘healthy’ during this time, but that is obviously not the case.
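
For reference, this is roughly what I do when a node gets stuck (a minimal sketch; the 30-second timeout is just an arbitrary value to confirm the hang, not anything LXD-specific):

# cluster membership still reports every member as online
lxc cluster list
# confirm whether lxc list is actually hung on the affected node
timeout 30 lxc list || echo "lxc list hung or failed"
# the only workaround I have found so far
sudo systemctl restart snap.lxd.daemon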

I have created a GitHub issue for this as well, with more logs from around the time it happens. Unfortunately, these recurring issues are making me not love clustering anymore.

Please can you describe your setup in more detail (hardware, network, etc.)?

Also, please can you provide the output of lxc cluster list from before and after the issues start?
Please can you also explain what is happening in the cluster when the problem starts, and whether there is a particular length of time before the problems start?

Finally, please can you explain your cluster member names? They look a little odd. Is this just so they line up with the host names of the servers, or have you added/removed cluster members in the past?

lxc cluster list looks like this both before and after (with the leader changing afterwards):

 # lxc cluster list
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
|  NAME |             URL            |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| node0 | https://<ip addr>.42:8443  | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|       |                            | database        |              |                |             |        |                   |
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| node6 | https://<ip addr>.202:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| node7 | https://<ip addr>.202:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+

The issues start when I begin CI testing of the Ansible playbooks I’ve made for my homelab plus various other projects. I run about 30 instances across the 3 nodes at a time when the nightly runs start. Shortly after CI starts (anywhere from an hour to 5-6 hours later), the cluster begins to have issues, and sometimes it also happens during the day when nightly CI isn’t running.

The cluster member names are just so they line up with the host names. It has only ever been these 3 servers in the LXD cluster, but I’ve added/removed other nodes for various things before adding nodes 6 & 7. They are all reasonably spec’d servers: nodes 6 & 7 have 40 cores with 256 GB of RAM, and node 0 has 16 cores with 256 GB of RAM. All 3 are on a 1 Gbit copper network. I was also purposely running a lower number of containers at a time because of the CPU difference between node 0 and the others, so that node 0 wouldn’t get overloaded.

Hi,

Were you able to identify the actions/load of the CI testing that caused the cluster issue?

If not, then I suggest starting LXD in debug mode and then collecting the log file output when the nodes enter this state:

sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon

Then, when in that state, get the contents of /var/snap/lxd/common/lxd/logs/lxd.log.
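
For example, a rough way to watch the log while waiting for the failure, and to turn debug mode off again afterwards (assuming the default snap log path shown above):

# follow the debug log until the failure appears
tail -f /var/snap/lxd/common/lxd/logs/lxd.log
# turn debug logging back off once the output has been collected
sudo snap set lxd daemon.debug=false; sudo systemctl reload snap.lxd.daemon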

No. I ended up completely destroying the cluster. Unfortunately, this issue doesn’t seem to be specific to the cluster, as I still have these 3 nodes, now running completely separate LXD instances, getting hung up in the same way as before.

Hi,
as @dgreeley said, this might not be cluster-related at all.
We have a non-cluster setup for our regular testing, and there I might have hit the same symptom (no promise it is the same root cause).

To explain what made me find this: the search engine preview of this discussion still shows msg="Failed cleaning up config drive mount" being reported here before, and that is what I see too.

What I found in the log at the time this started to break was:

$ journalctl -u snap.lxd.daemon --since "2023-06-07 20:49:00" --until "2023-06-07 20:51:00"  --no-pager
Jun 07 20:49:00 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:00Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:01 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:01Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:02 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:02Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:03 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:03Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:04 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:04Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:05 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:05Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:06 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:06Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:08 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:08Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:09 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:09Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:50:31 kecleon lxd.daemon[2380987]: time="2023-06-07T20:50:31Z" level=warning msg="Failed cleaning up config drive mount" err="Failed unmounting \"/var/snap/lxd/common/lxd/devices/cloudinit-0607-204744yygql9a4/config.mount\": Failed to unmount \"/var/snap/lxd/common/lxd/devices/cloudinit-0607-204744yygql9a4/config.mount\": device or resource busy" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:50:36 kecleon lxd.daemon[2380987]: time="2023-06-07T20:50:36Z" level=warning msg="Could not get VM state from agent" err="dial unix /var/snap/lxd/common/lxd/logs/bootspeed-kvm-t2large-latest/qemu.monitor: connect: connection refused" instance=bootspeed-kvm-t2large-latest instanceType=virtual-machine project=default
Jun 07 20:50:37 kecleon lxd.daemon[2380987]: time="2023-06-07T20:50:37Z" level=warning msg="Could not get VM state from agent" err="dial unix /var/snap/lxd/common/lxd/logs/bootspeed-kvm-t2large-latest/qemu.monitor: connect: connection refused" instance=bootspeed-kvm-t2large-latest instanceType=virtual-machine project=default
Jun 07 20:50:38 kecleon lxd.daemon[2380987]: time="2023-06-07T20:50:38Z" level=warning msg="Could not get VM state from agent" err="dial unix /var/snap/lxd/common/lxd/logs/bootspeed-kvm-t2large-latest/qemu.monitor: connect: no such file or directory" instance=bootspeed-kvm-t2large-latest instanceType=virtual-machine project=default
Jun 07 20:50:38 kecleon lxd.daemon[2380987]: time="2023-06-07T20:50:38Z" level=warning msg="Failed getting host interface state for MTU" device=eth0 driver=nic err="route ip+net: invalid network interface name" host_name= instance=bootspeed-kvm-t2large-latest project=default

A new launch, no matter whether VM or container, now fails immediately.

$ lxc launch ubuntu-minimal-daily:jammy metric-server-simple-jammy-vm-c1-m1 --ephemeral --vm
Creating metric-server-simple-jammy-vm-c1-m1
Error: Failed instance creation: LXD is shutting down

Also, just as reported before, lxc list hangs.
Restarting snap.lxd.daemon.service is stuck as well.
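
A quick check to confirm it is the daemon itself that is unresponsive, rather than just the CLI (a sketch; the 10-second timeout is an arbitrary value):

$ timeout 10 lxc query /1.0 >/dev/null && echo "daemon responding" || echo "daemon not responding"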

We are unsure whether we can switch all systems to debug mode to catch more next time, because of the potential space or speed implications, but maybe the above already helps?
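
One lighter-weight option we might try instead of full debug (assuming the LXD snap exposes a daemon.verbose setting analogous to daemon.debug) would be:

sudo snap set lxd daemon.verbose=true; sudo systemctl reload snap.lxd.daemon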

It would be useful to know if you are able to get into this state with LXD 5.15, which is coming out next week (or the latest/edge channel), if you’re able to test on a fresh system and don’t need to downgrade back to 5.0 LTS.

This is because there has been some work recently to detect hung VMs and report their status as ERROR so that LXD can then force stop them.
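
On a fresh test system, switching the snap channel would look something like this (channel name assumed; 5.15 should reach latest/stable shortly after release):

sudo snap refresh lxd --channel=latest/edge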

if you’re able to test on a fresh system and don’t need to downgrade back to 5.0 LTS.

That does not apply, but we regularly keep our LXD updated.
Furthermore, we do not know exactly what triggered it, and we have so far only seen it once.

But while we cannot pre-test the new version, we will update a while after it is released and let you know if/when we ever see it again. Based on your coordination with Paride this morning, we have switched on some limited extra debugging to have better data in that case.

Thanks for your continued responsiveness in helping with all that!
