Cluster nodes getting into a state where instances are created but are in an error state

Hi, I’ve been running into very consistent errors with an LXD (5.0.2) cluster. The core of the issue is that 1-2 of the 3 nodes in my cluster eventually run into what seems to be a database issue, where lxc list on the other nodes shows that the containers are created but are in an ‘ERROR’ state. lxc list also does not work on the node that has failed, and the only way to fix it is with systemctl restart snap.lxd.daemon. Also, lxc cluster list shows all nodes as ‘healthy’ during this time, but that is obviously not the case.
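
For reference, this is roughly what I do when a node gets stuck (a minimal sketch; the 30-second timeout is just an arbitrary value to confirm the hang, not anything LXD-specific):

# cluster membership still reports every member as online
lxc cluster list
# confirm whether lxc list is actually hung on the affected node
timeout 30 lxc list || echo "lxc list hung or failed"
# the only workaround I have found so far
sudo systemctl restart snap.lxd.daemon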

I have created a GitHub issue for this as well, with more logs from around the time it happens. Unfortunately, these recurring issues are making me not love clustering anymore.

Please can you describe your setup in more detail (hardware, network, etc.)?

Also, please can you provide the output of lxc cluster list from before and after the issues start?
Please can you also explain what is happening in the cluster when the problem starts, and whether there is a particular length of time before the problems start?

Finally, please can you explain your cluster member names? They look a little odd. Is this just so they line up with the host names of the servers, or have you added/removed cluster members in the past?

lxc cluster list looks like this both before and after (with the leader changing afterwards):

 # lxc cluster list
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
|  NAME |             URL            |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| node0 | https://<ip addr>.42:8443  | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|       |                            | database        |              |                |             |        |                   |
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| node6 | https://<ip addr>.202:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| node7 | https://<ip addr>.202:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+-------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+

The issues start when I begin CI testing of the Ansible playbooks I’ve made for my homelab plus various other projects. I run about 30 instances across the 3 nodes at a time when the nightly runs start. Shortly after CI starts (anywhere from an hour to 5-6 hours later), the cluster begins to have issues, and sometimes it also happens during the day when nightly CI isn’t running.

The cluster member names are just so they line up with the host names. It has only ever been these 3 servers in the LXD cluster, but I’ve added/removed other nodes for various things before adding nodes 6 & 7. They are all reasonably spec’d servers: nodes 6 & 7 have 40 cores with 256 GB of RAM, and node 0 has 16 cores with 256 GB of RAM. All 3 are on a 1 Gbit copper network. I was also purposely running a lower number of containers at a time because of the CPU difference between node 0 and the others, so that node 0 wouldn’t get overloaded.

Hi,

Were you able to identify the actions/load of the CI testing that caused the cluster issue?

If not, then I suggest starting LXD in debug mode and then collecting the log file output when the nodes enter this state:

sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon

Then, when in that state, get the contents of /var/snap/lxd/common/lxd/logs/lxd.log.
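
For example, a rough way to watch the log while waiting for the failure, and to turn debug mode off again afterwards (assuming the default snap log path shown above):

# follow the debug log until the failure appears
tail -f /var/snap/lxd/common/lxd/logs/lxd.log
# turn debug logging back off once the output has been collected
sudo snap set lxd daemon.debug=false; sudo systemctl reload snap.lxd.daemon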

No. I ended up completely destroying the cluster. Unfortunately, this issue doesn’t seem to be specific to the cluster, as I still have these 3 nodes, now running completely separate LXD instances, getting hung up in the same way as before.

Hi,
as @dgreeley said, this might not be cluster-related at all.
We have a non-cluster setup for our regular testing, and there I might have hit the same symptom (no promise it is the same root cause).

To explain what made me find this: the search engine preview of this discussion still shows msg="Failed cleaning up config drive mount" being reported here before, and that is what I see too.

What I found in the log at the time this started to break was:

$ journalctl -u snap.lxd.daemon --since "2023-06-07 20:49:00" --until "2023-06-07 20:51:00"  --no-pager
Jun 07 20:49:00 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:00Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:01 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:01Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:02 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:02Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:03 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:03Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:04 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:04Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:05 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:05Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:06 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:06Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:08 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:08Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:49:09 kecleon lxd.daemon[2380987]: time="2023-06-07T20:49:09Z" level=warning msg="Error getting disk usage" err="Failed to run: zfs get -H -p -o value used tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block: exit status 1 (cannot open 'tank/lxd/virtual-machines/cloudinit-0607-204744yygql9a4.block': dataset does not exist)" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:50:31 kecleon lxd.daemon[2380987]: time="2023-06-07T20:50:31Z" level=warning msg="Failed cleaning up config drive mount" err="Failed unmounting \"/var/snap/lxd/common/lxd/devices/cloudinit-0607-204744yygql9a4/config.mount\": Failed to unmount \"/var/snap/lxd/common/lxd/devices/cloudinit-0607-204744yygql9a4/config.mount\": device or resource busy" instance=cloudinit-0607-204744yygql9a4 instanceType=virtual-machine project=default
Jun 07 20:50:36 kecleon lxd.daemon[2380987]: time="2023-06-07T20:50:36Z" level=warning msg="Could not get VM state from agent" err="dial unix /var/snap/lxd/common/lxd/logs/bootspeed-kvm-t2large-latest/qemu.monitor: connect: connection refused" instance=bootspeed-kvm-t2large-latest instanceType=virtual-machine project=default
Jun 07 20:50:37 kecleon lxd.daemon[2380987]: time="2023-06-07T20:50:37Z" level=warning msg="Could not get VM state from agent" err="dial unix /var/snap/lxd/common/lxd/logs/bootspeed-kvm-t2large-latest/qemu.monitor: connect: connection refused" instance=bootspeed-kvm-t2large-latest instanceType=virtual-machine project=default
Jun 07 20:50:38 kecleon lxd.daemon[2380987]: time="2023-06-07T20:50:38Z" level=warning msg="Could not get VM state from agent" err="dial unix /var/snap/lxd/common/lxd/logs/bootspeed-kvm-t2large-latest/qemu.monitor: connect: no such file or directory" instance=bootspeed-kvm-t2large-latest instanceType=virtual-machine project=default
Jun 07 20:50:38 kecleon lxd.daemon[2380987]: time="2023-06-07T20:50:38Z" level=warning msg="Failed getting host interface state for MTU" device=eth0 driver=nic err="route ip+net: invalid network interface name" host_name= instance=bootspeed-kvm-t2large-latest project=default

A new launch, no matter whether VM or container, now fails immediately.

$ lxc launch ubuntu-minimal-daily:jammy metric-server-simple-jammy-vm-c1-m1 --ephemeral --vm
Creating metric-server-simple-jammy-vm-c1-m1
Error: Failed instance creation: LXD is shutting down

Also, just as reported before, lxc list hangs.
Restarting snap.lxd.daemon.service is stuck as well.
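
A quick check to confirm it is the daemon itself that is unresponsive, rather than just the CLI (a sketch; the 10-second timeout is an arbitrary value):

$ timeout 10 lxc query /1.0 >/dev/null && echo "daemon responding" || echo "daemon not responding"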

We are unsure whether we can switch all systems to debug mode to catch more next time, because of the potential space or speed implications, but maybe the above already helps?
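
One lighter-weight option we might try instead of full debug (assuming the LXD snap exposes a daemon.verbose setting analogous to daemon.debug) would be:

sudo snap set lxd daemon.verbose=true; sudo systemctl reload snap.lxd.daemon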

It would be useful to know if you are able to get into this state with LXD 5.15, which is coming out next week (or the latest/edge channel), if you’re able to test on a fresh system and don’t need to downgrade back to 5.0 LTS.

This is because there has been some work recently to detect hung VMs and report their status as ERROR so that LXD can then force stop them.
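
On a fresh test system, switching the snap channel would look something like this (channel name assumed; 5.15 should reach latest/stable shortly after release):

sudo snap refresh lxd --channel=latest/edge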

if you’re able to test on a fresh system and don’t need to downgrade back to 5.0 LTS.

That does not apply, but we regularly keep our LXD updated.
Furthermore, we do not know exactly what triggered it, and we have so far only seen it once.

But while we cannot pre-test the new version, we will update a while after it is released and let you know if/when we ever see it again. Based on your coordination with Paride this morning, we have switched on some limited extra debugging to have better data in that case.

Thanks for your continued responsiveness in helping with all that!
