VMs mysteriously stopped

Hello all,

I’ve been noticing something strange where all the VMs on my cluster get stopped.
The setup has 3 cluster members with Ceph storage. Only 3 VM instances are running, spread out as one per node. LXD was installed via snap and is at 5.4-1ff8d34.

Now on to what actually happens: Once in a blue moon (I’ve only caught it twice) all my VMs go into the “STOPPED” state. While this is going on the cluster is perfectly fine and all members show as online, but the log in /var/snap/lxd/common/lxd/logs/lxd.log showed:

time="2022-08-08T00:00:19Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://10.100.222.8:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="10.100.222.8:8443"
time="2022-08-08T00:00:28Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://10.100.222.8:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="10.100.222.8:8443"
time="2022-08-08T00:01:11Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://10.100.222.7:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" remote="10.100.222.7:8443"

This is from cluster member 10.100.222.6 (the others are .7 and .8). Interestingly, /var/snap/lxd/common/lxd/logs/Instance_Name/qemu.log is gone, so no logs can be obtained. Does anybody know if this file is rotated in any way? The last time I checked it, there was only this:

VM: 9p: degraded performance: a reasonable high msize should be chosen on client/guest side 

But those files didn’t contain any timestamps, so this could be old.

On the Ceph side, no issues are shown: the cluster, pools, OSDs etc. are all healthy. System logs on the nodes themselves (syslog and journalctl) show nothing out of the ordinary.

I know this isn’t much to go on, and since it is a sporadic issue I am a bit out of ideas on where else to look. Any pointers on what to do should it happen again (or from anybody who has actually hit this) would be appreciated!

The heartbeat errors suggest a network issue or a response-time issue (perhaps high load on the target server). Heartbeats are sent from the leader member (you can see which one is the leader using lxc cluster ls).
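If it helps, one quick way to see where the heartbeat failures point is to tally them per remote in lxd.log. A minimal sketch, assuming the snap install's log path (pass a different path as the first argument otherwise):

```shell
#!/bin/sh
# Tally "Failed heartbeat" warnings per remote member in the LXD log, so you
# can see which member(s) the leader could not reach.
# The default path assumes the snap install; override it as the first argument.
LOG=${1:-/var/snap/lxd/common/lxd/logs/lxd.log}
grep 'Failed heartbeat' "$LOG" | grep -o 'remote="[^"]*"' | sort | uniq -c
```

Running lxc cluster ls alongside this shows which member currently holds the database-leader role, i.e. which member was sending the heartbeats when they failed.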

Are there still associated qemu processes on the host for the VMs showing as stopped?
And what do you have to do to get your VMs started again?

My monitoring hasn’t caught anything of that nature, though there could be a network blip I’m not detecting. I’ll keep a lookout for that, as well as for the qemu processes; for the past couple of days they have been running. What is weird is that it affects all 3 VMs, each on a separate host. I didn’t think the actual instances would be stopped even if an individual member were kicked out of the cluster, or am I mistaken?

Regarding getting them started again, all that seems to be required is a simple “lxc start” command.
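For what it’s worth, when all three go down at once, a small loop can bring back everything showing as STOPPED. This is a hypothetical convenience sketch, not something LXD provides; it assumes lxc list --format csv -c n,s prints one name,STATUS pair per line:

```shell
#!/bin/sh
# Start every instance currently reported as STOPPED. The csv columns are
# n (name) and s (state); adjust if your lxc version formats them differently.
lxc list --format csv -c n,s | awk -F, '$2 == "STOPPED" {print $1}' |
while read -r name; do
    lxc start "$name"
done
```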

Perhaps the OOM killer killed them and LXD restarted?

I thought the same, but the systemd services for lxd have been running all this time and there are no logs indicating OOM killer did anything.

It’s perfectly possible for the OOM killer to have killed the qemu processes while leaving LXD running unaffected.

That could explain why you see the VMs stopped, but not the heartbeat errors in LXD.

I concur; however, that should at least log something to syslog or journalctl, and both are empty.

Yep. I’m afraid I’m at a loss to explain it and certainly never experienced that before.

Okay, so it happened again. qemu-kvm and LXD are both working just fine and have been for some time according to journald, and there was no OOM or networking event. Interestingly, the logs for the VMs still appear to be empty (if somebody has more info on how I can find out what’s happening with those logs, that would be great), and all I see in the general log around the supposed time of the stoppage is:

time="2022-08-26T19:48:12Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://10.100.222.8:8443/internal/database\": dial tcp 10.100.222.8:8443: connect: connection refused" remote="10.100.222.8:8443"
time="2022-08-26T19:48:12Z" level=warning msg="Failed adding member event listener client" err="Unable to connect to: 10.100.222.8:8443 ([dial tcp 10.100.222.8:8443: connect: connection refused])" local="10.100.222.6:8443" remote="10.100.222.8:8443"
time="2022-08-26T19:48:22Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://10.100.222.8:8443/internal/database\": dial tcp 10.100.222.8:8443: connect: connection refused" remote="10.100.222.8:8443"
time="2022-08-26T19:48:24Z" level=warning msg="Failed adding member event listener client" err="Unable to connect to: 10.100.222.8:8443 ([dial tcp 10.100.222.8:8443: connect: connection refused])" local="10.100.222.6:8443" remote="10.100.222.8:8443"
time="2022-08-26T19:48:34Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://10.100.222.8:8443/internal/database\": dial tcp 10.100.222.8:8443: connect: connection refused" remote="10.100.222.8:8443"
time="2022-08-26T19:48:39Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://10.100.222.8:8443/internal/database\": dial tcp 10.100.222.8:8443: connect: connection refused" remote="10.100.222.8:8443"
time="2022-08-26T19:48:52Z" level=warning msg="Failed heartbeat" err="Failed to send heartbeat request: Put \"https://10.100.222.8:8443/internal/database\": dial tcp 10.100.222.8:8443: connect: connection refused" remote="10.100.222.8:8443"
time="2022-08-26T19:49:10Z" level=warning msg="Failed adding member event listener client" err="read tcp 10.100.222.6:55232->10.100.222.8:8443: i/o timeout" local="10.100.222.6:8443" remote="10.100.222.8:8443"
time="2022-08-26T19:49:10Z" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=11.370391548s interval=10s
time="2022-08-26T19:49:15Z" level=warning msg="Member is out-of-date with respect to other cluster members"
time="2022-08-26T19:49:27Z" level=warning msg="Failed adding member event listener client" err="read tcp 10.100.222.6:55292->10.100.222.8:8443: i/o timeout" local="10.100.222.6:8443" remote="10.100.222.8:8443"
time="2022-08-26T19:49:27Z" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=17.22708249s interval=10s
time="2022-08-26T19:49:36Z" level=warning msg="Failed adding member event listener client" err="read tcp 10.100.222.6:55328->10.100.222.8:8443: i/o timeout" local="10.100.222.6:8443" remote="10.100.222.8:8443"
time="2022-08-26T19:49:45Z" level=warning msg="Failed adding member event listener client" err="read tcp 10.100.222.6:55366->10.100.222.8:8443: i/o timeout" local="10.100.222.6:8443" remote="10.100.222.8:8443"
time="2022-08-26T19:49:58Z" level=warning msg="Failed adding member event listener client" err="read tcp 10.100.222.6:55406->10.100.222.8:8443: i/o timeout" local="10.100.222.6:8443" remote="10.100.222.8:8443"
time="2022-08-26T19:49:58Z" level=warning msg="Heartbeat round duration greater than heartbeat interval" duration=10.631323254s interval=10s
time="2022-08-26T19:49:58Z" level=warning msg="Failed to resolve warning" err="Failed to resolve warnings: no more rows available"
time="2022-08-26T19:49:58Z" level=warning msg="Failed to resolve warning" err="Failed to resolve warnings: no more rows available"
time="2022-08-26T19:49:58Z" level=error msg="Failed updating cluster heartbeats" err="no more rows available"
time="2022-08-26T19:49:58Z" level=warning msg="Rejecting heartbeat request as shutting down"
time="2022-08-26T19:49:58Z" level=warning msg="Rejecting heartbeat request as shutting down"
time="2022-08-26T19:51:24Z" level=warning msg=" - Couldn't find the CGroup blkio.weight, disk priority will be ignored"
time="2022-08-26T19:51:24Z" level=warning msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
time="2022-08-26T19:51:25Z" level=warning msg="Dqlite: attempt 1: server 10.100.222.6:8443: no known leader"
time="2022-08-26T19:51:26Z" level=warning msg="Failed to initialize fanotify, falling back on fsnotify" err="Failed to initialize fanotify: invalid argument"
time="2022-08-26T19:51:31Z" level=warning msg="Failed to delete operation" class=task description="Pruning leftover image files" err="Operation not found" operation=5a0d26c5-a9eb-40fa-b18b-7441c6674f91 project= status=Success
time="2022-08-26T19:51:31Z" level=warning msg="Failed to delete operation" class=task description="Cleaning up expired images" err="Operation not found" operation=ca05d285-163d-4a08-9adb-532a47190e9d project= status=Success
time="2022-08-26T19:51:31Z" level=warning msg="Failed to delete operation" class=task description="Cleaning up expired instance backups" err="Operation not found" operation=63bf4684-beaf-48d0-9e8d-3bcedecbdaa9 project= status=Success
time="2022-08-26T19:51:31Z" level=warning msg="Failed to delete operation" class=task description="Pruning resolved warnings" err="Operation not found" operation=9238e116-733a-4be6-a502-ac9d6283f79d project= status=Success
time="2022-08-26T19:51:31Z" level=warning msg="Failed to delete operation" class=task description="Updating images" err="Operation not found" operation=5cd3b3b3-29d9-4f24-9388-c332bb4cfaf1 project= status=Success
time="2022-08-26T19:51:31Z" level=warning msg="Failed to delete operation" class=task description="Expiring log files" err="Operation not found" operation=7e80411e-987b-4df3-a7f8-ad66a4b80483 project= status=Success
time="2022-08-26T19:51:33Z" level=warning msg="Failed to delete operation" class=task description="Updating instance types" err="Operation not found" operation=32a90da7-c6ea-4254-ba6a-4df4f28db324 project= status=Success
time="2022-08-28T15:15:09Z" level=error msg="Error updating raft members" err="Failed to begin transaction: context deadline exceeded"

Interestingly, there is almost nothing for the 2 days after the last problem. LXD also seems to have restarted as part of a snap update? That is fine, but in my experience snap updates don’t leave dead instances behind. These VMs have the autostart flag set to true as well.

You could enable debug logging by doing:

sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon

on each cluster member.

The LXD process will restart when snap is updating it (so seeing those messages is not unusual if other cluster members are also restarting). But this doesn’t stop VMs (we don’t stop instances as part of a snap update).
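To line the daemon restarts up with snap refreshes, you can pull the unit’s start/stop events out of the journal. A sketch, assuming the LXD snap’s standard snap.lxd.daemon unit name (snap changes lxd also lists recent refreshes with timestamps):

```shell
#!/bin/sh
# List start/stop events for the LXD daemon unit so they can be compared
# against the times the VMs went down. Unit name is the LXD snap's default.
journalctl -u snap.lxd.daemon --no-pager | grep -E 'Started|Stopped|Stopping'
```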

What did you mean by “Qemu-kvm and lxd are both working just fine”? Presumably they aren’t because I thought qemu-kvm processes keep stopping unexpectedly, or am I missing something and the processes are still running?

I meant that the VMs are stopping, yet the qemu-kvm and LXD processes themselves keep working. I’ve enabled the debug logging on each member. Things are stable for now though; I will have to wait for them to die again.

So the processes are still around after the VMs are showing as stopped?

Exactly. It is as if something is manually stopping them. That is why I asked whether there is any further per-instance logging showing when they were stopped, etc.

Next time it happens, can you capture the full sudo ps aux | grep qemu output and show lxc config show <instance> for each VM, please?

Sorry to bump this thread. I have the same problem and am trying to figure out what is happening. This is the output after running the commands you suggested.

sudo ps aux | grep qemu

lxd       982454  3.6  1.6 2026596 1103120 ?     Sl   Sep06 259:04 /snap/lxd/22753/bin/qemu-system-x86_64 -S -name p-mars-digitalsign -uuid 7e52a078-44f3-42b8-9252-fd546cd90493 -daemonize -cpu host -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/p-mars-digitalsign/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/p-mars-digitalsign/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/p-mars-digitalsign/qemu.pid -D /var/snap/lxd/common/lxd/logs/p-mars-digitalsign/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd
ubuntu   1440252  0.0  0.0  16152  1068 pts/0    S+   05:37   0:00 grep --color=auto qemu

lxc config show p-mars-digitalsign

architecture: x86_64
config:
  image.architecture: amd64
  image.description: Centos 7 amd64 (20210720_07:08)
  image.os: Centos
  image.release: "7"
  image.serial: "20210720_07:08"
  image.type: disk-kvm.img
  image.variant: default
  snapshots.expiry: 3d
  volatile.base_image: cd00fcc172ade58afe970be742096c448ade6f9e3874d93712ba44c91da82345
  volatile.eth0.host_name: mac5976aace
  volatile.eth0.hwaddr: 00:16:3e:b2:12:42
  volatile.eth0.last_state.created: "false"
  volatile.last_state.power: RUNNING
  volatile.uuid: 7e52a078-44f3-42b8-9252-fd546cd90493
  volatile.vsock_id: "61"
devices:
  root:
    path: /
    pool: local
    size: 20GiB
    type: disk
ephemeral: false
profiles:
- esign-vm
stateful: false
description: ""

lxc list

+--------------------+---------+------+------+-----------------+-----------+----------+
|         NAME       |  STATE  | IPV4 | IPV6 |      TYPE       | SNAPSHOTS | LOCATION |
+--------------------+---------+------+------+-----------------+-----------+----------+
| d-mars-digitalsign | RUNNING |      |      | VIRTUAL-MACHINE | 3         | mars     |
+--------------------+---------+------+------+-----------------+-----------+----------+
| p-mars-digitalsign | STOPPED |      |      | VIRTUAL-MACHINE | 0         | deimos   |
+--------------------+---------+------+------+-----------------+-----------+----------+

Every time this happens, I kill the qemu process with sudo kill <pid> and then start the VM.

Sounds like QEMU is hanging/locking up. Can you configure an SSH server inside the VM so that the next time it happens you can confirm whether or not you can SSH into it? That will help identify whether the entire VM has crashed/hung or whether it is just the communication between LXD and the lxd-agent that is failing.

It seems a VM has stopped again, though only one in this case. No qemu process is showing up on the node where the VM is:

root@node03:~# ps aux | grep qemu
root     2426499  0.0  0.0   8160  2420 pts/1    S+   16:12   0:00 grep --color=auto qemu
root@node03:~#

Also, it doesn’t appear to be the LXD agent, as the VM isn’t pinging either and none of the services on it are reachable (SSH, Apache, etc.).

Here is the VM config for the one that stopped:

root@node03:~# lxc config show secure02
architecture: x86_64
config:
  boot.autostart: "true"
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (cloud) (20220127_07:43)
  image.name: ubuntu-focal-amd64-cloud-20220127_07:43
  image.os: ubuntu
  image.release: focal
  image.serial: "20220127_07:43"
  image.variant: cloud
  volatile.base_image: f9c4b6ac4b1ed136c8272467e6a11c21ee9d974f1aaef22334611175c58bf025
  volatile.cloud-init.instance-id: 39ac779d-dc50-4daf-a512-f8d50ef94a24
  volatile.enp5s0.host_name: tap49f9f580
  volatile.enp5s0.hwaddr: 00:16:3e:83:c4:2f
  volatile.last_state.power: RUNNING
  volatile.uuid: a4462ddb-8558-44d3-8688-c2914d0a3d26
  volatile.vsock_id: "40"
devices: {}
ephemeral: false
profiles:
- network
- secure02
stateful: false
description: ""

If the qemu process has gone, that suggests qemu itself has crashed, or something on your host system (systemd-oomd or the kernel OOM killer) has killed it.
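If it strikes again, it may be worth grepping the kernel log directly for OOM evidence, since a kill of the qemu child would show up there even if LXD’s own unit looks clean. A sketch; the default log path varies by distro, and dmesg or journalctl -k work as alternative sources:

```shell
#!/bin/sh
# Search the kernel log for OOM-killer activity. A qemu kill typically logs a
# line like "Out of memory: Killed process <pid> (qemu-system-x86)".
# Default path is an assumption (Debian/Ubuntu style); override as $1.
KLOG=${1:-/var/log/kern.log}
grep -iE 'out of memory|oom-killer|oom_reaper|killed process' "$KLOG"
```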