VM management operations is slow in LXD

VM operations(start, stop and restart) on LDX seems to be much slower than VM with KVM/libvirt.
My configuration is VM’s memory is 256GB and one of PCI device (Infiniband) is attached as PCI passthrough in both LXD and KVM/libvirt cases.
In that case, when VM starts, all assigned RAM to VM need to be allocated with qemu process which takes some time.

Here is VM setup on LXD.

# lxc config show r8-lxd
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Rockylinux 8 amd64 (20230124_02:06)
  image.os: Rockylinux
  image.release: "8"
  image.serial: "20230124_02:06"
  image.type: disk-kvm.img
  image.variant: cloud
  limits.cpu: "32"
  limits.memory: 256GiB
  volatile.apply_template: create
  volatile.base_image: ee4a05eaa568a47e284cba4e442eea4187fa2adecf9b0856295d4f2db1e42457
  volatile.cloud-init.instance-id: 49107854-6846-4891-bbb7-eb635e66c7bd
  volatile.eth0.hwaddr: 00:16:3e:4a:9d:c6
  volatile.uuid: d3781c55-81cb-4119-8e13-e9e4ca66a228
devices:
  infiniband0:
    address: b1:00.0
    type: pci
ephemeral: false
profiles:
- default
stateful: false
description: ""

VM boot

KVM/libvirt

time virt-install --name=r8-kvm --vcpus=32 --ram=262144 \
--disk path=/var/lib/libvirt/images/r8-kvm.img,size=32,format=qcow2 \
--cdrom=/work/OS/RockyLinux/8.6/Rocky-8.6-x86_64-minimal.iso \
--network bridge:br0 --hostdev b1:00.0,address.type=pci \
--graphics vnc,port=5900,listen=0.0.0.0,password=123456 --debug
[Wed, 25 Jan 2023 11:31:39 virt-install 15151] DEBUG (cli:204) Launched with command line: /usr/bin/virt-install --name=r8-kvm --vcpus=32 --ram=262144 --disk path=/var/lib/libvirt/images/r8-kvm.img,size=32,format=qcow2 --cdrom=/work/OS/RockyLinux/8.6/Rocky-8.6-x86_64-minimal.iso --network bridge:br0 --hostdev b1:00.0,address.type=pci --graphics vnc,port=5900,listen=0.0.0.0,password=123456 --debug

-- snip - 
Starting install...
[Wed, 25 Jan 2023 11:31:40 virt-install 15151] DEBUG (installer:587) Generated initial_xml: 

-- snip -- 

Domain is still running. Installation may be in progress.
[Wed, 25 Jan 2023 11:32:30 virt-install 15151] DEBUG (cli:265) Waiting for the installation to complete.

It took ~50sec until VM up. (majority of time was allocating 256GB memory)

LXD

# time lxc start r8-lxd

real 1m27.905s
user 0m0.047s
sys 0m0.024s

VM force shutdown
KVM/libvirt

root@ice01:~# time virsh destroy r8-kvm
Domain 'r8-kvm' destroyed

real 0m3.346s
user 0m0.008s
sys 0m0.008s

LXD

# time lxc stop r8-lxd --force 
Error: Instance "stop" operation timed out after 30s
Try `lxc info --show-log r8-lxd` for more info

real 0m30.347s
user 0m0.036s
sys 0m0.036s

VM reboot

VM reboot is totally diferent behaviors.
with LXD, it takes several minutes until VM comes back again. Because VM frees all memory at shutdown, then allocate them to boot up again.
reboot VM on KVM/libvert, it finished in less than 1 min since it keeps all allocated 256GB RAM during reboot.

Run lxc monitor --type=logging --pretty in a separate window and then repeat the operations to see which steps are taking the time.

@tomp here is debug logs collected with lxc monitor.

VM boot

# lxc monitor --type=logging --pretty
DEBUG  [2023-01-25T12:41:55Z] Event listener server handler started         id=0d8b5da9-21b8-44cc-8cb7-3971fa23f405 local=/var/snap/lxd/common/lxd/unix.socket remote=@
DEBUG  [2023-01-25T12:42:06Z] Handling API request                          ip=@ method=GET protocol=unix url=/1.0 username=root
DEBUG  [2023-01-25T12:42:06Z] Handling API request                          ip=@ method=GET protocol=unix url=/1.0/instances/r8-lxd username=root
DEBUG  [2023-01-25T12:42:06Z] Handling API request                          ip=@ method=GET protocol=unix url=/1.0/events username=root
DEBUG  [2023-01-25T12:42:06Z] Handling API request                          ip=@ method=PUT protocol=unix url=/1.0/instances/r8-lxd/state username=root
DEBUG  [2023-01-25T12:42:06Z] Event listener server handler started         id=791e6450-2f6c-42b0-a4e3-ea3c18c7c9a7 local=/var/snap/lxd/common/lxd/unix.socket remote=@
DEBUG  [2023-01-25T12:42:06Z] Start started                                 instance=r8-lxd instanceType=virtual-machine project=default stateful=false
DEBUG  [2023-01-25T12:42:06Z] New operation                                 class=task description="Starting instance" operation=8bed658c-ed30-4f34-8784-027a630ef667 project=default
DEBUG  [2023-01-25T12:42:06Z] Started operation                             class=task description="Starting instance" operation=8bed658c-ed30-4f34-8784-027a630ef667 project=default
DEBUG  [2023-01-25T12:42:06Z] Instance operation lock created               action=start instance=r8-lxd project=default reusable=false
DEBUG  [2023-01-25T12:42:06Z] Handling API request                          ip=@ method=GET protocol=unix url=/1.0/operations/8bed658c-ed30-4f34-8784-027a630ef667 username=root
DEBUG  [2023-01-25T12:42:06Z] MountInstance started                         instance=r8-lxd project=default
DEBUG  [2023-01-25T12:42:06Z] Skipping lxd-agent install as unchanged       installPath=/var/snap/lxd/common/lxd/virtual-machines/r8-lxd/config/lxd-agent instance=r8-lxd instanceType=virtual-machine project=default srcPath=/snap/lxd/24323/bin/lxd-agent
DEBUG  [2023-01-25T12:42:06Z] MountInstance finished                        instance=r8-lxd project=default
DEBUG  [2023-01-25T12:42:06Z] Starting device                               device=eth0 instance=r8-lxd instanceType=virtual-machine project=default type=nic
DEBUG  [2023-01-25T12:42:06Z] Starting device                               device=infiniband0 instance=r8-lxd instanceType=virtual-machine project=default type=pci
DEBUG  [2023-01-25T12:42:06Z] Starting device                               device=root instance=r8-lxd instanceType=virtual-machine project=default type=disk
WARNING[2023-01-25T12:42:12Z] Using writeback cache I/O                     devPath=/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/r8-lxd/root.img device=root fsType=btrfs instance=r8-lxd instanceType=virtual-machine project=default
DEBUG  [2023-01-25T12:42:13Z] UpdateInstanceBackupFile started              instance=r8-lxd project=default
DEBUG  [2023-01-25T12:42:13Z] Skipping unmount as in use                    driver=btrfs pool=default refCount=1 volName=r8-lxd
DEBUG  [2023-01-25T12:42:13Z] UpdateInstanceBackupFile finished             instance=r8-lxd project=default
DEBUG  [2023-01-25T12:42:43Z] Instance operation lock finished              action=start err="Instance \"start\" operation timed out after 30s" instance=r8-lxd project=default reusable=false
DEBUG  [2023-01-25T12:43:33Z] Success for operation                         class=task description="Starting instance" operation=8bed658c-ed30-4f34-8784-027a630ef667 project=default
DEBUG  [2023-01-25T12:43:33Z] Start finished                                instance=r8-lxd instanceType=virtual-machine project=default stateful=false
DEBUG  [2023-01-25T12:43:33Z] Event listener server handler stopped         listener=791e6450-2f6c-42b0-a4e3-ea3c18c7c9a7 local=/var/snap/lxd/common/lxd/unix.socket remote=@
DEBUG  [2023-01-25T12:43:47Z] Instance agent started                        instance=r8-lxd instanceType=virtual-machine project=default
DEBUG  [2023-01-25T12:43:47Z] Sending request to LXD                        etag= method=GET url="https://custom.socket/1.0"
DEBUG  [2023-01-25T12:43:47Z] Connecting to a VM agent over a VM socket     DEBUG  [2023-01-25T12:43:48Z] Got response struct from LXD                  DEBUG  [2023-01-25T12:43:48Z] 

VM force shutdown

root@ice01:~# time lxc stop r8-lxd --force

DEBUG  [2023-01-25T12:45:05Z] Handling API request                          ip=@ method=GET protocol=unix url=/1.0 username=root
DEBUG  [2023-01-25T12:45:05Z] Handling API request                          ip=@ method=GET protocol=unix url=/1.0/events username=root
DEBUG  [2023-01-25T12:45:05Z] Event listener server handler started         id=b3fe223c-61c1-4516-982e-9f4e89819c03 local=/var/snap/lxd/common/lxd/unix.socket remote=@
DEBUG  [2023-01-25T12:45:05Z] Handling API request                          ip=@ method=PUT protocol=unix url=/1.0/instances/r8-lxd/state username=root
DEBUG  [2023-01-25T12:45:05Z] Stop started                                  instance=r8-lxd instanceType=virtual-machine project=default stateful=false
DEBUG  [2023-01-25T12:45:05Z] New operation                                 class=task description="Stopping instance" operation=40f22676-baf2-48bd-9b97-c2cae3a8f45b project=default
DEBUG  [2023-01-25T12:45:05Z] Started operation                             class=task description="Stopping instance" operation=40f22676-baf2-48bd-9b97-c2cae3a8f45b project=default
DEBUG  [2023-01-25T12:45:05Z] Instance operation lock created               action=stop instance=r8-lxd project=default reusable=false
DEBUG  [2023-01-25T12:45:05Z] Handling API request                          ip=@ method=GET protocol=unix url=/1.0/operations/40f22676-baf2-48bd-9b97-c2cae3a8f45b username=root
DEBUG  [2023-01-25T12:45:05Z] onStop hook started                           instance=r8-lxd instanceType=virtual-machine project=default target=stop
DEBUG  [2023-01-25T12:45:05Z] Waiting for VM process to finish              instance=r8-lxd instanceType=virtual-machine project=default
DEBUG  [2023-01-25T12:45:05Z] Instance stopped                              instance=r8-lxd instanceType=virtual-machine project=default reason=host-qmp-quit target=stop
DEBUG  [2023-01-25T12:45:05Z] VM process finished                           instance=r8-lxd instanceType=virtual-machine project=default
DEBUG  [2023-01-25T12:45:05Z] Stopping device                               device=infiniband0 instance=r8-lxd instanceType=virtual-machine project=default type=pci
DEBUG  [2023-01-25T12:45:35Z] Instance operation lock finished              action=stop err="Instance \"stop\" operation timed out after 30s" instance=r8-lxd project=default reusable=false
DEBUG  [2023-01-25T12:45:35Z] Failure for operation                         class=task description="Stopping instance" err="Instance \"stop\" operation timed out after 30s" operation=40f22676-baf2-48bd-9b97-c2cae3a8f45b project=default
DEBUG  [2023-01-25T12:45:35Z] Stop finished                                 instance=r8-lxd instanceType=virtual-machine project=default stateful=false
DEBUG  [2023-01-25T12:45:35Z] Event listener server handler stopped         listener=b3fe223c-61c1-4516-982e-9f4e89819c03 local=/var/snap/lxd/common/lxd/unix.socket remote=@
DEBUG  [2023-01-25T12:45:49Z] Stopping device                               device=root instance=r8-lxd instanceType=virtual-machine project=default type=disk
DEBUG  [2023-01-25T12:45:49Z] Stopping device                               device=eth0 instance=r8-lxd instanceType=virtual-machine project=default type=nic
DEBUG  [2023-01-25T12:45:49Z] UnmountInstance started                       instance=r8-lxd project=default
DEBUG  [2023-01-25T12:45:49Z] UnmountInstance finished                      instance=r8-lxd project=default
DEBUG  [2023-01-25T12:45:49Z] onStop hook finished                          instance=r8-lxd instanceType=virtual-machine project=default target=stop

VM reboot

DEBUG  [2023-01-25T12:49:06Z] WebsocketRecvStream got error getting next reader  err="websocket: close 1006 (abnormal closure): unexpected EOF"
DEBUG  [2023-01-25T12:49:06Z] Instance process stopped                      PID=0 err="websocket: close 1006 (abnormal closure): unexpected EOF" exitStatus=-1 instance=r8-lxd interactive=true project=default
DEBUG  [2023-01-25T12:49:06Z] Exec control handler finished                 PID=0 instance=r8-lxd interactive=true project=default
DEBUG  [2023-01-25T12:49:06Z] WebsocketRecvStream got error getting next reader  err="websocket: close 1006 (abnormal closure): unexpected EOF"
DEBUG  [2023-01-25T12:49:06Z] Updated metadata for operation                class=websocket description="Executing command" operation=3a234de9-0cc3-451b-9e62-fc2b0dc90c9f project=default
DEBUG  [2023-01-25T12:49:06Z] Exec mirror websocket finished                PID=0 instance=r8-lxd interactive=true number=0 project=default
DEBUG  [2023-01-25T12:49:06Z] Failure for operation                         class=websocket description="Executing command" err="websocket: close 1006 (abnormal closure): unexpected EOF" operation=3a234de9-0cc3-451b-9e62-fc2b0dc90c9f project=default
DEBUG  [2023-01-25T12:49:06Z] Event listener server handler stopped         listener=6e59f133-6832-437a-bc1d-8f6b0677fdc2 local=/var/snap/lxd/common/lxd/unix.socket remote=@
DEBUG  [2023-01-25T12:49:06Z] onStop hook started                           instance=r8-lxd instanceType=virtual-machine project=default target=reboot
DEBUG  [2023-01-25T12:49:06Z] Instance initiated stop                       action=reboot instance=r8-lxd instanceType=virtual-machine project=default
DEBUG  [2023-01-25T12:49:06Z] Instance operation lock created               action=restart instance=r8-lxd project=default reusable=false
DEBUG  [2023-01-25T12:49:06Z] Instance stopped                              instance=r8-lxd instanceType=virtual-machine project=default reason=guest-reset target=reboot
DEBUG  [2023-01-25T12:49:06Z] Waiting for VM process to finish              instance=r8-lxd instanceType=virtual-machine project=default
DEBUG  [2023-01-25T12:49:06Z] VM process finished                           instance=r8-lxd instanceType=virtual-machine project=default
DEBUG  [2023-01-25T12:49:06Z] Stopping device                               device=infiniband0 instance=r8-lxd instanceType=virtual-machine project=default type=pci
DEBUG  [2023-01-25T12:49:36Z] Instance operation lock finished              action=restart err="Instance \"restart\" operation timed out after 30s" instance=r8-lxd project=default reusable=false
DEBUG  [2023-01-25T12:49:50Z] Stopping device                               device=root instance=r8-lxd instanceType=virtual-machine project=default type=disk
DEBUG  [2023-01-25T12:49:50Z] Stopping device                               device=eth0 instance=r8-lxd instanceType=virtual-machine project=default type=nic
DEBUG  [2023-01-25T12:49:50Z] UnmountInstance started                       instance=r8-lxd project=default
DEBUG  [2023-01-25T12:49:50Z] MountInstance started                         instance=r8-lxd project=default
DEBUG  [2023-01-25T12:49:50Z] Start started                                 instance=r8-lxd instanceType=virtual-machine project=default stateful=false
DEBUG  [2023-01-25T12:49:50Z] UnmountInstance finished                      instance=r8-lxd project=default
DEBUG  [2023-01-25T12:49:50Z] Instance operation lock created               action=start instance=r8-lxd project=default reusable=false
DEBUG  [2023-01-25T12:49:50Z] Skipping lxd-agent install as unchanged       installPath=/var/snap/lxd/common/lxd/virtual-machines/r8-lxd/config/lxd-agent instance=r8-lxd instanceType=virtual-machine project=default srcPath=/snap/lxd/24323/bin/lxd-agent
DEBUG  [2023-01-25T12:49:50Z] MountInstance finished                        instance=r8-lxd project=default
DEBUG  [2023-01-25T12:49:50Z] Starting device                               device=eth0 instance=r8-lxd instanceType=virtual-machine project=default type=nic
DEBUG  [2023-01-25T12:49:50Z] Starting device                               device=infiniband0 instance=r8-lxd instanceType=virtual-machine project=default type=pci
DEBUG  [2023-01-25T12:49:50Z] Starting device                               device=root instance=r8-lxd instanceType=virtual-machine project=default type=disk
WARNING[2023-01-25T12:49:55Z] Using writeback cache I/O                     devPath=/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/r8-lxd/root.img device=root fsType=btrfs instance=r8-lxd instanceType=virtual-machine project=default
DEBUG  [2023-01-25T12:49:56Z] UpdateInstanceBackupFile started              instance=r8-lxd project=default
DEBUG  [2023-01-25T12:49:56Z] UpdateInstanceBackupFile finished             instance=r8-lxd project=default
DEBUG  [2023-01-25T12:49:56Z] Skipping unmount as in use                    driver=btrfs pool=default refCount=1 volName=r8-lxd
DEBUG  [2023-01-25T12:50:26Z] Instance operation lock finished              action=start err="Instance \"start\" operation timed out after 30s" instance=r8-lxd project=default reusable=false
DEBUG  [2023-01-25T12:51:15Z] Start finished                                instance=r8-lxd instanceType=virtual-machine project=default stateful=false
DEBUG  [2023-01-25T12:51:15Z] onStop hook finished                          instance=r8-lxd instanceType=virtual-machine project=default target=reboot
DEBUG  [2023-01-25T12:51:35Z] Instance agent started                        instance=r8-lxd instanceType=virtual-machine project=default
DEBUG  [2023-01-25T12:51:35Z] Sending request to LXD                        etag= method=GET url="https://custom.socket/1.0"
DEBUG  [2023-01-25T12:51:35Z] Connecting to a VM agent over a VM socket     DEBUG  [2023-01-25T12:51:35Z] 

What version of LXD?

Looks like in the forced stop and reboot case we can see the infiniband pci device was taking a very long time to stop.

Are you able to try the same VM config without the infiniband device to see if thats the only issue?

What version of LXD?

# lxd --version 
5.10

Are you able to try the same VM config without the infiniband device to see if thats the only issue?

Yes, Without Infiniband is fast.

root@ice01:~# lxc config device remove r8-lxd infiniband0
Device infiniband0 removed from r8-lxd
root@ice01:~# time lxc start r8-lxd

real 0m4.866s
user 0m0.028s
sys 0m0.044s
root@ice01:~# time lxc stop r8-lxd -f

real 0m0.696s
user 0m0.028s
sys 0m0.048s

However, Infiniband doesn’t matter. it might be related to any PCI devices.
Ran same test, but other PCI device (PCI Optane device) instead of Infiniband

# lspci | grep SSD
98:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [Optane]

# lxc config device add r8-lxd ssd pci address=98:00.0

# time lxc start r8-lxd

real 1m22.019s
user 0m0.031s

sys 0m0.043s

# time lxc stop r8-lxd -f

real 0m17.691s
user 0m0.025s
sys 0m0.047s

It’s a bit faster, but still longer than KVM/libvert.

What sort of infiniband device are you passing in?
Could you pass it via https://linuxcontainers.org/lxd/docs/master/reference/devices_infiniband/

It maybe it helps us understand what the delay is.