LXD hangs due to sequence of start/stop events

I have been trying to benchmark LXD container and VM creation cycles and have come across a few issues. The script looks something like this:

UBUNTU_VM_IMAGE='ubuntu:20.04'
UBUNTU_VM_NAME='ubuntu-vm-test'

NUM_VMS=5

TOTAL_CREATE_RUNTIME=0
TOTAL_SHUTDOWN_RUNTIME=0
TOTAL_BOOT_RUNTIME=0
TOTAL_REBOOT_RUNTIME=0
TOTAL_DELETE_RUNTIME=0

echo "Started VM creation"
i=0
while [ "$i" -lt "$NUM_VMS" ]; do
    CREATE_START=`date +%s.%N`
    lxc launch $UBUNTU_VM_IMAGE $UBUNTU_VM_NAME-$i --vm
    CREATE_END=`date +%s.%N`
    TOTAL_CREATE_RUNTIME=$( echo "$CREATE_END - $CREATE_START + $TOTAL_CREATE_RUNTIME" | bc -l )
    i=$(($i + 1))
done
echo "Ended VM creation"

# This sleep makes it so that the first iteration of lxc stop on the following loop gets stuck
# sleep 20

echo "Started VM shutdown"
i=0
while [ "$i" -lt "$NUM_VMS" ]; do
    SHUTDOWN_START=`date +%s.%N`
    lxc stop $UBUNTU_VM_NAME-$i --verbose
    SHUTDOWN_END=`date +%s.%N`
    TOTAL_SHUTDOWN_RUNTIME=$( echo "$SHUTDOWN_END - $SHUTDOWN_START + $TOTAL_SHUTDOWN_RUNTIME" | bc -l )
    i=$(($i + 1))
done
echo $TOTAL_SHUTDOWN_RUNTIME
echo "Ended VM shutdown"

echo "Started VM boot"
i=0
while [ "$i" -lt "$NUM_VMS" ]; do
    BOOT_START=`date +%s.%N`
    lxc start $UBUNTU_VM_NAME-$i
    BOOT_END=`date +%s.%N`
    TOTAL_BOOT_RUNTIME=$( echo "$BOOT_END - $BOOT_START  + $TOTAL_BOOT_RUNTIME" | bc -l )
    i=$(($i + 1))
done
echo "Ended VM boot"

echo "Started VM reboot"
i=0
while [ "$i" -lt "$NUM_VMS" ]; do
    REBOOT_START=`date +%s.%N`
    lxc restart $UBUNTU_VM_NAME-$i
    REBOOT_END=`date +%s.%N`
    TOTAL_REBOOT_RUNTIME=$( echo "$REBOOT_END - $REBOOT_START  + $TOTAL_REBOOT_RUNTIME" | bc -l )
    i=$(($i + 1))
done
echo "Ended VM reboot"

i=0
while [ "$i" -lt "$NUM_VMS" ]; do
    lxc stop $UBUNTU_VM_NAME-$i
    i=$(($i + 1))
done

echo "Started VM deletion"
i=0
while [ "$i" -lt "$NUM_VMS" ]; do
    DELETE_START=`date +%s.%N`
    lxc delete $UBUNTU_VM_NAME-$i
    DELETE_END=`date +%s.%N`
    TOTAL_DELETE_RUNTIME=$( echo "$DELETE_END - $DELETE_START  + $TOTAL_DELETE_RUNTIME" | bc -l )
    i=$(($i + 1))
done
echo "Ended VM deletion"

I noticed that when lxc stop is issued right after lxc launch or lxc start, LXD sometimes hangs on the stop operation until I press Ctrl+C.

I tried adding an artificial wait with sleep. However, when the sleep is placed before the stop loop, the first iteration always hangs on lxc stop until I interrupt it manually, after which the remaining loops run with no issues.

Note that this issue only occurs with VM instances. I use the same script for containers, and there the sleep before the stop loop does not affect execution.

I am running LXD 4.8, though I have also tested on a machine that’s tracking the latest/edge channel and the same problem occurs.

Any help is greatly appreciated.

After some more testing, I have noticed that when the sleep duration is long enough, all the stop operations execute without any issues.
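Rather than guessing a sleep duration, one option might be to poll until the guest responds. The following is a minimal sketch, not a confirmed fix: wait_until_ready is a hypothetical helper, its defaults are arbitrary, and it assumes that a successful lxc exec is a good enough sign that the instance can handle a shutdown request. For VMs, lxc exec only works once the agent inside the guest is running; for containers it can succeed earlier, so a stricter probe command may be needed there.

```shell
#!/bin/bash
# Hypothetical helper: poll until the guest answers a trivial command.
# Usage: wait_until_ready <instance> [max_tries] [delay_seconds]
wait_until_ready() {
    local name="$1"
    local max_tries="${2:-60}"   # illustrative default, not a recommendation
    local delay="${3:-1}"        # seconds to wait between attempts
    local try=0

    while [ "$try" -lt "$max_tries" ]; do
        # Assumption: exec succeeding means the guest has booted far enough
        # to process a shutdown request.
        if lxc exec "$name" -- true >/dev/null 2>&1; then
            return 0
        fi
        try=$((try + 1))
        sleep "$delay"
    done

    echo "$name: not ready after $max_tries attempts" >&2
    return 1
}
```

In the script above, calling wait_until_ready "$UBUNTU_VM_NAME-$i" before each lxc stop would take the place of the fixed sleep 20.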

It seems that when a stop operation is issued during VM initialization, it waits for an indefinite amount of time until the operation is manually stopped.

However, when I run the following two commands in sequence:

#!/bin/bash
lxc launch ubuntu:20.04 vm-test --vm
lxc stop vm-test

the VM is stopped with no issues.

When doing the same for containers:

#!/bin/bash
lxc launch ubuntu:20.04 container-test
lxc stop container-test

lxc stop hangs, this time for the container.

Do you see the same when doing lxc stop -f <instance>?

It may be that your container is not responding to the shutdown request. The -f flag forces it to stop.

lxc stop only sends a signal to the init system, it does not force the container to stop.
If the init system isn’t yet ready to process that signal, the container keeps running.

Use lxc stop --force if you want to ensure the container dies regardless of what the init system thinks (can cause data loss on normal workloads though).
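To avoid an indefinite hang while still preferring a clean shutdown, a stop with a time limit could be tried first, with --force only as a fallback. This is a minimal sketch, assuming lxc stop exits non-zero when --timeout expires; stop_gracefully and its 60-second default are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: try a clean shutdown first, force only if that fails.
# Usage: stop_gracefully <instance> [timeout_seconds]
stop_gracefully() {
    local name="$1"
    local timeout="${2:-60}"   # arbitrary default chosen for illustration

    # --timeout makes lxc stop give up instead of waiting indefinitely.
    if lxc stop "$name" --timeout "$timeout"; then
        echo "$name: stopped cleanly"
    else
        # Assumption: a non-zero exit means the clean stop failed or timed out.
        echo "$name: clean stop failed or timed out, forcing" >&2
        lxc stop "$name" --force
    fi
}
```

For benchmarking, note that any forced stops would need to be counted separately, since they do not measure a normal shutdown.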

Also, you may want to look at lxd.benchmark which does pretty much exactly what you’re describing and comes with the LXD snap.

Force does work. However, I was trying to simulate a normal instance shutdown, which seems to be what triggers the issue.

I was unaware of the benchmark tool. I will look into it further. Thank you!

It seems lxd.benchmark only works for container images, so I will change my script to better fit a VM test case.