[LXD] Online VM live-migration (QEMU to QEMU)

tomp · March 9, 2023, 11:46am


Project	LXD
Status	Implemented
Author(s)	@tomp
Approver(s)	@stgraber
Release	5.13
Internal ID	LX038

Abstract

Extend the migration API to allow for streaming the instance state directly to the target QEMU, eliminating the need to dump to disk first.

Rationale

LXD currently supports (limited) container live migration by integrating support for CRIU. CRIU supports performing several iterations of pre-dumping and syncing the container’s running state before finally freezing the container and transferring the final state to the target machine. The transfer of the state data is done by way of a dedicated websocket between source and target machines, and is independent of the container’s root filesystem volume.

In comparison LXD’s current support for VM live migration is quite different. Instead of using the dedicated state websocket between source and target machines, it instead performs a stateful stop of the VM (which writes the VM’s state to a file on the VM’s config volume), performs an offline migration and finally re-starts the VM statefully (restoring the running state).

This approach has several drawbacks:

No support for iterative transfer of state before the source VM is finally frozen and stopped.
Unnecessary disk I/O and storage requirements on both source and target when saving the VM state to a file.
VM spends longer amount of time unavailable as the state file is transferred and restored.

The idea with QEMU to QEMU VM live migration is to allow LXD to perform an initial sync of the root disk, then use the existing LXD migration state websocket to iteratively transfer the source QEMU process state to the target QEMU process before performing a final sync of the root disk and resuming the VM on the target machine.

This will avoid writing the VM’s state to a file and transferring it after the VM has been statefully stopped.

Specification

Design

Prerequisite work

Before the live QEMU to QEMU functionality was added there is some prerequisite work that is required to eliminate some technical debt that exists in the migration subsystem.

Get container live migration working again (CRIU)

In order to avoid regressing container live migration (using CRIU) when making changes to the migration subsystem to accommodate VM live migration we first needed to get container live migration working again (albeit it only in very restrictive scenarios as supported by CRIU). But at least this was sufficient to ensure that LXD’s use of CRIU was correct during migrations.

https://github.com/lxc/lxd/pull/11353

Instance driver push down

Currently the instance migration logic is not contained within the instance type drivers. Instead it is implemented in a single place with instance type specific logic intermingled with the common logic.
In order to extend the migration protocol to accommodate QEMU to QEMU state transfers we need to first “push down” the existing implementation into the individual instance drivers.

This will make it much easier (and safer) to reason about and make changes to the VM specific migration logic without affecting the container migration logic.

This work has been done in the following pull requests:

https://github.com/lxc/lxd/pull/11414
https://github.com/lxc/lxd/pull/11440

Intra-cluster member move functionality

Additionally the existing instance move within cluster members functionality (which is implemented in the instancePostClusteringMigrate and instancePostClusteringMigrateWithCeph functions) does not fully support container stateful migration properly (i.e using multi-stage state pre-dump).

Instead the current limited implementation is very similar to how VMs are currently statefully migrated:

Statefully stop the source instance.
Copy of source to target (potentially using temporary random instance name to avoid conflicts).
Deleting source instance.
Renaming temporary random instance to original name of source instance.
Performing a stateful start of moved instance.

This isn’t going to be sufficient to achieve QEMU to QEMU live VM migration between cluster members, as performing a stateful stop will result in the contents of the VM’s memory being written to the instance’s storage volume, and then copied. Rather than transferring the VM’s memory directly to the target.

In order to do that we need to instantiate the instance on the target before deleting it on the source (and have both running concurrently). And we need to do this using a single instance DB record.

So before we can add VM live migration to clusters we need to finish off live migration support for containers (and in doing so add the necessary plumbing to accommodate VM live migration).

This support has been added in the following pull-request. It updates instancePostClusteringMigrate to use the migration API for intra-cluster instance moves. It uses the cluster notification HTTP header hint to dest.CreateInstance() to allow the target cluster member to alter its behaviour when accepting the instance being migrated (such as not creating a new instance DB record).

https://github.com/lxc/lxd/pull/11457

Push down intra-cluster ceph instance moves into migration subsystem

In order to support VM live migration when moving instances that are backed by a remote shared ceph storage pool between cluster members we needed to have the VM’s state transmitted between cluster members without having to first saving the state to the ceph disk and then restoring it on the target.

The ceph instance intra-cluster move logic was implemented separately from the main migration subsystem so in order to re-use the live migration logic (that would come) we needed to first get ceph cluster member moves using the migration subsystem.

This required adding a way to get a “hint” to both source and target members that the migration being performed was an internal cluster member move and not a remote->remote migration. This was especially important for ceph instances because of the nature of the shared storage (and thus shared volume DB records).

https://github.com/lxc/lxd/pull/11494
https://github.com/lxc/lxd/pull/11519

Adding support for post-negotiation state connection establishment

Current versions of LXD expect all required sockets to be connected before negotiation of migration functionality starts. This is a problem because unlike container live migration (with CRIU) VM live migration is going to be storage driver dependent, and so we need to add support for establishing the state connection a after the feature negotiation step has taken place.

Thankfully current versions of LXD only require a state connection if there has been a state secret sent or if req.Live is true and the instance type is a container.

This meant it was possible to rework the migration connection management approach to allow for on-demand connection establishment, which for VM live migration could be delayed until after the negotiation was finished.

Care was taken to preserve the ordering of connection establishment before the negotiation for container and non-live VM migration so that backward compatibility
was maintained.

https://github.com/lxc/lxd/pull/11529

Live migration work

Device volatile keys during live same-name intra-cluster instance migrations on ceph

A problem that has been encountered when adding live migration support was to do with how device volatile config keys were set/unset during instance start and stop.

The problem is that when performing a VM live migration, QEMU must be running on both the source and target cluster members at the same time. This necessitates that the instance’s devices have been started on both members also. When the instance is started on the target, the devices will record their new volatile settings into the database. But this means that when the onStopHook runs for the source instance the settings it loads from the database may now not accurately reflect the state on the source cluster member (as they are the settings used on the target member). This results in device stop cleanup not occurring correctly.

Equally problematic, when the source instance devices stop they clear the volatile config from the database, which means when the instance on the target is stopped at some point in the future, it too will not properly clear up its devices, as its volatile settings have been cleared.

The solution I ended up with was:

Pass the cluster same-name hint to the migration source (this required refactoring how intra-cluster migration worked so that LXD doesn’t use a loopback API request when setting up the migration source).
Add support for disabling persisting volatile config changes to the DB by way of an instance driver level variable that changes how VolatileSet function works.
Storing a temporary reference to the old in-memory instance on the source so that its used by the onStop hook when the source instance stops, which allows the instance drivers to access the old volatile settings and perform correct cleanup.

This is included in this PR, which also added support for VM intra-cluster instance live QEMU to QEMU moves when using a shared ceph storage pool.

https://github.com/lxc/lxd/pull/11451

Add live migration support for non-shared storage

When migrating a VM between hosts that don’t use shared storage (i.e not an intra-cluster member move when using ceph storage) we need to support synchronizing the root disk whilst the VM guest is still running (or at least minimize the amount of time the guest is interrupted).

To achieve this the NBD server built in to QEMU will be used to export the target disk so that QEMU can synchronize to it over the existing LXD filesystem websocket.

The process to achieve this and to minimize guest interruption is as follows:
It is based primarily on an example from https://kashyapc.fedorapeople.org/virt/qemu/Live-Migration-drive-mirror+NBD-pause-before-switchover.txt, although it has been adapted to use a QCOW2 snapshot in order to use LXD’s existing storage migration functionality to transfer the bulk of the existing root disk whilst the guest is still running. It has also been adapted so it doesn’t use any deprecated QMP features.

On the LXD source host:

Set the pause-before-switchover migration capability. This allows the migration to be paused after the source QEMU releases the block devices but before the serialization of the device state, to avoid a race condition betweenmigration and blockdev-mirror.
Create a temporary QCOW2 file to be used as a CoW snapshot disk to store writes that occur inside the guest during the time that the storage driver is performing the initial transfer of the VM’s root disk. The maximum size of the QCOW2 file will be set to the size of the instance’s root disk, but the full size is not allocated immediately and it grows as needed. This will be created in the source instance’s config drive (the size of which is limited using the size.state property on the instance’s root disk). It will be deleted as soon as a file handle is opened to it and passed to QEMU to avoid this file being copied to the target as part of the storage driver migration process.
Create a snapshot of the main root disk, with CoW changes being written into the QCOW2 file. At this point writes to the main root disk should have stopped. But the QEMU guest is still running.
Perform a normal storage driver transfer of the instance to the target host. Because no writes should be happening to the root disk this should be consistent from the point of time the snapshot was taken.
Add the NBD target disk as a block node (but not a device that appears in the instance) to the source QEMU process. LXD will expect the target to have started QEMU and have exported the target disk by NBD by this point. We will use the existing filesystem websocket connection to join the NBD client on the source to the NBD server on the target.
Perform a block device mirror of just the snapshot to the target NBD disk (so just the changes since the snapshot was taken). The guest is still running at this time, but writes are being mirrored in write-blocking mode to both sides so write I/O operations slow down. Once the drives are in sync the task becomes ready.
Once the block device mirror is in ready state, LXD will start the VM state transfer over the state websocket, which pauses the VM guest.
Once the state transfer has completed the source VM enters pre-switchover state, where the VM guest is still paused.
At this point LXD will complete the block device mirror (see 1. ), and then continue the state migration until it is in completed state.

On the LXD target host the process is a lot simpler:

Create the target instance and volume DB records (the same as offline migration).
Receive the storage driver filesystem migration (the same as offline migration).
Start the QEMU process in migrate incoming mode (paused).
Start the NBD server exporting the instance’s root disk for writing.
Wait until incoming migration state has been received and start.

This is implemented in:
https://github.com/lxc/lxd/pull/11540

Add support for live VM cluster member moves to `ovn` NIC device:

When performing a live QEMU to QEMU cluster member move of a VM with an ovn NIC device the network connectivity was broken.

This is because:

The instance’s logical switch port was being removed when the instance was stopped on the source after migration completed.
The logical switch port was not having its requested-chassis option set to the new chassis.

This was fixed by:

Storing the LXD server name in the external-ids field on device Start() so that it can be compared on device Stop() and if its no-longer the local server name then logical switch port cleanup is skipped, as it indicates this port has been started on another host.
Setting the request-chassis port option in the device’s post-start hook so that once the migration has finished the logical switch port is transferred to the new target chassis.

https://github.com/lxc/lxd/pull/11573

API changes

This change will add a migration_vm_live API extension so it is possible to detect the availability of this option.

Although not CRIU related, the existing migration protocol has a field called CRIUType which indicates the method by which instance state will be migrated. We will reuse this field for indicating that live VM QEMU migration is supported by add a new migration type called CRIUType_VM_QEMU with a value of 3.

CLI changes

No CLI changes are expected.

Database changes

No database changes are expected.

Upgrade handling

The existing VM stateful migration technique will need to remain supported so that VMs can be migrated to/from older LXD installations that do not have support for QEMU to QEMU live migrations. So we will need to extend the migration protocol to negotiate support for true live migration.

Further information

There are further developments I would like to explore/change when/if time allows for it:

Add NIC ARP/NDP announcements on target so that traffic is redirected to new host immediately after migration completes. There appears to be some support for this in QEMU already although not clear if it covers IPv6 NDP as well as IPv4 ARP announcements. See announce-self and announce* in MigrationParameter.
Explore the benefits of using some of the migration capabilities specifically interesting looking ones include: zero-blocks, compress, auto-converge, validate-uuid, zero-copy-send. Also background-snapshot may be of interest for use with stateful snapshots (as well as some of the other capabilities listed here). See also the compress* settings in MigrationParameter
Investigate the multi-chassis ovn NIC features in OVN 22.06 to reduce live migration activation time https://www.youtube.com/watch?v=ijZTMXAg-eI

Ibragim_Ganizade · March 29, 2023, 11:44am

Very cool that Canonical is actively working on a live migration to LXD for both VM and CT!

I wonder how you can enable CRIU without using Snap?
snap set lxd criu.enable=true

what if I build LXD from source?

to be honest i use alpine linux with criu installed (apk add criu)

Regards.