Project | LXD |
Status | Draft |
Author(s) | @tomp |
Approver(s) | @stgraber |
Release | First release with the feature |
Internal ID | LX038 |
Abstract
Extend the migration API to allow for streaming the instance state directly to the target QEMU, eliminating the need to dump to disk first.
Rationale
LXD currently supports (limited) container live migration by integrating support for CRIU. CRIU supports performing several iterations of pre-dumping and syncing the container’s running state before finally freezing the container and transferring the final state to the target machine. The transfer of the state data is done by way of a dedicated websocket between source and target machines, and is independent of the container’s root filesystem volume.
In comparison, LXD’s current support for VM live migration is quite different. Rather than using the dedicated state websocket between source and target machines, it performs a stateful stop of the VM (which writes the VM’s state to a file on the VM’s config volume), performs an offline migration and finally restarts the VM statefully (restoring the running state).
This approach has several drawbacks:
- No support for iterative transfer of state before the source VM is finally frozen and stopped.
- Unnecessary disk I/O and storage requirements on both source and target when saving the VM state to a file.
- The VM is unavailable for longer, because the state file has to be transferred and restored before it can resume.
The idea with QEMU to QEMU VM live migration is to allow LXD to perform an initial sync of the root disk, then use the existing LXD migration state websocket to iteratively transfer the source QEMU process state to the target QEMU process before performing a final sync of the root disk and resuming the VM on the target machine.
This will avoid writing the VM’s state to a file and transferring it after the VM has been statefully stopped.
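The proposed flow can be sketched as an ordered list of phases plus a toy model of the iterative state transfer. This is a minimal illustration, not LXD or QEMU code: the `Phase` type, `migrationPlan`, `stateRounds`, and the dirty-page model (a fixed percentage of transferred pages is dirtied again during each round) are all assumptions made for the example.

```go
package main

import "fmt"

// Phase names one step of the proposed flow (hypothetical type).
type Phase string

// migrationPlan returns the phases in the order described in the text.
func migrationPlan() []Phase {
	return []Phase{
		"initial root disk sync",
		"iterative QEMU state transfer over the state websocket",
		"final root disk sync",
		"resume VM on target",
	}
}

// stateRounds models iterative state transfer. Each round sends the pages
// that are currently outstanding while the VM keeps running and dirties
// dirtyPercent of them again. Once the outstanding set is at or below
// pauseThreshold (or maxRounds is reached), the VM is paused and the
// remainder is sent in one final round. It returns the pages sent per round.
func stateRounds(totalPages, dirtyPercent, pauseThreshold, maxRounds int) []int {
	rounds := []int{}
	remaining := totalPages

	for i := 0; i < maxRounds && remaining > pauseThreshold; i++ {
		rounds = append(rounds, remaining)
		remaining = remaining * dirtyPercent / 100
	}

	// Final round, sent while the source VM is paused.
	return append(rounds, remaining)
}

func main() {
	for i, p := range migrationPlan() {
		fmt.Printf("%d. %s\n", i+1, p)
	}

	fmt.Println("pages sent per round:", stateRounds(1000, 10, 20, 5))
}
```

In this model only the last, smallest round is sent while the VM is paused, which is why iterative transfer shortens the time the VM is unavailable.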
Specification
Design
Prerequisite work
Instance driver push down:
Currently the instance migration logic is not contained within the instance type drivers. Instead it is implemented in a single place with instance type specific logic intermingled with the common logic.
In order to extend the migration protocol to accommodate QEMU to QEMU state transfers we need to first “push down” the existing implementation into the individual instance drivers.
This will make it much easier (and safer) to reason about and make changes to the VM specific migration logic without affecting the container migration logic.
This work has been done in the following pull requests:
Intra-cluster member move functionality:
Additionally, the existing functionality for moving instances between cluster members (implemented in the instancePostClusteringMigrate and instancePostClusteringMigrateWithCeph functions) does not fully support container stateful migration (i.e. using multi-stage state pre-dump).
Instead the current limited implementation is very similar to how VMs are currently statefully migrated:
- Statefully stop the source instance.
- Copy the source instance to the target (potentially using a temporary random instance name to avoid conflicts).
- Delete the source instance.
- Rename the temporary random instance to the original name of the source instance.
- Perform a stateful start of the moved instance.
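The current move sequence listed above can be sketched with in-memory stand-ins. The `instance` and `member` types and the `moveStateful` function are hypothetical names used for illustration only; they are not the actual LXD functions.

```go
package main

import (
	"errors"
	"fmt"
)

// instance is a toy stand-in for an instance on a cluster member.
type instance struct {
	name        string
	running     bool
	stateOnDisk bool // true once the running state has been written to the volume
}

// member is a toy stand-in for a cluster member.
type member struct {
	instances map[string]*instance
}

// moveStateful follows the five steps of the current implementation.
func moveStateful(src, dst *member, name, tmpName string) error {
	inst, ok := src.instances[name]
	if !ok {
		return errors.New("instance not found on source")
	}

	// 1. Statefully stop the source: the state is dumped to the volume.
	inst.running = false
	inst.stateOnDisk = true

	// 2. Copy source to target under a temporary name (state file included).
	cp := *inst
	cp.name = tmpName
	dst.instances[tmpName] = &cp

	// 3. Delete the source instance.
	delete(src.instances, name)

	// 4. Rename the temporary instance to the original name.
	delete(dst.instances, tmpName)
	cp.name = name
	dst.instances[name] = &cp

	// 5. Statefully start the moved instance, consuming the state file.
	cp.running = true
	cp.stateOnDisk = false

	return nil
}

func main() {
	src := &member{instances: map[string]*instance{"vm1": &instance{name: "vm1", running: true}}}
	dst := &member{instances: map[string]*instance{}}

	if err := moveStateful(src, dst, "vm1", "tmp-move"); err != nil {
		fmt.Println("move failed:", err)
		return
	}

	fmt.Println("running on target:", dst.instances["vm1"].running)
}
```

Note that the instance is stopped from step 1 until step 5 completes, and that the state only ever travels as part of the volume copy.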
This isn’t going to be sufficient to achieve QEMU to QEMU live VM migration between cluster members, because performing a stateful stop results in the contents of the VM’s memory being written to the instance’s storage volume and then copied, rather than being transferred directly to the target.
To transfer the memory directly, we need to instantiate the instance on the target before deleting it on the source (and have both running concurrently), and we need to do this using a single instance DB record.
So before we can add VM live migration to clusters we need to finish off live migration support for containers (and in doing so add the necessary plumbing to accommodate VM live migration).
This support has been added in the following pull request. It updates instancePostClusteringMigrate to use the migration API for intra-cluster instance moves. It passes the cluster notification HTTP header hint to dest.CreateInstance(), which allows the target cluster member to alter its behaviour when accepting the instance being migrated (such as not creating a new instance DB record).
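As a rough illustration of how such a hint can change the target's behaviour, the sketch below skips record creation when the request is flagged as an intra-cluster same-name move. The `instanceDB` type, the `acceptMigration` function and its boolean parameter are hypothetical; they do not reflect the real signature of dest.CreateInstance() or how LXD reads the notification header.

```go
package main

import (
	"errors"
	"fmt"
)

// instanceDB is a stand-in for the cluster-wide instance table
// (instance name -> cluster member). Hypothetical type.
type instanceDB map[string]string

// acceptMigration models the target accepting an incoming instance.
// When sameNameClusterMove is set (the hint), the existing record is reused
// and left untouched; otherwise a new record is created.
func acceptMigration(db instanceDB, name, targetMember string, sameNameClusterMove bool) error {
	_, exists := db[name]

	if sameNameClusterMove {
		if !exists {
			return errors.New("intra-cluster move expects an existing instance record")
		}

		// Reuse the single existing record: nothing is created here.
		return nil
	}

	if exists {
		return fmt.Errorf("instance %q already exists", name)
	}

	db[name] = targetMember

	return nil
}

func main() {
	db := instanceDB{"vm1": "member1"}

	// Intra-cluster move: accepted without creating a second record.
	fmt.Println(acceptMigration(db, "vm1", "member2", true), len(db))

	// Ordinary migration with a conflicting name: rejected.
	fmt.Println(acceptMigration(db, "vm1", "member2", false))
}
```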
Device volatile keys during live same-name intra-cluster instance migrations
A problem encountered while adding live migration support concerns how device volatile config keys are set and unset during instance start and stop.
The problem is that when performing a VM live migration, QEMU must be running on both the source and target cluster members at the same time. This requires the instance’s devices to be started on both members as well. When the instance is started on the target, the devices record their new volatile settings in the database. This means that when the onStopHook runs for the source instance, the settings it loads from the database may no longer reflect the state on the source cluster member (as they are the settings used on the target member). This results in device stop cleanup not occurring correctly.
Equally problematic, when the source instance’s devices stop they clear the volatile config from the database. When the instance on the target is stopped at some point in the future, it too will fail to clean up its devices properly, because its volatile settings have already been cleared.
The solution I ended up with was:
- Pass the cluster same-name hint to the migration source (this required refactoring how intra-cluster migration worked so that LXD doesn’t use a loopback API request when setting up the migration source).
- Add support for disabling the persisting of volatile config changes to the DB, by way of an instance driver level variable that changes how the VolatileSet function works.
- Store a temporary reference to the old in-memory instance on the source so that it is used by the onStop hook when the source instance stops, which allows the instance drivers to access the old volatile settings and perform correct cleanup.
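The effect of these changes can be sketched with a toy driver. Everything here is an assumption for illustration: the `driver` struct, its `persistVolatile` field, the simplified `VolatileSet` and `onStop` methods, and the example key and interface names are not LXD's actual implementation.

```go
package main

import "fmt"

// key is an example device volatile key used for the demonstration.
const key = "volatile.eth0.host_name"

// driver is a toy instance driver. db is the record shared by all cluster
// members; volatile is this member's in-memory view of the instance.
type driver struct {
	db              map[string]string
	volatile        map[string]string
	persistVolatile bool // when false, changes stay in memory only
}

// VolatileSet applies changes in memory and, if enabled, to the shared DB.
// An empty value removes the key.
func (d *driver) VolatileSet(changes map[string]string) {
	for k, v := range changes {
		if v == "" {
			delete(d.volatile, k)
			if d.persistVolatile {
				delete(d.db, k)
			}

			continue
		}

		d.volatile[k] = v
		if d.persistVolatile {
			d.db[k] = v
		}
	}
}

// onStop cleans up using the in-memory settings and returns the host
// interface it removed.
func (d *driver) onStop() string {
	host := d.volatile[key]
	d.VolatileSet(map[string]string{key: ""})

	return host
}

// simulate runs a same-name move and returns what the source cleaned up
// and what the shared DB holds afterwards.
func simulate() (string, string) {
	db := map[string]string{key: "vethA"}
	source := &driver{db: db, volatile: map[string]string{key: "vethA"}, persistVolatile: true}

	// Migration starts: the source keeps its old in-memory view and stops
	// persisting volatile changes.
	source.persistVolatile = false

	// The target starts its devices and records its own settings.
	target := &driver{db: db, volatile: map[string]string{}, persistVolatile: true}
	target.VolatileSet(map[string]string{key: "vethB"})

	// The source stops: it cleans up its own device, not the target's.
	return source.onStop(), db[key]
}

func main() {
	cleaned, inDB := simulate()
	fmt.Println("source cleaned up:", cleaned)
	fmt.Println("DB still holds:", inDB)
}
```

With persistence left enabled on the source, the final stop would instead delete the target's value from the shared record, which is the second failure described above.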
API changes
If applicable, any change to the user accessible API that will be made.
This may cover REST API changes, changes to public symbols in a library, additional exported functions in Go client packages, …
CLI changes
No CLI changes are expected.
Database changes
No database changes are expected.
Upgrade handling
The existing VM stateful migration technique will need to remain supported so that VMs can be migrated to and from older LXD installations that do not support QEMU to QEMU live migration. We will therefore need to extend the migration protocol to negotiate (on a per-storage-driver basis) support for true live migration.
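One way to picture this negotiation is sketched below: live migration is used only when both ends advertise it for the storage driver in use, and otherwise the existing stateful stop technique is the fallback. The `Mode` and `peer` types, the `negotiate` function and the driver names are hypothetical; the real protocol fields are not specified here.

```go
package main

import "fmt"

// Mode is the migration technique agreed between source and target.
type Mode string

const (
	ModeLive         Mode = "qemu-to-qemu"
	ModeStatefulStop Mode = "stateful-stop"
)

// peer describes what one side advertises. liveDrivers lists the storage
// drivers for which that side supports live migration. An older
// installation advertises nothing, so the map is nil.
type peer struct {
	liveDrivers map[string]bool
}

// negotiate picks live migration only when both sides support it for the
// given storage driver. Reading a nil map yields false, so older peers
// naturally fall back to the stateful stop technique.
func negotiate(src, dst peer, driver string) Mode {
	if src.liveDrivers[driver] && dst.liveDrivers[driver] {
		return ModeLive
	}

	return ModeStatefulStop
}

func main() {
	newer := peer{liveDrivers: map[string]bool{"zfs": true, "ceph": true}}
	older := peer{}

	fmt.Println(negotiate(newer, newer, "zfs"))
	fmt.Println(negotiate(newer, older, "zfs"))
}
```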
Further information
Additional information on the particular design decisions made in the document, alternative designs that may have been considered as well as links to additional information.