Live Migration Failure Due To Data Corruption

Hello,

I am attempting to migrate virtual machines between hosts in an Incus cluster. To enable live migration, I have set the “migration.stateful” config key to true and set a “size.state” equal to or larger than the memory size on each VM.
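For reference, this is roughly how I applied those settings (a sketch, assuming a VM named blah with 1GiB of memory):

```shell
# Enable stateful (live) migration for the VM
incus config set blah migration.stateful=true

# Size the state volume on the root disk device at least as
# large as the VM's configured memory
incus config device set blah root size.state=1GiB
```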

$ incus config show blah
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Debian bookworm amd64 (20250108_05:24)
  image.os: Debian
  image.release: bookworm
  image.serial: "20250108_05:24"
  image.type: disk-kvm.img
  image.variant: cloud
  migration.stateful: "true"
  security.secureboot: "false"
  volatile.apply_template: create
  volatile.base_image: 8598d695d7e481e6f97129a7a608aff72958fac6b0cbf866cb483de8ddb6bafc
  volatile.cloud-init.instance-id: b5daa185-8460-4baf-9d3b-2ed9eafe6b63
  volatile.eth0.hwaddr: 00:16:3e:7f:47:32
  volatile.uuid: fef4cfdd-478d-4ac0-b80a-88b843115138
  volatile.uuid.generation: fef4cfdd-478d-4ac0-b80a-88b843115138
devices:
  root:
    path: /
    pool: local
    size.state: 1GiB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

When attempting a live migration, I receive the following error:

$ incus move blah --target redux
Error: Migration operation failure: Instance move to destination failed on source: Failed migration on source: Error from migration control target: Failed to run: forklimits limit=memlock:unlimited:unlimited fd=3 fd=4 -- /opt/incus/bin/qemu-system-x86_64 -S -name blah -uuid 925ad85e-9c6f-46a9-b486-d9c2b1941556 -daemonize -cpu kvm64,tsc_scale,perfctr_core,rdseed,smap,sse4a,smep,clflushopt,clwb,flushbyasid,pdpe1gb,rdtscp,rdrand,popcnt,osvw,svm_lock,vmcb_clean,sse4_2,3dnowprefetch,stibp,bmi1,misalignsse,movbe,lahf_lm,xsaveopt,svm,extapic,wdt,tce,avx2,avic,ssse3,ibpb,xgetbv1,xsaveerptr,mmxext,ibrs,clzero,vgif,cmp_legacy,ssbd,pclmulqdq,fma,topoext,bmi2,adx,umip,npt,ht,skinit,rdpid,arat,f16c,xsaves,lbrv,decodeassists,perfctr_nb,fsgsbase,pfthreshold,monitor,xsave,avx,aes,xsavec,nrip_save,sse4_1,abm,fxsr_opt,ibs,wbnoinvd -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /run/incus/blah/qemu.conf -spice unix=on,disable-ticketing=on,addr=/run/incus/blah/qemu.spice -pidfile /run/incus/blah/qemu.pid -D /var/log/incus/blah/qemu.log -incoming defer -smbios type=2,manufacturer=LinuxContainers,product=Incus: : exit status 1

Within the system logs of the target host I see the following errors:

Jan 08 13:42:43 redux zed[27824]: eid=136 class=data pool='incus' priority=0 err=5 flags=0x180 bookmark=444:1:0:0
Jan 08 13:42:43 redux kernel: Buffer I/O error on dev zd0, logical block 0, async page read
Jan 08 13:42:43 redux zed[27835]: eid=137 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=444:1:0:0
Jan 08 13:42:43 redux kernel: Buffer I/O error on dev zd0, logical block 0, async page read
Jan 08 13:42:43 redux kernel: Buffer I/O error on dev zd0, logical block 1, async page read
Jan 08 13:42:43 redux zed[27838]: eid=138 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=444:1:0:0
Jan 08 13:42:43 redux zed[27843]: eid=139 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=444:1:0:0
Jan 08 13:42:43 redux zed[27849]: eid=140 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=444:1:0:0
Jan 08 13:42:43 redux zed[27853]: eid=141 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=444:1:0:0
Jan 08 13:42:43 redux (udev-worker)[27826]: zd0: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/zd0' failed with exit code 1.
Jan 08 13:42:43 redux systemd-networkd[1208]: tap2ebb8fe1: Link UP
Jan 08 13:42:43 redux kernel: br500: port 2(tap2ebb8fe1) entered blocking state
Jan 08 13:42:43 redux kernel: br500: port 2(tap2ebb8fe1) entered disabled state
Jan 08 13:42:43 redux kernel: tap2ebb8fe1: entered allmulticast mode
Jan 08 13:42:43 redux kernel: tap2ebb8fe1: entered promiscuous mode
Jan 08 13:42:43 redux zed[27886]: eid=144 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=428:131:0:0
Jan 08 13:42:43 redux systemd[1]: var-lib-incus-devices-blah-config.mount.mount: Deactivated successfully.
Jan 08 13:42:43 redux systemd-networkd[1208]: tap2ebb8fe1: Link UP
Jan 08 13:42:43 redux kernel: tap2ebb8fe1: left allmulticast mode
Jan 08 13:42:43 redux kernel: tap2ebb8fe1: left promiscuous mode
Jan 08 13:42:43 redux kernel: br500: port 2(tap2ebb8fe1) entered disabled state
Jan 08 13:42:43 redux systemd-networkd[1208]: tap2ebb8fe1: Link DOWN
Jan 08 13:42:43 redux systemd[1]: var-lib-incus-storage\x2dpools-local-virtual\x2dmachines-blah.mount: Deactivated successfully.
Jan 08 13:42:44 redux incusd[1903]: time="2025-01-08T13:42:44Z" level=error msg="Failed migration on target" clusterMoveSourceName=blah err="Failed to run: forklimits limit=memlock:unlimited:unlimited fd=3 fd=4 -- /opt/incus/bin/qemu-system-x86_64 -S -name blah -uuid 925ad85e-9c6f-46a9-b486-d9c2b1941556 -daemonize -cpu kvm64,tsc_scale,perfctr_core,rdseed,smap,sse4a,smep,clflushopt,clwb,flushbyasid,pdpe1gb,rdtscp,rdrand,popcnt,osvw,svm_lock,vmcb_clean,sse4_2,3dnowprefetch,stibp,bmi1,misalignsse,movbe,lahf_lm,xsaveopt,svm,extapic,wdt,tce,avx2,avic,ssse3,ibpb,xgetbv1,xsaveerptr,mmxext,ibrs,clzero,vgif,cmp_legacy,ssbd,pclmulqdq,fma,topoext,bmi2,adx,umip,npt,ht,skinit,rdpid,arat,f16c,xsaves,lbrv,decodeassists,perfctr_nb,fsgsbase,pfthreshold,monitor,xsave,avx,aes,xsavec,nrip_save,sse4_1,abm,fxsr_opt,ibs,wbnoinvd -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /run/incus/blah/qemu.conf -spice unix=on,disable-ticketing=on,addr=/run/incus/blah/qemu.spice -pidfile /run/incus/blah/qemu.pid -D /var/log/incus/blah/qemu.log -incoming defer -smbios type=2,manufacturer=LinuxContainers,product=Incus: : exit status 1" instance=blah live=true project=default push=false

So it seems the transferred ZFS dataset is corrupted and the migration fails. A zpool status reports no issues, likely because the dataset no longer exists: Incus removed it when the migration failed.

# zpool status
  pool: incus
 state: ONLINE
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Wed Jan  8 11:00:42 2025
config:

        NAME                                          STATE     READ WRITE CKSUM
        incus                                         ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            nvme-WDC_WDS200T2B0C-00PXH0_2114G2448408  ONLINE       0     0     0
            nvme-WD_Red_SN700_2000GB_21330M800791     ONLINE       0     0     0
          mirror-1                                    ONLINE       0     0     0
            nvme-WD_Red_SN700_2000GB_23202J800063     ONLINE       0     0     0
            nvme-WD_Red_SN700_2000GB_24370G800955     ONLINE       0     0     0

errors: No known data errors

However, migrating a VM in a stopped state (with and without the state configuration applied) works without issue.

Hosts are identical and use the incus, kernel, and zfs packages from the Zabbly repo. Details:

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.1 LTS
Release:        24.04
Codename:       noble
# uname -a
Linux noctua 6.12.8-zabbly+ #ubuntu24.04 SMP PREEMPT_DYNAMIC Fri Jan  3 17:01:02 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
# zfs version
zfs-2.2.7-1
# incus --version
6.8
# cat /proc/cpuinfo | grep "model name" | tail -1
model name      : AMD EPYC 7302 16-Core Processor

Does anybody have any ideas on how this data corruption issue can be resolved?

Given the error is about QEMU failing to start, I’d probably start by looking into that.
Can you look at /var/log/incus/ on the target for any log files related to the instance you’re trying to migrate?

Hopefully there will be a qemu.log giving us a bit more detail than just QEMU exiting with a non-zero value.

Yes, of course. Contents of the qemu.log file from the target host:

# cat blah/qemu.log
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.monitor [bit 3]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.extapic [bit 3]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.ibs [bit 10]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.skinit [bit 12]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.wdt [bit 13]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.tce [bit 17]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.perfctr-nb [bit 24]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.8000000AH:EDX.svm-lock [bit 2]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.8000000AH:EDX.decodeassists [bit 7]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.8000000AH:EDX.avic [bit 13]
qemu-system-x86_64: can't read pflash1 block backend for cfi.pflash01 device '/machine/system.flash1': Input/output error

Right, so that does show an input/output error, apparently when reading the QEMU firmware file…

Are both source and target servers on the same distro with the same version of Incus installed from the same package source?

Yes, source and target hosts are identical - same hardware, Ubuntu 24.04 with latest updates and using the incus, kernel, and zfs packages from the zabbly repo.

Very weird. Do you have the same issue if you create a new very simple VM and try to live-migrate that?

I’ve also tried the Alpine and Ubuntu images, but same result unfortunately; all were new, “pristine” VMs.

Okay, so that’s a cluster of at least two servers, running Ubuntu 24.04 with Incus 6.8 from Zabbly repo, using ZFS local storage and trying to live-migrate a VM between the two?

Correct

Finally figured out the issue: I had set the ZFS module parameter zfs_compressed_arc_enabled=0. Removing this (and allowing ARC compression) lets me live-migrate without issues. No idea why this causes a problem for this particular use case, but I thought I’d share in case anybody else hits the same one.
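In case it helps anyone checking their own hosts, here is a sketch of how to inspect and revert the parameter (the modprobe.d path is an assumption; yours may differ depending on where the override was set):

```shell
# Check the current value (1 = compressed ARC enabled, the default)
cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled

# Re-enable compressed ARC at runtime
echo 1 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled

# Then remove any persistent override, e.g. a line like this in
# /etc/modprobe.d/zfs.conf (hypothetical path):
#   options zfs zfs_compressed_arc_enabled=0
# and regenerate the initramfs if the module options are baked in:
#   update-initramfs -u
```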