Hello,
I am attempting to migrate virtual machines between hosts in an Incus cluster. To enable live migration I have set the “migration.stateful” configuration key to true and set a “size.state” on the root disk equal to or larger than each VM's memory size.
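For reference, the keys were set roughly as follows (using the test instance blah; a root disk inherited from a profile would first need incus config device override):
$ incus config set blah migration.stateful=true
$ incus config device set blah root size.state=1GiB
The resulting configuration: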
$ incus config show blah
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Debian bookworm amd64 (20250108_05:24)
  image.os: Debian
  image.release: bookworm
  image.serial: "20250108_05:24"
  image.type: disk-kvm.img
  image.variant: cloud
  migration.stateful: "true"
  security.secureboot: "false"
  volatile.apply_template: create
  volatile.base_image: 8598d695d7e481e6f97129a7a608aff72958fac6b0cbf866cb483de8ddb6bafc
  volatile.cloud-init.instance-id: b5daa185-8460-4baf-9d3b-2ed9eafe6b63
  volatile.eth0.hwaddr: 00:16:3e:7f:47:32
  volatile.uuid: fef4cfdd-478d-4ac0-b80a-88b843115138
  volatile.uuid.generation: fef4cfdd-478d-4ac0-b80a-88b843115138
devices:
  root:
    path: /
    pool: local
    size.state: 1GiB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
When I attempt a live migration, it fails with an error like the one below:
$ incus move blah --target redux
Error: Migration operation failure: Instance move to destination failed on source: Failed migration on source: Error from migration control target: Failed to run: forklimits limit=memlock:unlimited:unlimited fd=3 fd=4 -- /opt/incus/bin/qemu-system-x86_64 -S -name blah -uuid 925ad85e-9c6f-46a9-b486-d9c2b1941556 -daemonize -cpu kvm64,tsc_scale,perfctr_core,rdseed,smap,sse4a,smep,clflushopt,clwb,flushbyasid,pdpe1gb,rdtscp,rdrand,popcnt,osvw,svm_lock,vmcb_clean,sse4_2,3dnowprefetch,stibp,bmi1,misalignsse,movbe,lahf_lm,xsaveopt,svm,extapic,wdt,tce,avx2,avic,ssse3,ibpb,xgetbv1,xsaveerptr,mmxext,ibrs,clzero,vgif,cmp_legacy,ssbd,pclmulqdq,fma,topoext,bmi2,adx,umip,npt,ht,skinit,rdpid,arat,f16c,xsaves,lbrv,decodeassists,perfctr_nb,fsgsbase,pfthreshold,monitor,xsave,avx,aes,xsavec,nrip_save,sse4_1,abm,fxsr_opt,ibs,wbnoinvd -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /run/incus/blah/qemu.conf -spice unix=on,disable-ticketing=on,addr=/run/incus/blah/qemu.spice -pidfile /run/incus/blah/qemu.pid -D /var/log/incus/blah/qemu.log -incoming defer -smbios type=2,manufacturer=LinuxContainers,product=Incus: : exit status 1
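The failing command writes its log to /var/log/incus/blah/qemu.log on the target (the -D flag above), so the actual QEMU error should be recorded there, assuming the file survives the cleanup of the failed instance:
# cat /var/log/incus/blah/qemu.log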
Within the system logs of the target host I see the following errors:
Jan 08 13:42:43 redux zed[27824]: eid=136 class=data pool='incus' priority=0 err=5 flags=0x180 bookmark=444:1:0:0
Jan 08 13:42:43 redux kernel: Buffer I/O error on dev zd0, logical block 0, async page read
Jan 08 13:42:43 redux zed[27835]: eid=137 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=444:1:0:0
Jan 08 13:42:43 redux kernel: Buffer I/O error on dev zd0, logical block 0, async page read
Jan 08 13:42:43 redux kernel: Buffer I/O error on dev zd0, logical block 1, async page read
Jan 08 13:42:43 redux zed[27838]: eid=138 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=444:1:0:0
Jan 08 13:42:43 redux zed[27843]: eid=139 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=444:1:0:0
Jan 08 13:42:43 redux zed[27849]: eid=140 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=444:1:0:0
Jan 08 13:42:43 redux zed[27853]: eid=141 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=444:1:0:0
Jan 08 13:42:43 redux (udev-worker)[27826]: zd0: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/zd0' failed with exit code 1.
Jan 08 13:42:43 redux systemd-networkd[1208]: tap2ebb8fe1: Link UP
Jan 08 13:42:43 redux kernel: br500: port 2(tap2ebb8fe1) entered blocking state
Jan 08 13:42:43 redux kernel: br500: port 2(tap2ebb8fe1) entered disabled state
Jan 08 13:42:43 redux kernel: tap2ebb8fe1: entered allmulticast mode
Jan 08 13:42:43 redux kernel: tap2ebb8fe1: entered promiscuous mode
Jan 08 13:42:43 redux zed[27886]: eid=144 class=data pool='incus' priority=0 err=5 flags=0x80 bookmark=428:131:0:0
Jan 08 13:42:43 redux systemd[1]: var-lib-incus-devices-blah-config.mount.mount: Deactivated successfully.
Jan 08 13:42:43 redux systemd-networkd[1208]: tap2ebb8fe1: Link UP
Jan 08 13:42:43 redux kernel: tap2ebb8fe1: left allmulticast mode
Jan 08 13:42:43 redux kernel: tap2ebb8fe1: left promiscuous mode
Jan 08 13:42:43 redux kernel: br500: port 2(tap2ebb8fe1) entered disabled state
Jan 08 13:42:43 redux systemd-networkd[1208]: tap2ebb8fe1: Link DOWN
Jan 08 13:42:43 redux systemd[1]: var-lib-incus-storage\x2dpools-local-virtual\x2dmachines-blah.mount: Deactivated successfully.
Jan 08 13:42:44 redux incusd[1903]: time="2025-01-08T13:42:44Z" level=error msg="Failed migration on target" clusterMoveSourceName=blah err="Failed to run: forklimits limit=memlock:unlimited:unlimited fd=3 fd=4 -- /opt/incus/bin/qemu-system-x86_64 -S -name blah -uuid 925ad85e-9c6f-46a9-b486-d9c2b1941556 -daemonize -cpu kvm64,tsc_scale,perfctr_core,rdseed,smap,sse4a,smep,clflushopt,clwb,flushbyasid,pdpe1gb,rdtscp,rdrand,popcnt,osvw,svm_lock,vmcb_clean,sse4_2,3dnowprefetch,stibp,bmi1,misalignsse,movbe,lahf_lm,xsaveopt,svm,extapic,wdt,tce,avx2,avic,ssse3,ibpb,xgetbv1,xsaveerptr,mmxext,ibrs,clzero,vgif,cmp_legacy,ssbd,pclmulqdq,fma,topoext,bmi2,adx,umip,npt,ht,skinit,rdpid,arat,f16c,xsaves,lbrv,decodeassists,perfctr_nb,fsgsbase,pfthreshold,monitor,xsave,avx,aes,xsavec,nrip_save,sse4_1,abm,fxsr_opt,ibs,wbnoinvd -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /run/incus/blah/qemu.conf -spice unix=on,disable-ticketing=on,addr=/run/incus/blah/qemu.spice -pidfile /run/incus/blah/qemu.pid -D /var/log/incus/blah/qemu.log -incoming defer -smbios type=2,manufacturer=LinuxContainers,product=Incus: : exit status 1" instance=blah live=true project=default push=false
So it seems the transferred ZFS dataset is corrupt and the migration fails as a result. A zpool status reports no issues, likely because the dataset no longer exists: Incus removed it when the migration failed.
# zpool status
  pool: incus
 state: ONLINE
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Wed Jan  8 11:00:42 2025
config:

        NAME                                          STATE     READ WRITE CKSUM
        incus                                         ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            nvme-WDC_WDS200T2B0C-00PXH0_2114G2448408  ONLINE       0     0     0
            nvme-WD_Red_SN700_2000GB_21330M800791     ONLINE       0     0     0
          mirror-1                                    ONLINE       0     0     0
            nvme-WD_Red_SN700_2000GB_23202J800063     ONLINE       0     0     0
            nvme-WD_Red_SN700_2000GB_24370G800955     ONLINE       0     0     0

errors: No known data errors
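Since zed recorded the failures as pool events (the eid= lines above), they should still be retrievable even though the dataset itself is gone, for example with:
# zpool events -v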
However, migrating a VM in a stopped state (both with and without the state configuration applied) works without issue.
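To isolate whether the raw block transfer is at fault, one test I have in mind is reproducing the transfer manually with zfs send/receive and then reading the received zvol back. The dataset names below are assumptions based on Incus's usual naming for VM block volumes; I have not verified them:
# zfs snapshot incus/virtual-machines/blah.block@migtest
# zfs send incus/virtual-machines/blah.block@migtest | ssh redux zfs receive -F incus/migtest
# ssh redux dd if=/dev/zvol/incus/migtest of=/dev/null bs=1M status=none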
The hosts are identical and use the incus, kernel, and zfs packages from the Zabbly repository. Details:
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.1 LTS
Release: 24.04
Codename: noble
# uname -a
Linux noctua 6.12.8-zabbly+ #ubuntu24.04 SMP PREEMPT_DYNAMIC Fri Jan 3 17:01:02 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
# zfs version
zfs-2.2.7-1
# incus --version
6.8
# cat /proc/cpuinfo | grep "model name" | tail -1
model name : AMD EPYC 7302 16-Core Processor
Does anybody have any ideas on how this data corruption issue can be resolved?