I am toying around with Incus in my homelab.
I have NVMe-oF + RoCE/RDMA set up on a 40 Gb fabric, two Incus hosts, over a mixture of ConnectX-3/4 cards.
I am using software network bridging for the virtualized NIC.
Cluster communication happens over IPv6 on the software-bridged NIC on each host.
On rare occasions, when I run incus move testws01 --target tiny01 -s vmvg00, the VM instance goes into an ERROR state until the migration completes. Not always, just sometimes. Not even usually.
I don't see anything in dmesg, journalctl -f, incus info testws01 --show-log, or other logs like /var/log/incus.
It seems rare-ish, and happens mid migration.
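To pin down exactly when the state flips, I've been running a small watch loop alongside the move. This is just a sketch using my instance name; it only logs state transitions so the ERROR window is easy to spot:

```shell
#!/bin/sh
# Poll the instance status once a second during a migration and
# log any transitions, so the exact window of the ERROR state is captured.
prev=""
while true; do
    state=$(incus list testws01 -c s -f csv)
    if [ "$state" != "$prev" ]; then
        echo "$(date -Is) state: $state"
        prev="$state"
    fi
    sleep 1
done
```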
I am guessing it has something to do with an I/O timeout somewhere, or cluster communications? The storage fabric and cluster comms share the same Ethernet port.
The systems are Intel N150s, so they are a little stressed doing this.
Are there any adjustable timeouts anywhere for I/O or network comms?
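One server-level knob I'm aware of is cluster.offline_threshold, which sets how long a cluster member can go quiet before it is treated as offline; raising it might give some headroom when storage and cluster traffic share one congested port. A sketch, with an arbitrary example value:

```shell
# Show the current server/cluster configuration for reference.
incus config show

# Raise the member offline threshold (in seconds) so brief stalls on the
# shared storage/cluster port don't mark a host as offline mid-migration.
# 40 is just an example value here, not a recommendation.
incus config set cluster.offline_threshold 40
```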
I am not seeing any pings dropped.
Priority Flow Control is set up on the Cisco Nexus for the storage VLAN. I see storage traffic being bucketed properly, but I did not set any maximum limits on storage I/O anywhere, in QoS or in Incus.
If it falls into the ERROR state, it does recover after the migration is done.
The ERROR state is probably just QEMU being too busy dealing with the live migration to answer our requests.
What we’d want to make sure is that the VM itself isn’t negatively impacted by this.
Can you maybe SSH into the VM and confirm that it's generally responsive, except for the half second or so of the actual cutover?
We could of course paper over this by catching the communication issue during an ongoing migration and reporting RUNNING, but we've generally tried not to do that too much, as it may cover up other, more real issues.
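For that in-guest check, a timestamped probe loop run from a third machine during the migration would show whether the guest stalls for longer than the expected cutover blip. A sketch; the hostname is a placeholder and it assumes the Windows guest has an SSH server enabled:

```shell
#!/bin/sh
# Probe the guest once a second over SSH and record the round-trip time,
# so any stall beyond the brief cutover shows up in the log.
HOST=testws01.lab.example   # placeholder guest address, adjust to your setup
while true; do
    start=$(date +%s%N)
    if ssh -o ConnectTimeout=2 "$HOST" true 2>/dev/null; then
        end=$(date +%s%N)
        echo "$(date -Is) ok $(( (end - start) / 1000000 ))ms"
    else
        echo "$(date -Is) UNREACHABLE"
    fi
    sleep 1
done
```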
The VM does go unresponsive shortly after this message appears: no ping, no remote access. It is Windows Server 2025. It could be QEMU doing it to cut back load and stop something bad from happening; it's as if the VM gets paused when it happens, and it unpauses after the migration finishes.
It seems pretty rare. I moved the cluster communication to the same NIC/VLAN as storage; previously, intra-cluster communication ran over a software bridge on a NIC virtual function. So far it has not popped up again, and I have tossed the VM between two hosts with shared LVM storage ten times or so. It may be that the load on the system was just too high, especially with the blocks moving over a software bridge on 40 Gb NICs on an Intel N150 with NVMe-oF storage. The storage NIC is backed by SR-IOV VFs, which are probably more hardware-offloaded than the bridge.
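For reference, the back-and-forth testing was just a loop along these lines; the second host name is a placeholder for whatever your other cluster member is called:

```shell
#!/bin/sh
# Bounce the VM between the two hosts repeatedly to try to reproduce the
# transient ERROR state; watch the state in another terminal while it runs.
for i in $(seq 5); do
    incus move testws01 --target tiny01 -s vmvg00
    incus move testws01 --target other-host -s vmvg00   # placeholder second host
done
```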
I was just getting comfortable with Incus in a lab environment, wondering whether I would choose it over something like Proxmox in a real production environment with a support contract. It seems pretty solid!