Anyone have any thoughts on what might cause this … experienced during a container restart … the container doesn’t restart, and then neither will incus …
Jun 09 10:05:50 lite kernel: INFO: task incusd:2939 blocked for more than 362 seconds.
Jun 09 10:05:50 lite kernel: Tainted: P O 6.12.25-v8-16k #1
Jun 09 10:05:50 lite kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 09 10:05:50 lite kernel: task:incusd state:D stack:0 pid:2939 tgid:2847 ppid:1 flags:0x0000000c
Jun 09 10:05:50 lite kernel: Call trace:
Jun 09 10:05:50 lite kernel: __switch_to+0xf0/0x150
Jun 09 10:05:50 lite kernel: __schedule+0x38c/0xdd8
Jun 09 10:05:50 lite kernel: schedule+0x3c/0x148
Jun 09 10:05:50 lite kernel: grab_super+0x158/0x1c0
Jun 09 10:05:50 lite kernel: sget+0x150/0x268
Jun 09 10:05:50 lite kernel: zpl_mount+0x134/0x2f8 [zfs]
Jun 09 10:05:50 lite kernel: legacy_get_tree+0x38/0x70
Jun 09 10:05:50 lite kernel: vfs_get_tree+0x30/0x100
Jun 09 10:05:50 lite kernel: path_mount+0x410/0xa98
Jun 09 10:05:50 lite kernel: __arm64_sys_mount+0x194/0x2c0
Jun 09 10:05:50 lite kernel: invoke_syscall+0x50/0x120
Jun 09 10:05:50 lite kernel: el0_svc_common.constprop.0+0x48/0xf0
Jun 09 10:05:50 lite kernel: do_el0_svc+0x24/0x38
Jun 09 10:05:50 lite kernel: el0_svc+0x30/0xd0
Jun 09 10:05:50 lite kernel: el0t_64_sync_handler+0x100/0x130
Jun 09 10:05:50 lite kernel: el0t_64_sync+0x190/0x198
Update: it happened again on a different node, possibly associated with:
ovsdb-server[2598]: ovs|00034|raft|INFO|Transferring leadership to write a snapshot.
ovsdb-server[2598]: ovs|00035|raft|INFO|rejected append_reply (not leader)
ovsdb-server[2598]: ovs|00036|raft|INFO|rejected append_reply (not leader)
ovsdb-server[2598]: ovs|00037|raft|INFO|server 18e5 is leader for term 6
And
kernel: eth1: renamed from veth87299ae4
kernel: veth6f6b37eb: renamed from physn1QH2O
incusd[8150]: time="2025-06-09T12:39:03+01:00" level=warning msg="Could not find OVN Switch port associated to OVS interface" device=eth-1 driver=nic instance=kuma interface=vethf03f89dc project=default
This is bad news because it locks up the entire node, which then needs a reboot.
Ok, I’ve not been able to reproduce it “exactly”; however, it tends to happen when I’m changing an instance, either the profile (which implicitly changes the network) or adding / deleting interfaces, predominantly OVN networks and interfaces (examples of the sort of operations are below). Had it four times today so far … the node won’t even shut down, it needs the power button.
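For context, the operations that seem to trigger it are ordinary profile / device changes, roughly along these lines (the profile and network names here are placeholders, not my actual config):

    # switch an instance to a different profile (implicitly swaps its NIC)
    incus profile assign kuma default,ovn-lan
    # or add / remove an OVN-backed interface directly
    incus config device add kuma eth1 nic network=ovn-lan
    incus config device remove kuma eth1
    incus restart kuma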
The error seems to indicate a kernel-level lock on a mount operation.
You can look at ps fauxww on the system and look for processes in D state to get an idea of what’s currently stuck. When you get that kind of message there’s nothing that Incus can do about it; it’s not running again until whatever syscall it’s stuck on finally completes.
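If it helps, a quick way to pull out just the D-state (uninterruptible sleep) tasks, assuming a standard procps ps where STAT is the 8th column:

    # keep the header plus anything whose state starts with D
    ps fauxww | awk 'NR==1 || $8 ~ /^D/'
    # if sysrq is enabled, this dumps the stacks of all blocked tasks to dmesg
    echo w > /proc/sysrq-trigger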
Sure, makes sense … but it doesn’t happen (at all) when I’m not messing with interfaces, which I’m doing through Incus. While I appreciate the problem is at a lower level somewhere in the kernel, it appears to be triggered by Incus’ behaviour … maybe the way or the order in which Incus is adding and removing things.
So this is triggered by stopping / starting instances with static IP addresses where the address is on an OVN network. Changing an address can appear to be the trigger because one tends to restart an instance after setting its address as static.
Avoid restarting containers with pinned IPs (see the sketch below for what I mean by pinned) and the problem vanishes.
Only use dynamically allocated, non-pinned IPs and the problem vanishes.
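For reference, “pinned” here means a static address set directly on the instance’s NIC device, something like this (the instance name is from my setup, the device and address are illustrative):

    # override the profile-inherited NIC and pin its IPv4 address on the OVN network
    incus config device override kuma eth0 ipv4.address=10.37.0.20
    # a restart after this is what seems to provoke the hang
    incus restart kuma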
Can you put together a reproducer with some minimal OVN network setup? That is, likely create a VM in Incus, do the minimal OVN setup inside it, and make the problem appear.
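Something along these lines would probably do as a starting point; all the names, addresses and the image are placeholders, and it assumes ovn-central / ovn-host are installed in the VM and Incus can reach the OVN northbound DB:

    # uplink bridge that the OVN network will attach to
    incus network create uplinkbr --type=bridge \
        ipv4.address=172.31.254.1/24 ipv4.nat=true \
        ipv4.ovn.ranges=172.31.254.100-172.31.254.200
    # minimal OVN network on top of the uplink
    incus network create ovntest --type=ovn network=uplinkbr ipv4.address=10.123.0.1/24
    # container on the OVN network with a pinned address, then restart it
    incus launch images:debian/12 c1 --network ovntest
    incus config device override c1 eth0 ipv4.address=10.123.0.10
    incus restart c1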
In theory I could. This would however involve spare equipment I don’t currently have available and time that I seem to be running out of. I’d like to thank everyone who’s helped thus far but as far as OVN is concerned I’ve now run out of road.
I had a fully functional network yesterday morning that I’d spent 4-5 months building and getting to grips with, and a fully automated reproducible deployment system. A number of operational issues remained that I was working out solutions for, but I was fairly confident I knew what I was doing. Hey, 30+ years with ipv4, I must’ve learnt a little, right?
Then it stopped. Something obviously changed, but I don’t know what. After spending all day trying to recover (and failing), including redeploying, I came to the conclusion that I just can’t afford to run on a platform as fragile as the one I seem to have built, despite the investment and however much it kills me to bin all the work I’ve put in.
Within a matter of hours of making the decision I had a workable alternative that seems to provide all the facilities I was hoping for from OVN, “and” solves all the outstanding operational issues that seemed difficult to solve with OVN. I’ve migrated my important / “live” stuff off onto non-clustered hardware, and I’ll be working on the alternative over the next few days … I doubt it’s going to take too long to implement.
I think from the get-go I misinterpreted what OVN is for vs the alternatives and just went down the wrong route (for me).
So the original problem is not linked to OVN, nor is it linked to a particular kernel version or any kind of exotic configuration. The problem is relatively random and happens (sometimes) on container restarts. It only happens on instances with overridden network parameters.
I’m updating this because I thought it was just happening with overridden IP addresses, but it’s not: overriding the MAC address can also cause the problem, i.e. restarting a container with an overridden address.
I know this technically looks like a ZFS error, but it’s being caused by something very specific that Incus is doing “differently” when restarting containers with overridden network keys (specifically the IP address and hardware address).
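To be explicit about what I mean by an overridden MAC, it’s the hwaddr key on the NIC device, along these lines (the MAC shown is just an example):

    # pin the container's MAC address on its NIC device
    incus config device override kuma eth0 hwaddr=10:66:6a:00:00:20
    # this restart is the operation that occasionally hangs the node
    incus restart kuma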
It’s the worst kind of problem as it locks the entire machine, which means a full reboot. The knock-on effect is that because it locks up Incus, it then blocks nodes in the rest of the cluster if they try to update anything.
If there’s no fix, any thoughts on mitigation would be much appreciated. If I don’t override the hardware address then it looks like I’m getting random MAC addresses which change on each restart (how I’m checking is shown below). This “seems” to be new behaviour, but either way I can’t keep anything stable at the moment … either I’m constantly renumbering, or I override stuff and risk lock-ups …
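For anyone comparing notes, this is roughly how I’m checking whether the MAC survives a restart; it assumes the NIC is eth0 and relies on the volatile.eth0.hwaddr key where Incus records the generated address:

    # note the MAC before and after a restart; if it differs, the instance got a new one
    incus config show kuma | grep volatile.eth0.hwaddr
    incus restart kuma
    incus config show kuma | grep volatile.eth0.hwaddr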
My config is now very simple: local VLANs with my own DHCP server, and remote connectivity is an L2 tunnel using GRE over WireGuard (roughly sketched below). The previous config was OVN with managed networks. Running across half a dozen machines with three different kernel versions (Debian 6.6 => 6.12.25). I’m running an Ubuntu node now just on the off-chance that makes a difference …
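For anyone curious, the replacement L2 link is roughly this shape; the interface names and addresses are placeholders, and the WireGuard tunnel is assumed to already be up with those endpoint addresses:

    # GRETAP gives an Ethernet-level (L2) tunnel; run it over the WireGuard addresses
    ip link add gretap1 type gretap local 10.100.0.1 remote 10.100.0.2
    ip link set gretap1 up
    # add it to the local bridge so both sites share one L2 segment
    ip link set gretap1 master br0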