Changing IPs causing system lockups

OK, so I struggled with this for a while trying to narrow down the fault, but I seem to have circumvented it, which makes me think I’ve more or less isolated the problem.

The problem

The system locks up, leaving Incus unresponsive. The only solution is a system reboot.

[61263.129520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[61263.129522] task:incusd          state:D stack:0     pid:12347 ppid:1      flags:0x00000005
[61263.129528] Call trace:
[61263.129530]  __switch_to+0xe0/0x120
[61263.129538]  __schedule+0x37c/0xd60
[61263.129542]  schedule+0x64/0x108
[61263.129545]  grab_super_dead+0xec/0x160
[61263.129552]  sget+0x150/0x208
[61263.129556]  zpl_mount+0x134/0x2f8 [zfs]
[61263.129769]  legacy_get_tree+0x38/0x70
[61263.129774]  vfs_get_tree+0x30/0xf8
[61263.129779]  path_mount+0x410/0xa90
[61263.129783]  __arm64_sys_mount+0x1e8/0x2d0
[61263.129787]  invoke_syscall+0x50/0x128
[61263.129793]  el0_svc_common.constprop.0+0x48/0xf0
[61263.129797]  do_el0_svc+0x24/0x38
[61263.129801]  el0_svc+0x38/0xd0
[61263.129806]  el0t_64_sync_handler+0x100/0x130
[61263.129810]  el0t_64_sync+0x190/0x198

Configuration

Linux kernel: problem present on both the 6.6 and 6.12 branches (ARM64/RPi5)
Running on ZFS / OVN.
Incus: 6.13 (Zabbly)

The problem happens (sometimes, but not always) when changing the IP address of an interface on an OVN network while the instance is running. It seems more likely (purely anecdotal) when you make a mistake, such as trying to re-use an IP that’s already in use, or assigning an address that’s outside the network’s range.

Example:

#!/usr/bin/env bash
# Re-apply the template and pin a new IPv4 address on a running instance.
# $1 = instance name, $2 = new IPv4 address.
incus config set "$1" volatile.apply_template=create
incus config device override "$1" eth-1 ipv4.address="$2"
incus restart "$1"

This script has proven very problematic and seems to have roughly a 1-in-10 chance of locking up the system.
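
In case anyone wants to try reproducing it, I’ve been driving that script roughly like this (the repin.sh name, the c1 instance and the 10.10.10.x subnet are just placeholders for whatever you use):

#!/usr/bin/env bash
# Crude reproducer sketch: re-pin the address repeatedly on a running instance.
# "repin.sh" is the script above; the instance name and subnet are placeholders.
for i in $(seq 101 110); do
    ./repin.sh c1 "10.10.10.${i}"
    sleep 5
done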

If, on the other hand, I do this:

#!/usr/bin/env bash
# Same change, but with the instance stopped first.
# $1 = instance name, $2 = new IPv4 address.
incus stop "$1"
incus config set "$1" volatile.apply_template=create
incus config device override "$1" eth-1 ipv4.address="$2"
incus start "$1"

It would appear I am lock-up free!

I appreciate this looks like a ZFS problem, and indeed similar-looking lock-ups have been documented with Docker and ZFS. However, it would appear that Incus is doing something that specifically triggers it.

Any thoughts, or should I just always make sure the instance is stopped before I mess with the config?

Instance config modifications involve a mount + backup save + unmount of the config volume, so that’s most likely where the zpl_mount is getting stuck.

So yeah, Incus is the one that requested the mount based on the operations you’re doing, but it’s not really an abnormal operation to ask ZFS to mount a dataset :slight_smile:
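
Incidentally, for anyone else hitting this, a quick (non-Incus-specific) way to confirm it’s that mount getting stuck is to look for processes in uninterruptible sleep (state D) and for the hung-task messages in the kernel log:

# Processes stuck in uninterruptible sleep (state D), e.g. incusd blocked in zpl_mount.
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

# The hung-task warnings themselves land in the kernel log.
dmesg | grep -i "blocked for more than"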

Mmm, it’s starting to look like the problem was introduced in ZFS 2.3.1 and it “might” be fixed in “2.3.2” … just wondering whether I should try to roll back to 2.3.0 or pull a bleeding-edge 2.3.2 … :frowning: … it’s becoming a bit unusable as it is …
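
For anyone else in the same boat, it’s worth double-checking which ZFS version is actually in play, since the userspace tools and the loaded kernel module can differ:

# Report the userspace tools and the loaded kernel module versions.
zfs version

# Or read the module version straight from sysfs.
cat /sys/module/zfs/version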

@stgraber, it would “appear” I’ve isolated the problem. Although I accept this looks like a ZFS problem, it only happens when restarting instances with overridden addresses on an OVN network. If I remove all my overrides and make everything dynamic (sketched below), two things happen:

  • instances tend to restart within 1-3 seconds (as opposed to 10-30 seconds)
  • I don’t see any lockups
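
“Removing the overrides” here just means dropping the instance-local copy of the NIC so the profile device, and hence a dynamic address, applies again. Roughly this, with c1 as a placeholder instance name:

# Drop the instance-local eth-1 override so the profile's NIC (dynamic address) applies again.
# "c1" is a placeholder instance name.
incus config device remove c1 eth-1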

A voice in my head tells me the issue is due to a network problem, but I’m not seeing anything else wrong, and there’s live traffic on it from people who would complain if there were a problem.

Could it be that there is a timing/deadlock issue with restarts under certain conditions (i.e. pinned IPs on an OVN network) that is triggering the ZFS problem … and if so, is it fixable?

My current solution is to make everything dynamic (!) while trying to avoid restarts …

I’ve now set up a local nameserver that takes zone transfers from Incus, and I’ve successfully set up zones for my ranges; that all works great, albeit I’m not thrilled about adding DNS lookup overhead to some things. But I have an issue that’s leaving me scratching my head: how to get my nice new dynamically driven DNS “into” all the containers.
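
For context, the zone/transfer setup I’m describing is roughly the following; the zone name, peer name, addresses and network name are all placeholders for my real ones:

# Have Incus serve DNS zone transfers on a dedicated address (placeholder address/port).
incus config set core.dns_address=192.0.2.1:8853

# Create a forward zone and allow the local nameserver to transfer it
# ("ns1" and 192.0.2.53 stand in for the real secondary).
incus network zone create example.internal
incus network zone set example.internal peers.ns1.address=192.0.2.53

# Attach the zone to the OVN network ("ovnnet" is a placeholder).
incus network set ovnnet dns.zone.forward=example.internal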

The only solution I’ve come up with that doesn’t have harmful side effects is to host a few DNS servers on static IPs on the host nodes, then make those IPs visible via static routes in the containers.

Running the DNS instances in containers leaves me needing statics in containers again, and the containers are intentionally (mostly) single-interface so as not to complicate OCI images, so an additional local host bridge isn’t going to work well …

Is there a better trick for containers to get access to Incus’ dynamic DNS that I’m not aware of?

Also …

I notice Incus zones seem to roll their serial number every 2 minutes … is there a reason for this, or is it just for people who haven’t enabled XFR notify?
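
(For reference, I’m just watching the SOA serial from the secondary, roughly like this; the address and zone name are placeholders:)

# Watch the zone's SOA serial change over time (address/zone are placeholders).
dig @192.0.2.53 example.internal SOA +short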

So the solution in this context really does seem to be avoiding IP pinning.